Which hash algorithm for duplicate URL checking?

I am saving URLs in a database, and when I insert a new one, I want to check whether that URL already exists in the database.

A common practice (if I'm not mistaken) is to hash the URLs using MD5, SHA-1, etc., and to check that field in the database for duplicates prior to inserting a new one.
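
For concreteness, here is a minimal sketch of that practice, assuming a hypothetical urls table with an indexed url_md5 column; SQLite and Python's hashlib are only stand-ins for whatever language and database are actually in use:

    # A sketch of "hash the URL, check the indexed hash column, then insert".
    # Table and column names (urls, url_md5) are made up for illustration.
    import hashlib
    import sqlite3

    conn = sqlite3.connect("urls.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS urls (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            url_md5 TEXT NOT NULL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url_md5 ON urls (url_md5)")

    def insert_if_new(url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        # Look up the indexed hash column before inserting.
        exists = conn.execute(
            "SELECT 1 FROM urls WHERE url_md5 = ? LIMIT 1", (digest,)
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO urls (url, url_md5) VALUES (?, ?)", (url, digest)
            )
            conn.commit()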

I know MD5 can produce collisions, and so can SHA-1...

What do you suggest? My needs are:

DB Size: eventually 10 to 20 million records in the database

Performance/Speed: a small hash size, so the database will not take a heavy load when checking for duplicates (there will of course be an index on that field)

Tolerance: I don't care if I get 1 collision for every 100,000 records. My needs lean more towards performance (small hash) than towards 0% collisions (big hash). (See the rough collision estimate after this list.)

Chance of an attack with malformed URLs crafted to produce collisions on purpose: Extremely Low

Maximum damage possible in case of such a successful attack: Extremely Low
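
The rough estimate referenced above is just the standard birthday approximation (expected collisions for n items and a b-bit hash is about n*(n-1)/2^(b+1)), with the 20 million figure taken from the DB Size item; a quick sketch:

    # Birthday-bound estimate of expected collisions for n items and b-bit hashes.
    def expected_collisions(n, bits):
        return n * (n - 1) / (2.0 * 2 ** bits)

    for bits in (32, 64, 128):
        print(bits, expected_collisions(20_000_000, bits))
    # Roughly: 32-bit -> ~4.7e4 collisions, 64-bit -> ~1e-5, 128-bit (MD5) -> ~6e-25.

By this estimate even a 64-bit hash stays far inside the stated tolerance at 20 million records, while a 32-bit hash does not.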

Questions:

Do you believe MD5 is enough (or can you suggest something better)?

Maybe MD5 is even overkill for me, and I could get a real performance benefit by using something simpler?

Thank you in advance guys!

Accepted Answer

What about using MD5 or some similar, relatively inexpensive hash (maybe Quark?), and in the rare case of a collision checking the full URL of the matching entries? That way, most of the time you only pay for the inexpensive hash check, yet you never actually insert a duplicate URL.
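
A minimal sketch of that idea, reusing the hypothetical urls/url_md5 schema from the question and plain MD5 (any similarly cheap hash would do; Quark is not shown here):

    # Cheap hash first; on a hash match, compare full URLs so a rare
    # collision can never cause a duplicate insert or a false "duplicate".
    import hashlib
    import sqlite3

    conn = sqlite3.connect("urls.db")

    def insert_if_new(url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        # Usually zero or one row comes back; a collision makes it more than one.
        rows = conn.execute(
            "SELECT url FROM urls WHERE url_md5 = ?", (digest,)
        ).fetchall()
        if any(existing == url for (existing,) in rows):
            return False  # exact duplicate, skip the insert
        conn.execute(
            "INSERT INTO urls (url, url_md5) VALUES (?, ?)", (url, digest)
        )
        conn.commit()
        return True

If index size matters most, one possible variation is to store only a truncated digest (say the first 8 bytes of the MD5 as a 64-bit integer); the full-URL comparison above still guarantees that no duplicate is ever inserted.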
