Which hash algorithm for checking duplicate URLs?





I am saving URLs in a database, and when I insert a new one, I want to check whether that URL already exists in the database.

A common practice (if I'm not mistaken) is to hash the URLs using MD5, SHA-1, etc., and check that field in the database for duplicates before inserting a new one.

I know MD5 can produce collisions, and so can SHA-1...

What do you suggest for me? My needs are:

DB Size: Eventually 10 to 20 million records in the database

Performance/Speed: Small hash size, so the database will not be under heavy load when checking for duplicates (there is going to be an index on that field, of course)

Tolerance: I don't care if I get 1 collision in every 100,000 records. My needs lean towards performance (small hash) rather than 0% collisions (big hash).

Chance of attack by malformed URLs crafted to produce collisions on purpose: Extremely Low

Maximum damage possible in case of such a successful attack: Extremely Low
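
To put rough numbers on the size and tolerance requirements above, here is a small sketch using the standard birthday approximation. It assumes the hash output is uniformly distributed; the function name `expected_colliding_pairs` and the candidate bit widths are only illustrative choices, not something tied to a particular database or hash.

```python
def expected_colliding_pairs(n: int, bits: int) -> float:
    """Birthday approximation: expected number of colliding pairs
    among n values hashed uniformly into 2**bits buckets."""
    return n * (n - 1) / 2 / 2 ** bits


if __name__ == "__main__":
    records = 20_000_000  # upper end of the stated DB size
    tolerated = records / 100_000  # "1 collision in every 100,000 records"
    for bits in (32, 64, 128):
        pairs = expected_colliding_pairs(records, bits)
        verdict = "within" if pairs <= tolerated else "above"
        print(f"{bits:>3}-bit hash: ~{pairs:.3g} colliding pairs ({verdict} tolerance)")
```

Under these assumptions a 32-bit hash gives on the order of tens of thousands of colliding pairs at 20 million records, far above the stated tolerance, while a 64-bit value (for example, half of an MD5 digest) is expected to give far less than one.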


Do you believe MD5 is enough, or is there something better you would suggest?

Maybe MD5 is even overkill for me, and I could get a serious performance benefit by using something simpler?

Thank you in advance guys!



What about using MD5 or some similar, relatively inexpensive hash (maybe Quark?), and, in the rare case of a hash match, comparing the full URL against the matching entries? That way, most of the time you only pay for the inexpensive hash check, but you also never actually insert a duplicate URL, and a hash collision never causes a new URL to be rejected.
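
A minimal sketch of that idea, assuming SQLite as a stand-in database and a 64-bit prefix of the MD5 digest as the indexed key. The table name `urls`, the column names, and the helpers `url_hash`, `open_db`, and `insert_if_new` are all hypothetical, chosen only for illustration:

```python
import hashlib
import sqlite3


def url_hash(url: str) -> int:
    """Return the first 8 bytes of the MD5 digest as a signed 64-bit
    integer, so it fits a SQLite INTEGER column."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "id INTEGER PRIMARY KEY, url_hash INTEGER NOT NULL, url TEXT NOT NULL)"
    )
    # Non-unique index: two different URLs may legitimately share a hash.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_hash ON urls (url_hash)")
    return conn


def insert_if_new(conn: sqlite3.Connection, url: str) -> bool:
    """Insert url unless it is already stored. Returns True if inserted."""
    h = url_hash(url)
    # Cheap indexed lookup on the small hash column.
    rows = conn.execute("SELECT url FROM urls WHERE url_hash = ?", (h,)).fetchall()
    # Only on a hash match is the full URL compared.
    if any(stored == url for (stored,) in rows):
        return False
    conn.execute("INSERT INTO urls (url_hash, url) VALUES (?, ?)", (h, url))
    conn.commit()
    return True
```

Usage would look like `conn = open_db()` followed by `insert_if_new(conn, "http://example.com/a")`, which returns `True` the first time and `False` on a repeat. Because the index is not unique, a collision only costs one extra string comparison. This sketch does not normalize URLs and is not safe against two writers inserting the same URL concurrently; both would need handling in a real setup.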
