I doubt that you will get an affirmative, Bill. CRC-64 never took off. You are not talking cryptographic security, else you wouldn't be using CRC-32. I guess your 'move up' may be best accommodated by MD5 at 128 bits. It has been cryptographically compromised, but used purely as a checksum it may do if you are checking a lot of items; if you only check occasionally, I'd push the boat out to 256 or 512 bits.
Thanks to both of you for your feedback; I'll follow up on Greg's code lead.
I am not interested in this for cryptography but for hashing purposes, to detect (near) duplicates among webpages I have downloaded for a linguistic corpus. CRC-32 has too many collisions. MD5 works fine for smaller datasets, but it has twice the storage and comparison overhead of CRC-64.
Originally posted by Bill Fletcher:
MD5 works fine for smaller datasets, but it has twice the storage and comparison overhead of CRC-64.
If MD5 is fast enough for your purpose, you could simply strip off the bytes of the MD5 hash value that you don't need and keep/use only as many bytes as seem enough to keep collisions rare.
For example, use only the first 8 bytes of the MD5 hash value ... Then you have your 64-bit 'CRC'...
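A minimal sketch of that truncation in Python (the name md5_64 is just illustrative):

    import hashlib

    def md5_64(data: bytes) -> int:
        # Keep only the first 8 bytes of the 16-byte MD5 digest,
        # giving a 64-bit value with the storage cost of a CRC-64.
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")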
Eddy www.devotechs.com -- HIME Huge Integer Math and Encryption library--
This is the code I use for CRC-64 type hashing. Obviously, it is
not modified for you to directly drop in and compile. But consider
it released to the Public Domain for anybody to modify and use as
they see fit.
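For anyone who just wants the general shape of a table-driven CRC-64, here is a generic Python sketch using the ECMA-182 polynomial in reflected form (not the code posted here, just the standard construction):

    POLY = 0xC96C5795D7870F42  # ECMA-182 polynomial, reflected form

    TABLE = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        TABLE.append(crc)

    def crc64(data: bytes, seed: int = 0) -> int:
        # Plain table-driven update; no initial or final inversion,
        # so this is not bit-for-bit identical to CRC-64/XZ.
        crc = seed
        for byte in data:
            crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
        return crc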
I have no clue on how to check for the degree of "collisions" that
this code generates. Greg, Eddy, any hints?
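One crude check I suppose one could run: hash a large batch of distinct random inputs and count how many land on an already-seen value (a rough Python sketch; hash_fn stands for whatever function is under test):

    import random, string

    def collision_count(hash_fn, n: int = 1_000_000) -> int:
        # Hash n random 20-character strings and count how many
        # distinct inputs hit an already-seen hash value.
        seen, collisions = set(), 0
        for _ in range(n):
            s = "".join(random.choices(string.ascii_letters, k=20)).encode()
            h = hash_fn(s)
            if h in seen:
                collisions += 1
            seen.add(h)
        return collisions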
Obviously, this code has been hacked together over a long period of
time, and I haven't cleaned it up (yet).
I intentionally left out the "crchashseed()" code. That code is used
to a minor extent in some of my security-related stuff, and I don't want to
publicly expose how the seed is derived.
Hmmmm... I've been swearing for a long time that I'd go back to Win98SE
in a heartbeat if I had that option (my current box is too modern
for it to be feasible - XP is a much better choice). Now I guess
I should be glad that I can't. Those kinds of "security holes" (can't
think of a proper expression for it) would make me not want to go back anyway.
This is a different version of my crc-64 code that I just whipped
up. I don't know if it's really any better than my original posted
version. The difference is that it allows, and operates on, a 64-bit
seed. So, to my untrained mind, it's closer to a "true" 64-bit
CRC algorithm. Again, the code posted here is released to the Public
Domain, with the assorted notations I made in my first code posting.
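Using the generic sketch from earlier in the thread, the seeded variant might be exercised like this (the seeds are arbitrary illustrative constants):

    page = b"some web page text"
    h1 = crc64(page, seed=0x0123456789ABCDEF)
    h2 = crc64(page, seed=0xFEDCBA9876543210)
    # Different seeds yield different 64-bit values for the same data.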
David, your explanation of the statistical principles behind collisions was most useful. I was puzzled by the fact that the webpage I cite above found (if I recall correctly) 30K collisions in 18M unique documents with CRC32. I assumed it was an artifact of the CRC32 algo. Thanks, and happy (un)birthday!
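Working through the arithmetic from that explanation (assuming uniformly distributed hash values), the birthday approximation predicts about n^2 / 2^(b+1) colliding pairs among n b-bit hashes, which is indeed the right ballpark:

    def expected_collisions(n: int, bits: int) -> float:
        # Birthday approximation: expected colliding pairs among
        # n uniformly distributed hash values of the given width.
        return n * n / 2 ** (bits + 1)

    print(expected_collisions(18_000_000, 32))  # ~37,700 -- close to the reported 30K
    print(expected_collisions(18_000_000, 64))  # ~8.8e-06 -- effectively none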
Eddy, I'll try your suggestion too -- it seems to make sense. Now it all comes down to speed trials!
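For the speed trials, a quick harness in Python (4 KB is an arbitrary stand-in for one downloaded page):

    import timeit, zlib, hashlib

    data = b"x" * 4096  # arbitrary stand-in for one page

    print(timeit.timeit(lambda: zlib.crc32(data), number=100_000))
    print(timeit.timeit(lambda: hashlib.md5(data).digest()[:8], number=100_000))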
I'm not sure yet whether the following has any merit.
(1) Given two messages, whether they be strings, files or whatever: if they are the same, their respective checksums will be the same.
(2) Given two checksums which are different, their respective messages must be different, else we contradict (1).
(3) Given two messages which are different, their respective checksums will be different unless they collide. Equivalently: given two checksums which are the same, their respective messages will be the same unless the checksums have collided.
If (2) obtains then we have a conclusion without further ado.
With (3) we have a problem: we don't know whether a collision has occurred.
Suppose we introduce a second layer of checksums via a different seed.
If this second layer differs then, from (2), the messages differ; the first layer said they matched, so the first layer must have been a collision.
From a computational viewpoint a second layer takes twice the storage, which may be no more than moving to a 'higher' checksum, for example from CRC32 to CRC64. We must also check whether two 'lower' checksum calculations are faster than one 'higher' calculation.
However, much depends upon the environment. If we are checking files and expect few changes, the second layer will be brought into play more often than not. On the other hand, in an environment where we expect many changes, the second layer will be brought into play less often.
This concept is not fool-proof: we could have a second-layer collision that agrees with the first layer, reporting no change when, in fact, there was one.
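As a Python sketch of the layered test, reusing the seeded crc64 sketched earlier in the thread (the seeds are arbitrary illustrative constants). One caveat worth flagging: a plain CRC with a different seed only XORs a length-dependent constant into the result, so for equal-length messages the second seed layer collides exactly when the first does; a genuinely different hash would make a sounder second layer.

    def probably_same(a: bytes, b: bytes) -> bool:
        # Layer 1: differing checksums prove differing messages (point 2).
        if crc64(a, seed=1) != crc64(b, seed=1):
            return False
        # Layer 1 agreed: either the same message or a collision (point 3).
        # Let a second, differently seeded layer arbitrate.
        return crc64(a, seed=2) == crc64(b, seed=2)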
I haven't managed to crack the maths yet: one attempt suggested that the concept pushes the CRC64 envelope, and another suggested that we only halve the collision probability, which, if true, isn't much to write home about.
If the concept has any merit then it could be extended to MD5, SHA-1 and so on. We are not talking different seeds now but different keys, so we are talking HMAC.
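In Python terms, the keyed version of the layering might look like this (the keys are arbitrary placeholders):

    import hmac, hashlib

    page = b"some web page text"
    layer1 = hmac.new(b"key-one", page, hashlib.md5).digest()
    layer2 = hmac.new(b"key-two", page, hashlib.md5).digest()
    # Two keyed digests of the same data; unlike reseeded CRCs,
    # these behave like independent hash functions.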
I never ported it to PB, but doing so should be easy enough
to make experimenting worth the trouble.
In the event someone wants more info on CRC error checking,
(s)he might take a look at a document referenced in the boost.org
article (the article's own link to it is no longer useful):
"A Painless Guide To CRC Error Detection Algorithms",
by Ross Williams, published 19 August 1993. Here is one link
to that document that is still valid: http://www.geocities.com/SiliconVall...s/8659/crc.htm
The article was located by searching Beaucoup.COM, using "CRC".
Although the article doesn't go into CRC-64, it is well worth
spending some time developing an understanding of the principles
and following through the author's development of an example. Having
a foot in the door, so to speak, by understanding the process
will allow one to create the ultimate CRC-xxx (xxx being whatever
width one needs).