
Absolute assurance of file uniqueness.


  • Absolute assurance of file uniqueness.

    I was reading that it is possible for 2 different files to produce the same MD5 hash.

    As part of my exiftool musings I am making a thing that looks across all the drives on a computer and catalogs the images. Part of that process is dealing with the duplicates. I was thinking MD5 hash and number of file bytes should be a good indicator of uniqueness - also there may be exiftool data that assures uniqueness.

    Any thoughts on MD5 and file uniqueness? I am trying not to trust things like file date and time, unless they come from exiftool data rather than file system data.


  • #2
    Absolute assurance . . .
    No such thing . . . unless you compare whole files.

    "Good enough" depends on purpose. Documents with MD5 collisions have been created, but do they make sense?

    Trust MD5 to verify transfer of $100 million, no.

    A different number of bytes will cause a radically different hash.
    Dale



    • #3
      If speed is not critical, you could try hashing each file twice with different algorithms. The odds of getting non-genuine collisions with two different hashes on the same file would be pretty remote.
      Last edited by Stuart McLachlan; 12 Jan 2021, 03:00 AM.
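A minimal sketch of the two-algorithm idea, assuming Python and its standard `hashlib` module (the thread itself is about PowerBASIC, so the language and the name `dual_digest` are illustrative choices, not anything from the posters). It reads the file once and feeds each chunk to both MD5 and SHA-256, so the second hash costs no extra disk I/O.

```python
import hashlib


def dual_digest(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) for the file at `path`, reading it once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large images never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()
```

Two files would then be treated as probable duplicates only when both digests (and the file sizes) match.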



      • #4
        MD5 has a security level of 64-bit which is woefully weak today. SHA256 has a security level of 128-bit and should hold us in good stead for a few years yet.

        2^128 x 2^128 = 2^256.
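The arithmetic above appears to be the generic birthday bound: for an ideal hash with an $n$-bit output, a collision is expected after roughly $2^{n/2}$ hashes, so the collision security level is half the digest length. As a worked sketch:

```latex
\[
  \text{SHA-256: } \sqrt{2^{256}} = 2^{128},
  \qquad
  \text{MD5: } \sqrt{2^{128}} = 2^{64}
\]
```

These are bounds for an ideal hash. Deliberately constructed MD5 collisions need far less work than $2^{64}$, which matters for security but not for accidental duplicates.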

        Use SHA256.

        If you must use 128-bit, then use MD5 HMAC.




        • #5
          Originally posted by David Roberts View Post
          MD5 has a security level of 64-bit which is woefully weak today. SHA256 has a security level of 128-bit and should hold us in good stead for a few years yet.
          Security level is completely irrelevant when all you are doing is comparing hashes of multiple image files to identify duplicates.
          It doesn't matter if your hash function has zero security. All you need is a fast non-cryptographic hash, something like SeaHash or xxHash, for example. (Actually, since they are about 50 times as fast as MD5/SHA1 et al., you could use both a lot faster than a single cryptographic hash: use one on the first pass and then the other one on any collisions.)
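The "one hash on the first pass, another on any collisions" idea might look like the sketch below. It is written in Python purely for illustration. SeaHash and xxHash are not in the Python standard library, so `zlib.crc32` stands in as the cheap first-pass hash and SHA-256 as the confirming one. All function names are hypothetical.

```python
import hashlib
import zlib
from collections import defaultdict


def _crc32_of(path, chunk_size=1 << 20):
    """Cheap first-pass hash (stand-in for a fast non-cryptographic hash)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def _sha256_of(path, chunk_size=1 << 20):
    """Second-pass hash, computed only for files whose first hash matched."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _group(paths, key):
    """Group paths by key(path) and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def likely_duplicates(paths):
    """Return lists of paths that agree on both hashes."""
    result = []
    for candidates in _group(paths, _crc32_of):
        result.extend(_group(candidates, _sha256_of))
    return result
```

Most files drop out after the cheap pass, so the slower hash only runs on the few that collided.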



          • #6
            Originally posted by Stuart
            It doesn't matter if your hash function has zero security.
            Agreed.

            From the SeaHash website: "It aims to have high quality pseudorandom output and few collisions, as well as being fast."

            With SHA256 the likelihood of a collision is very nearly zero. Nobody has managed to 'manufacture' a collision with MD5 HMAC yet.

            A few years ago, quite a few, a guy on this forum had an application which had worked for years without issue and started to have issues. His latest system had many more files than his old system had when he wrote the application. He was using CRC32. With a stronger hash the problem ceased.
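The CRC32 story is what the birthday approximation predicts once the file count grows. A rough Python sketch, assuming an ideal, uniformly distributed hash (the function name and the file counts used with it are illustrative, not taken from the thread):

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of `num_files` distinct files
    share a hash value, for an ideal hash with `hash_bits` bits of output."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    exponent = -num_files * (num_files - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)
```

With a 32-bit hash such as CRC32, `collision_probability(100_000, 32)` comes out well above one half, while the same file count with a 128-bit hash gives a value that is zero for practical purposes.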



            • #7
              I'm no expert, but don't I remember reading that the chance of a non-intentional MD5 collision is 1 in a quadrillion squared, or something absurd like that?

              I also remember a long-ago forum thread about CRC32. Ah, the good old days.
              "Not my circus, not my monkeys."



              • #8
                As far as I know no one has experienced an MD5 collision in the 'wild' yet.

                We have a similar argument with random number generators. A graphics programmer may want blindingly fast numbers but is not concerned with top drawer randomness. In that case, an LCG will do. On the other hand, someone may want top drawer randomness, in which case an LCG will not do.

                Someone may want a blindingly fast hash function but is not concerned with the odd collision. On the other hand, someone may not want collisions at all.

                If collisions are out of the question then use SHA256. It is as simple as that. Well, not quite - Blake2b is better and Blake3 is better still. Good luck on implementing them.



                • #9
                  I'd certainly understand if SHA256 was used for the software that runs nuclear power stations, but when looking for duplicate image files your first sentence says it all for me. I'd want my PowerBASIC program to run faster.
                  "Not my circus, not my monkeys."



                  • #10
                    You cannot allow perfection to be the enemy of the perfectly acceptable.
                    Michael Mattias
                    Tal Systems (retired)
                    Port Washington WI USA
                    [email protected]
                    http://www.talsystems.com



                    • #11
                      On paper MD5 is faster than SHA256 but in practice an application which is reading many files from a drive will not see the 'paper' difference. AES128 is seven times faster than AES256 on paper. A few years ago I was using AES128 in an application and was streaming large files. Out of curiosity I switched to AES256 and found that the 'edge' of AES128 had been greatly diminished. I stayed with AES256.

                      If David uses the Microsoft APIs he can easily switch from MD5 to SHA256 and may find that the performance hit is nothing like what it says on 'paper'. Bear in mind the title of this thread: "Absolute assurance of file uniqueness."



                      • #12
                        Thanks All! Interesting thoughts. Much to ponder.

                        And yes, I could do a byte by byte comparison in a pinch.
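A byte-by-byte comparison is the only check in this thread that gives an absolute answer, and it is short to write. A hedged Python sketch (the name `files_identical` is hypothetical):

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True only if both files have exactly the same bytes."""
    # Different lengths can never match, so skip reading entirely.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True   # both files reached end-of-file together
```

Python's standard library offers `filecmp.cmp(a, b, shallow=False)` for the same purpose. Running a full comparison only on pairs whose hashes already match keeps the cost low.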



                        • #13
                          You're doing a quick test for equal file length before doing any hash calculations, right?

                          BTW just for fun... I have a number of 19702x10462 TIFF files on my local drive, up to 600 MB each. Photoshop doesn't like them very much.
                          "Not my circus, not my monkeys."



                          • #14
                            "You're doing a quick test for equal file length before doing any hash calculations, right?"

                            Yes!
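The length pre-check can be sketched as a grouping step that runs before any hashing. This is an illustrative Python example, and `group_by_size` is a hypothetical helper name:

```python
import os
from collections import defaultdict


def group_by_size(paths):
    """Map file size -> list of paths, keeping only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot have a duplicate, so it is dropped here
    # and never needs to be read or hashed.
    return {size: group for size, group in by_size.items() if len(group) > 1}
```

Only the groups that survive this step need their contents read, which matters a lot with 600 MB TIFFs.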



                            • #15
                              Originally posted by Eric Pearson View Post
                              You're doing a quick test for equal file length before doing any hash calculations, right?

                              BTW just for fun... I have a number of 19702x10462 TIFF files on my local drive, up to 600 MB each. Photoshop doesn't like them very much.
                              Ouch,

                              I thought my 20013 x 10427 pixel "Admiralty_Chart_No_5308_The_World_Sailing_Ship_Routes,_Published_1946.jpg" was big at 36 MB
                              (but it too is 600 MB in memory when loaded in IrfanView; I guess if I saved it as TIFF it would be the same)




                              • #16
                                In this case you don't need cryptographic strength; you could even use FNV-1a, which is fast and produces small output values (standard implementations go as small as a DWORD). There is a small chance of duplicates, but if you also compare the file size, the potential for collisions drops further.
                                George W. Bleck



                                • #17
                                  Hi George,
                                  What is this FNV-1a ? any code to show how it works?



                                  • #18
                                    Originally posted by Tim Lakinir View Post
                                    Hi George,
                                    What is this FNV-1a ? any code to show how it works?
                                    Try https://tinyurl.com/y4dqv989



                                    • #19
                                      I have these functions from a project and I am sure I got them from the forum so credit is due to someone else...

                                      Code:
                                      FUNCTION FNV32( BYVAL dwOffset AS DWORD, BYVAL dwLen AS DWORD, BYVAL offset_basis AS DWORD ) AS DWORD
                                      #REGISTER NONE
                                        ! mov esi, dwOffset ;esi = ptr to buffer
                                        ! mov ecx, dwLen ;ecx = length of buffer (counter)
                                        ! mov eax, offset_basis ;set to 0 for FNV-0, or 2166136261 for FNV-1
                                        ! mov edi, &h01000193 ;FNV_32_PRIME = 16777619
                                        ! xor ebx, ebx ;ebx = 0
                                        ! test ecx, ecx ;empty buffer?
                                        ! jz fnvdone ;yes: skip the loop and return offset_basis
                                      nextbyte:
                                        ! mul edi ;eax = eax * FNV_32_PRIME (multiply first = FNV-1; FNV-1a xors first)
                                        ! mov bl, [esi] ;bl = byte from esi
                                        ! xor eax, ebx ;al = al xor bl
                                        ! inc esi ;esi = esi + 1 (buffer pos)
                                        ! dec ecx ;ecx = ecx - 1 (counter)
                                        ! jnz nextbyte ;if ecx is not 0, jmp to nextbyte
                                      fnvdone:
                                        ! mov FUNCTION, eax ;function = eax
                                      END FUNCTION
                                      
                                      
                                      
                                      '----------------------------------------------------------------------------(')
                                      
                                      
                                      
                                      FUNCTION String2FNV32( BYVAL strText AS STRING ) AS DWORD
                                        FUNCTION = FNV32( BYVAL STRPTR( strText ), LEN( strText ), 2166136261 )
                                      END FUNCTION
                                      George W. Bleck
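For readers who don't read x86 assembly, here is a hedged Python sketch of the 32-bit FNV variants, using the same offset basis (2166136261) and prime (16777619) as the listing above. The function names are illustrative. The only difference between FNV-1 and FNV-1a is whether the multiply or the XOR comes first.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # 2166136261
FNV32_PRIME = 0x01000193         # 16777619


def fnv1_32(data: bytes) -> int:
    """FNV-1: multiply by the prime, then XOR in the next byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
        h ^= byte
    return h


def fnv1a_32(data: bytes) -> int:
    """FNV-1a: XOR in the next byte, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h
```

Either variant returns the offset basis unchanged for empty input. Hashes produced by the two variants are not interchangeable, so a catalog should use one order consistently.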



                                      • #20
                                        2002 post by Wayne Diamond
                                        2014 post by Wayne Diamond
