Announcement

Collapse
No announcement yet.

Odd behaviour with RANDOMIZE

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Odd behaviour with RANDOMIZE

    I am puzzled by the apparent non-randomness of numbers returned by RND following RANDOMIZE with different seed values. What I cannot seem to find in the documentation is the maximum sensible number for the seed (it just states any number in the helpfile).

    I used the following code to test the output from RND using a range of different seed values. The RND function was used to generate the largest possible range of positive integer values. The actual number generated by each line is indicated as a REM statement at the end of each line, following =>

    Code:
    RANDOMIZE 0: PRINT RND(0, 2147483647)	'=> 0
    RANDOMIZE 1: PRINT RND(0, 2147483647)	'=> 515899392
    RANDOMIZE 2: PRINT RND(0, 2147483647)	'=> 536870912
    RANDOMIZE 3: PRINT RND(0, 2147483647)	'=> 547356672
    RANDOMIZE 4: PRINT RND(0, 2147483647)	'=> 557842432
    RANDOMIZE 5: PRINT RND(0, 2147483647)	'=> 1636827136
    RANDOMIZE 6: PRINT RND(0, 2147483647)	'=> 568328192
    RANDOMIZE 7: PRINT RND(0, 2147483647)	'=> 1647312896
    RANDOMIZE 8: PRINT RND(0, 2147483647)	'=> 578813952
    RANDOMIZE 9: PRINT RND(0, 2147483647)	'=> 1118306304
    RANDOMIZE 10: PRINT RND(0, 2147483647)	'=> 1657798656
    RANDOMIZE 11: PRINT RND(0, 2147483647)	'=> 49807360
    
    RANDOMIZE 1000000: PRINT RND(0, 2147483647)	'=> 954882560
    RANDOMIZE 1000001: PRINT RND(0, 2147483647)	'=> 2033089064
    RANDOMIZE 1000002: PRINT RND(0, 2147483647)	'=> 963811920
    RANDOMIZE 1000003: PRINT RND(0, 2147483647)	'=> 2042018424
    RANDOMIZE 1000004: PRINT RND(0, 2147483647)	'=> 972741280
    RANDOMIZE 1000005: PRINT RND(0, 2147483647)	'=> 2050947784
    RANDOMIZE 1000006: PRINT RND(0, 2147483647)	'=> 981670640
    RANDOMIZE 1000007: PRINT RND(0, 2147483647)	'=> 2059877144
    RANDOMIZE 1000008: PRINT RND(0, 2147483647)	'=> 990600000
    RANDOMIZE 1000009: PRINT RND(0, 2147483647)	'=> 2068806504
    RANDOMIZE 1000010: PRINT RND(0, 2147483647)	'=> 999529360
    RANDOMIZE 1000011: PRINT RND(0, 2147483647)	'=> 2077735864
    
    RANDOMIZE 10000000: PRINT RND(0, 2147483647)	'=> 1821014080
    RANDOMIZE 10000001: PRINT RND(0, 2147483647)	'=> 1888401987
    RANDOMIZE 10000002: PRINT RND(0, 2147483647)	'=> 1955789893
    RANDOMIZE 10000003: PRINT RND(0, 2147483647)	'=> 2023177800
    RANDOMIZE 10000004: PRINT RND(0, 2147483647)	'=> 2090565706
    RANDOMIZE 10000005: PRINT RND(0, 2147483647)	'=> 10469965
    RANDOMIZE 10000006: PRINT RND(0, 2147483647)	'=> 77857871
    RANDOMIZE 10000007: PRINT RND(0, 2147483647)	'=> 145245778
    RANDOMIZE 10000008: PRINT RND(0, 2147483647)	'=> 212633684
    RANDOMIZE 10000009: PRINT RND(0, 2147483647)	'=> 280021591
    RANDOMIZE 10000010: PRINT RND(0, 2147483647)	'=> 347409497
    RANDOMIZE 10000011: PRINT RND(0, 2147483647)	'=> 414797404
    
    RANDOMIZE 30000000: PRINT RND(0, 2147483647)	'=> 1190114400
    RANDOMIZE 30000001: PRINT RND(0, 2147483647)	'=> 1190114400
    RANDOMIZE 30000002: PRINT RND(0, 2147483647)	'=> 1257502307
    RANDOMIZE 30000003: PRINT RND(0, 2147483647)	'=> 1324890213
    RANDOMIZE 30000004: PRINT RND(0, 2147483647)	'=> 1324890213
    RANDOMIZE 30000005: PRINT RND(0, 2147483647)	'=> 1324890213
    RANDOMIZE 30000006: PRINT RND(0, 2147483647)	'=> 1392278120
    RANDOMIZE 30000007: PRINT RND(0, 2147483647)	'=> 1459666026
    RANDOMIZE 30000008: PRINT RND(0, 2147483647)	'=> 1459666026
    RANDOMIZE 30000009: PRINT RND(0, 2147483647)	'=> 1459666026
    RANDOMIZE 30000010: PRINT RND(0, 2147483647)	'=> 1527053933
    RANDOMIZE 30000011: PRINT RND(0, 2147483647)	'=> 1594441839
    So a few oddities. Firstly, seed values of >30,000,000 often produce "random" numbers that are actually the same. Secondly, numbers produced using seed values between 10,000,000 and 10,000,011 show two consecutive incrementing series, while those between 1,000,000 and 1,000,011 show two interleaved incrementing series. Only the values from 1-11 seem to produce random numbers. The question is what upper limit of the seed value would return a random series of integers from RND?

    I know what I am doing looks odd, but basically, I want to convert one number (used as a seed) irreversibly into a defined "seemingly random" second number (the integer value generated by RND), and I'm not sure what a sensible upper limit for the seed value should be.

    Peter

  • #2
    Try
    RANDOMIZE
    rndvar= RND(0,maxlong)
    RANDOMIZE rndvar
    mynewrndvar=RND(0,maxlong)

    I don't quite understand what you're trying to do here, but ANY number can serve as a seed for the RANDOMIZE statement. If you use the same number you will always get the same series of 'random' numbers.
    The question is what upper limit of the seed value would return a random series of integers from RND?
    The most random-like series you will get from RANDOMIZE you will get without the number, using the TIMER as a seed.(Opinions to the contrary ignored.)

    What do you mean by irreversible? The seed? or the generated series?

    Rod
    Rod
    In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

    Comment


    • #3
      Originally posted by Peter Simmonds View Post

      I know what I am doing looks odd, but basically, I want to convert one number (used as a seed) irreversibly into a defined "seemingly random" second number (the integer value generated by RND), and I'm not sure what a sensible upper limit for the seed value should be.
      I assume you want this to be reproducible. i.e. using the same seed value produces the same result every time.

      It seems the series (the first 5 numbers) is identical as well and assume the entire series is identical for these randomize seed values.

      The first question is do you need the output result to be 100% unique? In other words, for each input number does it require a unique output or can there be some duplicate? It seems randomize has too much duplication, but there are algorithms that can be used for more randomness (pseudo randomness).

      One solution, I learned this from Steve Gibson (grc.com) Security now podcast, is to use the Rijndael algorithm (AES encryption algorithm). Steve just uses a counter starting at 1 to produce random numbers when needed. Rijndael produces quite random output. There is the possibility that two different inputs will product the same result, but this is very rare. Rijndael or similar algorithm may give better results than the rnd function using a known seed.

      Search the forums, the Rijndael algorithm has been posted before.

      Another solution is to use Randomize, but perform some algorithm on the number first. Something similar to the following. The following does produce negative numbers due to adding overflow of the result.

      Code:
      #COMPILE EXE
      #DIM ALL
      #REGISTER NONE
      
      FUNCTION RND2(BYVAL seed AS LONG, BYVAL n1 AS LONG, BYVAL n2 AS LONG) AS LONG
      
          LOCAL i AS LONG
          LOCAL RESULT AS LONG
      
          LOCAL seed1 AS LONG
          LOCAL seed2 AS LONG
          LOCAL result1 AS LONG
          LOCAL result2 AS LONG
          
          seed1 = HI(WORD, seed)
          seed2 = LO(WORD, seed)
      
          RANDOMIZE seed1
          result1 = RND (n1, n2)
          
          RANDOMIZE seed2
          result2 = RND(n1, n2)
      
          FUNCTION = RESULT1 + result2 ' result can overflow producing negative numbers
      
      END FUNCTION
      
      FUNCTION PBMAIN () AS LONG
          ' => correct results not shown
          RANDOMIZE 0: PRINT RND2(0, 0, 2147483647)   '=> 0
          RANDOMIZE 1: PRINT RND2(1, 0, 2147483647)   '=> 515899392
          RANDOMIZE 2: PRINT RND2(2, 0, 2147483647)   '=> 536870912
          RANDOMIZE 3: PRINT RND2(3, 0, 2147483647)   '=> 547356672
          RANDOMIZE 4: PRINT RND2(4, 0, 2147483647)   '=> 557842432
          RANDOMIZE 5: PRINT RND2(5, 0, 2147483647)   '=> 1636827136
          RANDOMIZE 6: PRINT RND2(6, 0, 2147483647)   '=> 568328192
          RANDOMIZE 7: PRINT RND2(7, 0, 2147483647)   '=> 1647312896
          RANDOMIZE 8: PRINT RND2(8, 0, 2147483647)   '=> 578813952
          RANDOMIZE 9: PRINT RND2(9, 0, 2147483647)   '=> 1118306304
          RANDOMIZE 10: PRINT RND2(10, 0, 2147483647)  '=> 1657798656
          RANDOMIZE 11: PRINT RND2(11, 0, 2147483647)  '=> 49807360
      
          PRINT RND2(1000000, 0, 2147483647)     '=> 954882560
          PRINT RND2(1000001, 0, 2147483647)     '=> 2033089064
          PRINT RND2(1000002, 0, 2147483647)     '=> 963811920
          PRINT RND2(1000003, 0, 2147483647)     '=> 2042018424
          PRINT RND2(1000004, 0, 2147483647)     '=> 972741280
          PRINT RND2(1000005, 0, 2147483647)     '=> 2050947784
          PRINT RND2(1000006, 0, 2147483647)     '=> 981670640
          PRINT RND2(1000007, 0, 2147483647)     '=> 2059877144
          PRINT RND2(1000008, 0, 2147483647)     '=> 990600000
          PRINT RND2(1000009, 0, 2147483647)     '=> 2068806504
          PRINT RND2(1000010, 0, 2147483647)     '=> 999529360
          PRINT RND2(1000011, 0, 2147483647)     '=> 2077735864
      
           PRINT RND2(10000000,0, 2147483647)    '=> 1821014080
           PRINT RND2(10000001,0, 2147483647)    '=> 1888401987
           PRINT RND2(10000002,0, 2147483647)    '=> 1955789893
           PRINT RND2(10000003,0, 2147483647)    '=> 2023177800
           PRINT RND2(10000004,0, 2147483647)    '=> 2090565706
           PRINT RND2(10000005,0, 2147483647)    '=> 10469965
           PRINT RND2(10000006,0, 2147483647)    '=> 77857871
           PRINT RND2(10000007,0, 2147483647)    '=> 145245778
           PRINT RND2(10000008,0, 2147483647)    '=> 212633684
           PRINT RND2(10000009,0, 2147483647)    '=> 280021591
           PRINT RND2(10000010,0, 2147483647)    '=> 347409497
           PRINT RND2(10000011,0, 2147483647)    '=> 414797404
      
           PRINT RND2(30000000,0, 2147483647)    '=> 1190114400
           PRINT RND2(30000001,0, 2147483647)    '=> 1190114400
           PRINT RND2(30000002,0, 2147483647)    '=> 1257502307
           PRINT RND2(30000003,0, 2147483647)    '=> 1324890213
           PRINT RND2(30000004,0, 2147483647)    '=> 1324890213
           PRINT RND2(30000005,0, 2147483647)    '=> 1324890213
           PRINT RND2(30000006,0, 2147483647)    '=> 1392278120
           PRINT RND2(30000007,0, 2147483647)    '=> 1459666026
           PRINT RND2(30000008,0, 2147483647)    '=> 1459666026
           PRINT RND2(30000009,0, 2147483647)    '=> 1459666026
           PRINT RND2(30000010,0, 2147483647)    '=> 1527053933
           PRINT RND2(30000011,0, 2147483647)    '=> 1594441839
      
      END FUNCTION

      Comment


      • #4
        Peter,
        these links may help to explain what's happening:

        User to user discussions about the PB/Win (formerly PB/DLL) product line. Discussion topics include PowerBASIC Forms, PowerGEN and PowerTree for Windows.


        Paul.

        Comment


        • #5
          Peter--

          The secret? Don't execute RANDOMIZE 2,000 times. One, and only one, RANDOMIZE followed by lots of RND() calls. It won't repeat for 2^32 references..

          Best regards,

          Bob Zale
          PowerBASIC Inc.

          Comment


          • #6
            Peter--

            If you're extraordinarily concerned about an even distribution of the RANDOMIZE seeds, start with a long integer seed (+-2^31), then scale it to a single precision float and use that value with RANDOMIZE:

            x& = Long integer seed
            FloatSeed! = CVS(MKL$(x&))
            RANDOMIZE FloatSeed!

            Best regards,

            Bob Zale
            PowerBASIC Inc.

            Comment


            • #7
              Thanks, a very active forum again!

              Just to clarify, I want to convert one number in an irreversible way. Its use is actually to convert a 10 digit patient hospital number to a coded number that cannot be reversed, and which would therefore protect confidentiality. I was really just experimenting with RANDOMIZE and RND to see whether using the patient code as a large integer seed value would generate a large integer random number that would serve as the anonymized code. I accept that the method will not produce unique numbers, but with only 10,000 patient codes and potentially over a billion output number to choose from, a duplicate code is unlikely (and not necessarily a complete disaster)

              The limitations are that the maximum integer value obtainable from RND is 2147483647, not quite the ten random digits I would need. Secondly, from the example below it seems as though seed values considerably lower than 32 bits (4,294,967,296) mentioned in the thread:



              do not generate properly random numbers, the values become grainy and unpredictable.

              In the end I split the seed number into two sets of five digits, used them as seeds and combined the lower 5 digits of the number generated by RND from each to produce the required 10 digit output. This seems to work, although there is a remarkable regular but complex (odd) relationship between seed number and RND output if plotted graphically. I would make a very nice wallpaper pattern!

              I'm still working my way through the long thread:



              Thanks again

              Peter

              PS A quick way to check for duplicate number outputs is to import the results into a spreadsheet and sort them numerically, and have a look down the list. Of course, this is limited to 2^16 numbers in Excel. Just a slightly different question, why is Excel limited to grids of 255 x 65536, which is hopelessly limiting for many purposes? Are there alternatives for Windows?

              Comment


              • #8
                Thanks, a very active forum again!

                Just to clarify, I want to convert one number in an irreversible way. Its use is actually to convert a 10 digit patient hospital number to a coded number that cannot be reversed, and which would therefore protect confidentiality. I was really just experimenting with RANDOMIZE and RND to see whether using the patient code as a large integer seed value would generate a large integer random number that would serve as the anonymized code. I accept that the method will not produce unique numbers, but with only 10,000 patient codes and potentially over a billion output number to choose from, a duplicate code is unlikely (and not necessarily a complete disaster)

                The limitations are that the maximum integer value obtainable from RND is 2147483647, not quite the ten random digits I would need. Secondly, from the example below it seems as though seed values considerably lower than 32 bits (4,294,967,296) mentioned in the thread:



                do not generate properly random numbers, the values become grainy and unpredictable.

                In the end I split the seed number into two sets of five digits, used them as seeds and combined the lower 5 digits of the number generated by RND from each to produce the required 10 digit output. This seems to work, although there is a remarkable regular but complex (odd) relationship between seed number and RND output if plotted graphically. I would make a very nice wallpaper pattern!

                I'm still working my way through the long thread:



                Thanks again

                Peter

                PS A quick way to check for duplicate number outputs is to import the results into a spreadsheet and sort them numerically, and have a look down the list. Of course, this is limited to 2^16 numbers in Excel. Just a slightly different question, why is Excel limited to grids of 255 x 65536, which is hopelessly limiting for many purposes? Are there alternatives for Windows?

                Comment


                • #9
                  "Just to clarify, I want to convert one number in an irreversible way. Its use is actually to convert a 10 digit patient hospital number to a coded number that cannot be reversed, and which would therefore protect confidentiality."

                  I think what you're looking for is a secure hash. Go to http://www.pbcrypto.com/ Archives and get SHA-1 and SHA-256. Also get FIPS180-2 from NIST which is the official description of the SHAs.

                  Cheers,
                  Dale

                  Comment


                  • #10
                    I added comments and indentation to the SHA-1 code from PBCRYPTO a few months ago. Now would be a good time to share it I guess, so posted in Source Code Forum at:

                    PowerBASIC and related source code. Please do not post questions or discussions, just source code.


                    Later,
                    Dale

                    Comment


                    • #11
                      Originally posted by Peter Simmonds View Post
                      Secondly, from the example below it seems as though seed values considerably lower than 32 bits (4,294,967,296) mentioned in the thread:
                      That's why I gave you the formula to use.

                      Comment


                      • #12
                        While the method you described should work and give you a high level of confidence that the result is unique, the purpose you mentioned puts a different solution in mind. If you merely take the patient code and add it to an indexed table, then the first patient would be 00000001, the second would be 00000002, the third 00000003, and so on. To know what patient 00000035 would be, you would have to have access to the indexed table itself to look up the patient code associated with that index value. You get a fairly good level of isolation as long as the contents of the indexed table are restricted.

                        Other techniques can be used as well. For instance, some approaches incorporate a date reference, just in case they ever have to start over. So for 03.18.2008, you might have 08031801. This would serve for up to 99 patients added on 08/03/01. Another digit, and you could have ten times as many patients admitted that day. However, this method might mean creating a second reference in the table to track the assigned number. But you would have to do that anyway, if you relied on a created random value.

                        Another approach is to use substitutions to help mask the basis for your numbering system. For instance, there are only 12 months in a year, and 26 letters. You could replace the 08031801 with 08C18001, where the C represents 03, and get an extra character for the possibility of 999 patients admitted that day. If you also wanted to represent the day value, you can combine Alphas and Numberics so that you could count 1 to 9, then A to V to represent days 1 to 31. Or you could start with A to Z, then from 1 to 5 to go the other way.

                        As long as you are expanding the table of patient codes and adding indexes, and then encoding the index in some manner that makes sense to you but is hidden to others, you can create a good system where there is no possibility of any duplicates at all. That's better than you can do with either a random or pseudo random generator. You could of course put the random numbers into a table to ensure no duplicates ever come up, but the random process does not give you any useful info about the patient, whereas your method of creating a numeric reference could impart some significant information - in this case, when the patient first became identified with your facility.

                        You might also consider if you need to have the same patient under different numbers in your table. Either you can search to see if the patient number is already entered and return that index, or you can add the patient number again and return the new index. The first method allows only one number to be associated with the patient, and also indicates the first visit ever. The second method allows each patient visit to become part of the table, since each assigned number is used to record the date of that visit. You also have the option of flagging when a patient's number is no longer active. If it is strickly a number value, you could make it negative when the patient is no longer being reported on under that code. Or you could use numbers that have a decimal component, and the decimal value indicates something special by itself.

                        The possibilities are endless, and it really raises the question of why resort to random values when real values can convey useful info to those that understand the method used in creating them?

                        Comment


                        • #13
                          Its use is actually to convert a 10 digit patient hospital number to a coded number that cannot be reversed, and which would therefore protect confidentiality.
                          Don't. do. that.

                          Unless you're a Crypto Guru or the purpose is trivial data (which patient hospital number are definately not) using a home grown algorithm is a bad idea. This is for a reason called "Security through obscurity".

                          Instead, use open, thoroughly tested standard algortihms.

                          As Dale already pointed out, you'll find a good selection of encryption/hash algorithms ported to PB over at http://www.pbcrypto.com.
                          Last edited by Knuth Konrad; 19 Mar 2008, 04:53 AM.

                          Comment


                          • #14
                            Peter, here is an implementation of the SHA1 algorithm from Dale's post above. I got duplicates only 1 in 30 or 40 runs over hundreds of runs, using 10,000 patient code datasets, and those dupes may have even been due to duplicate patient codes, not collisions of the hash values from different patient codes.
                            See code at http://www.powerbasic.com/support/pb...450#post279450 for a full description.

                            Note: added endianness fix from Dale's new source code post. 3-22-08
                            Code:
                            #COMPILE EXE
                            #DIM ALL
                            #REGISTER NONE 'because we're using our own asm here and it's possible there could be a conflict
                            
                            SUB CalcSHA1 ALIAS "CALCULATE_SECURE_HASH_ALGORITHM_1" (BYREF Str AS STRING, _
                                                                         BYREF SH_Out AS STRING * 20) EXPORT
                              #REGISTER NONE '- - - - - - - - - - - - - -
                              LOCAL lStr AS LONG '[l message in bits], here as length in bytes
                              DIM nq AS LONG '[N] number of blocks in Message (padded)
                              LOCAL n AS LONG 'Nunber of dwords in Message
                              LOCAL adrW, adrWW AS DWORD 'pointers to Message and Schedule arrays.
                              'local SH as DWrd5_Str_Union '[H0 - H4] Hash Values as dword or string
                              LOCAL H0, H1, H2, H3, H4 AS DWORD
                              LOCAL pSH_Out AS DWORD POINTER
                              DIM W(0 TO 79) AS DWORD '[W] Message Schedule {colon to TO}
                              LOCAL A, B, C, D, E AS DWORD '[working variables]
                              LOCAL TEMP AS LONG '[TEMP]
                              '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                              'Pad last block of 512 bits (is 64 bytes or 16 dwords)
                              lStr = LEN(Str)
                              nq = FIX((lStr + 8) / 64) + 1 'calc number of blocks
                              n = 16 * nq 'calc number of dwords in WW()
                              REDIM WW(0 TO n - 1) AS DWORD '[M] The Message, last block padded with "0"s.
                            
                              WW(n - 1) = lStr * 8 'Put count of last block Message bits in last dword.
                              adrW = VARPTR(W(0))
                              adrWW = VARPTR(WW(0))
                              A = STRPTR(Str)
                              'Labels CalcSHA_Lbl01 thru CalcSHA_Lbl04 used to form loops.
                              'Labels CalcSHA_Lbl05 thru CalcSHA_Lbl08 are jump targets, conditionally
                              'skipping sections of code (all jumps forward).
                              'save current EDI and ESI to stack before using those registers
                              ! PUSH EDI
                              ! PUSH ESI
                            
                              ! MOV EDI, adrWW
                              ! MOV ESI, A
                              ! MOV ECX, lStr
                              ! REP MOVSB 'repeat string operation
                              'Append a "1" bit after last Message bit in last block
                              ! MOV CL, &H80
                              ! MOV [EDI], CL
                              'Loop 1 setup
                              ! MOV EDI, adrWW
                              ! MOV ECX, 2
                              'Start Loop 1, copy Str to WW()
                              CalcSHA_Lbl01:
                                ! MOV AX, [EDI]
                                ! MOV DX, [EDI + 2]
                                ! MOV [EDI], DH
                                ! MOV [EDI + 1], DL
                                ! MOV [EDI + 2], AH
                                ! MOV [EDI + 3], AL
                                ! ADD EDI, 4
                                ! INC ECX
                                ! CMP ECX, n
                                ! JNE CalcSHA_Lbl01
                              'End Loop 1
                              'set initial hash value
                              ! MOV H0, &H67452301&
                              ! MOV H1, &HEFCDAB89&
                              ! MOV H2, &H98BADCFE&
                              ! MOV H3, &H10325476&
                              ! MOV H4, &HC3D2E1F0&
                              'Start Loop 2, move a block from Message into Message schedule
                              CalcSHA_Lbl02:
                                ! MOV EDI, adrW
                                ! MOV ESI, adrWW
                                ! MOV ECX, 64
                                ! REP MOVSB
                                ! MOV adrWW, ESI
                                'Loop 3 setup
                                ! MOV ECX, 0
                                'Start Loop 3
                                CalcSHA_Lbl03:
                                  ! MOV ESI, ECX
                                  ! ADD ESI, ESI
                                  ! ADD ESI, ESI
                                  ! ADD ESI, adrW
                            
                                  ! MOV EAX, [ESI + 52]
                                  ! XOR EAX, [ESI + 32]
                                  ! XOR EAX, [ESI + 8]
                                  ! XOR EAX, [ESI]
                            
                                  ! MOV EDX, EAX
                                  ! SHL EAX, 1
                                  ! SHR EDX, 31
                                  ! OR  EAX, EDX
                                  ! MOV [ESI + 64], EAX
                            
                                  ! INC ECX
                                  ! CMP ECX, 64
                                  ! JNE CalcSHA_Lbl03
                                'End Loop 3
                                ! MOV EAX, H0
                                ! MOV A, EAX
                                ! MOV EAX, H1
                                ! MOV B, EAX
                                ! MOV EAX, H2
                                ! MOV C, EAX
                                ! MOV EAX, H3
                                ! MOV D, EAX
                                ! MOV EAX, H4
                                ! MOV E, EAX
                                'Loop 4 setup
                                ! MOV EDI, 0
                                'Start Loop 4
                                CalcSHA_Lbl04:
                                  ! CMP EDI, 19
                                  ! JA CalcSHA_Lbl05
                            
                                  ! MOV ECX, B
                                  ! AND ECX, C
                                  ! MOV EAX, B
                                  ! NOT EAX
                                  ! AND EAX, D
                                  ! OR  ECX, EAX
                                  ! ADD ECX, &H5A827999&
                                  ! JMP CalcSHA_Lbl08
                            
                                  CalcSHA_Lbl05:
                                  ! CMP EDI, 39
                                  ! JA CalcSHA_Lbl06
                            
                                  ! MOV ECX, B
                                  ! XOR ECX, C
                                  ! XOR ECX, D
                                  ! ADD ECX, &H6ED9EBA1&
                                  ! JMP CalcSHA_Lbl08
                            
                                  CalcSHA_Lbl06:
                                  ! CMP EDI, 59
                                  ! JA CalcSHA_Lbl07
                            
                                  ! MOV EAX, B
                                  ! AND EAX, C
                                  ! MOV ECX, B
                                  ! AND ECX, D
                                  ! MOV EDX, C
                                  ! AND EDX, D
                                  ! OR  ECX, EAX
                                  ! OR  ECX, EDX
                                  ! ADD ECX, &H8F1BBCDC&
                                  ! JMP CalcSHA_Lbl08
                            
                                  CalcSHA_Lbl07:
                                  ! MOV ECX, B
                                  ! XOR ECX, C
                                  ! XOR ECX, D
                                  ! ADD ECX, &HCA62C1D6&
                            
                                  CalcSHA_Lbl08:
                                  ! MOV EAX, A
                                  ! MOV EDX, EAX
                                  ! SHL EAX, 5
                                  ! SHR EDX, 27
                                  ! OR  EAX, EDX
                                  ! ADD EAX, E
                                  ! ADD ECX, EAX
                            
                                  ! MOV ESI, EDI
                                  ! ADD ESI, ESI
                                  ! ADD ESI, ESI
                                  ! ADD ESI, adrW
                                  ! MOV ESI, [ESI]
                                  ! MOV TEMP, ESI
                            
                                  ! ADD Temp, ECX
                                  ! MOV EAX, D
                                  ! MOV E, EAX
                                  ! MOV EAX, C
                                  ! MOV D, EAX
                                  ! MOV EAX, B
                                  ! MOV EDX, EAX
                                  ! SHL EAX, 30
                                  ! SHR EDX, 2
                                  ! OR  EAX, EDX
                                  ! MOV C, EAX
                                  ! MOV EAX, A
                                  ! MOV B, EAX
                                  ! MOV EAX, TEMP
                                  ! MOV A, EAX
                            
                                  ! INC EDI
                                  ! CMP EDI, 80
                                  ! JNE CalcSHA_Lbl04
                                'End Loop 4
                                ! MOV EAX, A
                                ! ADD H0, EAX
                                ! MOV EAX, B
                                ! ADD H1, EAX
                                ! MOV EAX, C
                                ! ADD H2, EAX
                                ! MOV EAX, D
                                ! ADD H3, EAX
                                ! MOV EAX, E
                                ! ADD H4, EAX
                            
                                ! SUB nq, 1
                                ! JNE CalcSHA_Lbl02 'done if nq = 0
                              'End Loop 2
                              'Byte swap for correct endian sequence <--- {Added}
                              ! MOV EAX, H0
                              ! BSWAP EAX
                              ! MOV H0, EAX
                              ! MOV EAX, H1
                              ! BSWAP EAX
                              ! MOV H1, EAX
                              ! MOV EAX, H2
                              ! BSWAP EAX
                              ! MOV H2, EAX
                              ! MOV EAX, H3
                              ! BSWAP EAX
                              ! MOV H3, EAX
                              ! MOV EAX, H4
                              ! BSWAP EAX
                              ! MOV H4, EAX ; <--- {end of added code}
                              'restore previous ESI and EDI from stack
                              ! POP ESI
                              ! POP EDI
                              'set return hash
                              pSH_Out = VARPTR(SH_Out)
                              @pSH_Out = H0
                              INCR pSH_Out
                              @pSH_Out = H1
                              INCR pSH_Out
                              @pSH_Out = H2
                              INCR pSH_Out
                              @pSH_Out = H3
                              INCR pSH_Out
                              @pSH_Out = H4
                              INCR pSH_Out
                            END SUB
                            '-------------------------------------------------------
                                                                                 'demos secure conversion of 10 digits to 10 secure digits
                                                                                 'note: There is a chance of duplication, ie. two different numbers
                                                                                 'resulting in the same 10-secure-digits result, because we're only
                                                                                 'using a tiny fraction of the 20 bytes of secure hash data.
                            FUNCTION PBMAIN () AS LONG
                               LOCAL patientCode AS STRING
                               LOCAL securePatientCode AS STRING * 20
                               LOCAL ii AS LONG
                               RANDOMIZE
                            
                               FOR ii = 1 TO 10
                                   patientCode = FORMAT$(RND * 100000, "00000") & FORMAT$(RND * 100000, "00000") 'random patient# to test
                                   CalcSHA1(patientCode, securePatientCode)                                      'hash algorithm makes secure conversion.
                            
                                                                                 'Now convert the binary data to 10 numbers. Here I used eight bytes (QUAD)
                                                                                 'from the middle of the securePatientCode to get 10 decimal digits.
                                                                                 'You may easily choose any other of the bytes you wish, or do the
                                                                                 'conversion of 20-bytes-to-ten-digits some other way as you see fit.
                                   securePatientCode = MID$(FORMAT$(CVQ(MID$(securePatientCode, 5, 8)), "000000000000000000"), 3, 10)
                                   ? patientCode & " patient code" & $CRLF & RTRIM$(securePatientCode) & " secure code",,"Ten Examples..."
                               NEXT
                            
                            END FUNCTION
                            Last edited by John Gleason; 22 Mar 2008, 07:30 AM. Reason: added endianness fix from Dale's new source code post

                            Comment


                            • #15
                              Thanks

                              Donald, the number is not designed to record anything about the individual involved, it's generated simply to identify different samples from the same individual (there's a whole lot of other anonymous data recorded along with the transformed hospital number that deals with some of the data you describe).

                              I was unwilling to use any kind of key encryption, because the key would have to be kept, and in principal somebody in the future might decide to use it to recover the original numbers. I have to demonstrate that the link between the hospital code and the new patient identifier is not recoverable. For a similar reason I was unwilling to use look-up tables, as they would similarly have to be kept for ongoing anonymisation and could in theory be used in reverse.

                              At least the method I did use requires no key or tables, and because the data is partially corrupted during conversion the process cannot be reversed even for somebody with a copy of the program and PBCC4.0.

                              However, I do agree entirely with the opinion that it is a much better idea to use a proper program!

                              Peter

                              Comment


                              • #16
                                John

                                Thanks very much for the code conversion, looks like its just the job, and I'll see whether I can get it working.

                                Thanks all for help on this forum.

                                Best wishes

                                Peter

                                Comment


                                • #17
                                  John

                                  'Now convert the binary data to 10 numbers. Here I used eight bytes (QUAD)
                                  'from the middle of the securePatientCode to get 10 decimal digits.

                                  Eight bytes is 64 bits of information. There is, roughly, a 50% chance of collisions beyond 32 bits.

                                  2^32 = 10^9.63 ie 9.63 digits. We are are using 10 so we are pushing our luck.

                                  If we use half of the SHA1, ie 10 bytes, that 9.63 becomes 12.04 digits.

                                  Comment


                                  • #18
                                    Dave, (and Peter), here is the code I used to test for dupe frequency. Technically, I'm only using five bytes in that quad to get my 10-digit secure code. Since it's a 10>>10 direct mapping of digits from plaintext to encoded text, I wonder: does it matter which five of the encoded 20 bytes are used to generate those 10 digits? Or does it matter whether, say, 10 encoded bytes are used, each forming its own encoded digit?

                                    The algorithm below uses the SUB from the previous post. I have to run it several times (and each run is 20 sets of 10,000 codes) just to get a single dupe, and even that dupe may be due to a dupe patientCode rather than a collision. Of course, any suggestions / improvements / corrections are always welcome.
                                    Code:
                                    %NUMBERofPATIENTS = 10000
                                    FUNCTION PBMAIN () AS LONG
                                       LOCAL patientCode AS STRING
                                       LOCAL securePatientCode AS STRING * 20
                                       LOCAL ii, ii2, dupe AS LONG
                                       DIM dupeArr(1 TO %NUMBERofPATIENTS) AS STRING
                                       RANDOMIZE
                                       FOR ii2 = 1 TO 20
                                          FOR ii = 1 TO %NUMBERofPATIENTS
                                              patientCode = FORMAT$(RND * 100000, "00000") & FORMAT$(RND * 100000, "00000") 'random patient# to test
                                              CalcSHA1(patientCode, securePatientCode)                                      'hash algorithm makes secure conversion.
                                    
                                                                                            'Now convert the binary data to 10 numbers. Here I used eight bytes (QUAD)
                                                                                            'from the middle of the securePatientCode to get 10 decimal digits.
                                                                                            'You may easily choose any other of the bytes you wish, or do the
                                                                                            'conversion of 20-bytes-to-ten-digits some other way as you see fit.
                                              securePatientCode = MID$(FORMAT$(CVQ(MID$(securePatientCode, 5, 8)), "000000000000000000"), 3, 10)
                                              dupeArr(ii) = RTRIM$(securePatientCode)
                                       '       ? patientCode & " patient code" & $crlf & rtrim$(securePatientCode) & " secure code",,"Ten Examples..."
                                          NEXT
                                          ARRAY SORT dupeArr()
                                          FOR ii = 1 TO %NUMBERofPATIENTS - 1
                                             IF dupeArr(ii + 1) = dupeArr(ii) THEN INCR dupe
                                          NEXT
                                       NEXT
                                           ? STR$(dupe)
                                    END FUNCTION

                                    Comment


                                    • #19
                                      When we talk about collisions we must also talk about the consequences of collisions. It is a bit like quality control. It may be cheaper in the long run to let a certain percentage of televisions leave the factory not up to spec and replace them in the event of an early failure rather than have an expensive quality control to ensure that a very much smaller number get through.

                                      However, we should not be using that mind set when we talk about shuttle missions or, to the point, patient records. With such subjects I wouldn't want to hear about how many minutes, or whatever, it took to get a duplication. I want to hear that your motherboard fried first before it found one.

                                      In this context I'd use MD5 and I wouldn't throw any information away.

                                      > does it matter which five of the encoded 20 bytes are used to generate those 10 digits?

                                      If it did matter then the underlying algorithm would be in dead trouble.

                                      Comment


                                      • #20
                                        In this context I'd use MD5 and I wouldn't throw any information away.
                                        So, the best thing to do is to keep all those 20 bytes, then we're assured of no collisions. I used SHA-1 above rather than MD5, but that's certainly of less importance than the number of encoded bytes used from the chosen algorithm. The question for Peter therefore becomes can you either 1) incorporate those 20 bytes and insure no collisions, or 2) check the encoded 10-digit data for dupes like I did in the ARRAY SORT above, and fix potential duplicates that may arise?

                                        Comment

                                        Working...
                                        X