Constants for Random Numbers Generator


  • Constants for Random Numbers Generator

    Hi all,

The RND function built into PB (as in most other languages) is based on the linear congruential algorithm and natively produces a cyclic sequence of 2^31 Long integers (then reduced to floats in the interval 0-1).
The formula for successive random numbers is (similar to) the following:
X(n+1) = k*X(n) MOD m
where a "good choice" of the constants k and m is crucial for getting decent "pseudorandomness".
In the "32-bit" (LONG integer) based generator, m is usually (2^31-1).
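The recurrence above is easy to cross-check outside PB. Here is a minimal Python sketch (Python used purely for illustration), with the widely used Park-Miller constants k = 16807 and m = 2^31 - 1, which are one well-tested "good choice" for the 32-bit case:

```python
# Lehmer / linear congruential step: X(n+1) = k*X(n) MOD m
# Park-Miller "minimal standard" constants: k = 16807, m = 2^31 - 1
K, M = 16807, 2**31 - 1

def lcg(seed):
    x = seed
    while True:
        x = (K * x) % M
        yield x

gen = lcg(1)
first = next(gen)            # 16807 from seed 1
as_float = next(gen) / M     # reduced to a float in (0, 1), like RND
assert first == 16807
assert 0.0 < as_float < 1.0
```

A classic self-check for these particular constants: starting from seed 1, the 10,000th value of the sequence is 1043618065.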

Now I would like to build my own generator producing a much longer cyclic sequence, based on QUAD integers.
The choice for m would then be (2^63-1), but I need a "good" and tested value for k.

Does anybody have the right suggestion?

    P.S. Knuth gives the value for m = (2^64-1), but I need m = (2^63-1) so that I can use the signed QUAD datatype.

    P.P.S.
Sorry, I realize that when using standard QUAD arithmetic m can never be (2^63-1). It must be such that m*k < (2^63-1).
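That overflow constraint is easy to quantify. As an illustration (my own arithmetic, not from the post): since X(n) can be as large as m-1, the product k*X(n) fits in a signed QUAD only if k*(m-1) <= 2^63 - 1. For the usual 32-bit modulus m = 2^31 - 1, the largest safe k works out like this:

```python
# k * X(n) must fit in a signed QUAD; X(n) can be as large as m - 1,
# so the constraint "m*k < 2^63" really means k*(m-1) <= 2^63 - 1.
QUAD_MAX = 2**63 - 1
m = 2**31 - 1
k_max = QUAD_MAX // (m - 1)      # largest overflow-safe multiplier

assert k_max * (m - 1) <= QUAD_MAX       # k_max is safe
assert (k_max + 1) * (m - 1) > QUAD_MAX  # k_max + 1 would overflow
assert k_max == 2**32 + 4                # i.e. 4294967300
```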
    Last edited by Aldo Vitagliano; 7 Apr 2008, 06:54 AM.
    Aldo Vitagliano
    alvitagl at unina it

  • #2
Here are the Knuth constants for a generator that fits into a quad. It has a period of ~2^63; however, each random number is only the low 31 bits. Were you thinking of something like this?

    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    'Code corrected and reposted below (#5).
    END FUNCTION
    Last edited by John Gleason; 7 Apr 2008, 12:52 PM.



    • #3
      Have you seen this post?



      • #4
        Thank you for your replies.
        John's suggestion looks simple and would in principle be enough for me.
        Two questions:
1) I suppose that at the beginning it can be safely fed with any pair of LONG seeds. Is this true?
2) Do you have references for it?



        • #5
Aldo, I corrected the above code in this post: some of the variables should be QUADs. Also, in order to account for the 32nd bit, which is always 0 in the algorithm, below is code (not particularly optimized, btw) which handles it.

>>1) I suppose that at the beginning it can be safely fed with any pair of LONG seeds. Is this true?
Correct; doing that gives us potentially approx. 2^64 starting points in the sequence, but I believe the period is only 2^62 or 2^63.
>>2) Do you have references for it?
I have a PDF of Knuth's paper describing the algorithm. I'll try to find it ASAP.

          Code:
          #COMPILE EXE
          #DIM ALL
          
          FUNCTION PBMAIN () AS LONG
               LOCAL x1, kRnd, kRnd2, x0 AS QUAD
               LOCAL ii, kFin AS LONG
               x0 = 1995844334   'you supply these two long seeds
               x1 = 1588392233   '              "
               
               FOR ii = 1 TO 10
                    kRnd = (271828183 * x1 - 314159269 * x0) MOD &h7fffffff
                    x0 = x1
                    x1 = kRnd
                    kRnd2 = (271828183 * x1 - 314159269 * x0) MOD &h7fffffff
                    x0 = x1
                    x1 = kRnd2
                    SHIFT LEFT kRnd2, 1           'move a random bit into the 32nd bit spot
                    kRnd2 = kRnd2 AND &h80000000
                    kFin = kRnd + kRnd2
                    ? STR$(kFin)
               NEXT
               ? "k"
          END FUNCTION
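For reference, the two-term recurrence in the PB code above can be sketched in a few lines of Python (illustration only; note that Python's % always returns a nonnegative result, whereas PB's MOD takes the sign of the dividend, so the two versions won't produce numerically identical sequences):

```python
# Two-term recurrence from the code above:
#   x(n) = (a1*x(n-1) - a2*x(n-2)) mod (2^31 - 1), 31 random bits per step
M = 0x7FFFFFFF                    # 2^31 - 1
A1, A2 = 271828183, 314159269

def knuth2(x0, x1):
    while True:
        x0, x1 = x1, (A1 * x1 - A2 * x0) % M
        yield x1

gen = knuth2(1995844334, 1588392233)
r1, r2 = next(gen), next(gen)             # two 31-bit values
full32 = r1 + ((r2 << 1) & 0x80000000)    # splice a bit into the 32nd position
assert 0 <= r1 < M and 0 <= r2 < M
assert 0 <= full32 < 2**32
```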



          • #6
I'm afraid I haven't found my PDF with the info on the Knuth algo, but the link below describes it briefly. Look for gsl_rng_knuthran2 on the page:

            http://www.gnu.org/software/gsl/manu...enerators.html



            • #7
              Thanks,
but ... according to the link you provided, shouldn't there be a (+) sign rather than a (-) here?

              271828183 * x1 - 314159269 * x0



              • #8
>>shouldn't there be a (+) sign rather than a (-) here?

Hmmm... indeed it does look that way. I wonder if I got my formula from an older version (pre-2002), or if it is just an error on my part, or if it even matters, i.e. it works either way. Well, I've extensively tested the "-" algo and it passed all tests for randomness (like the "Diehard" and NIST tests), but it is a poser!
Well, until you or I can find out, here is another, simpler and faster one, and it's the one I use 99% of the time:

                Code:
                #COMPILE EXE
                #DIM ALL
                
                FUNCTION PBMAIN () AS LONG    'George Marsaglia random # gen. Period ~ 2^63
                     LOCAL mRnd AS QUAD
                     LOCAL x0, x1, x2 AS DWORD
                     LOCAL ii, x1Long AS LONG
                     x0 = 1995844334   'you supply this seed: 0 < seed < 2087494968
                     x1 = 3588392233   '              "     : 0 < seed < 4294967296
                     x2 = &h7C6CA538   'prime of special kind (Sophie Germain). You can choose any of millions where both x2 * 2^32 -1, and x2 * 2^31 -1 are prime
                
                     FOR ii = 1 TO 10
                          mRnd = x2 * x1 + x0
                          x0 = HI(DWORD, mRnd)
                          x1 = LO(DWORD, mRnd)     '<< this is your random DWORD.
                          x1Long = x1              'You can also convert the DWORD to a LONG if needed.
                          ? STR$(x1) & $CRLF & STR$(x1Long)
                     NEXT
                     ? "k"
                END FUNCTION
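The multiply-with-carry step in the code above is compact enough to mirror in Python (illustration only): the QUAD product a*x + c supplies the next x in its low 32 bits and the next carry in its high 32 bits, exactly as the HI/LO split does in the PB code.

```python
# Marsaglia multiply-with-carry: t = a*x + c; x' = t mod 2^32; c' = t >> 32
A = 0x7C6CA538                    # the Sophie Germain multiplier from the post

def mwc(c, x):                    # c, x correspond to x0, x1 in the PB code
    while True:
        t = A * x + c
        c, x = t >> 32, t & 0xFFFFFFFF
        yield x                   # one random DWORD per step

gen = mwc(1995844334, 3588392233)
vals = [next(gen) for _ in range(10)]
assert all(0 <= v < 2**32 for v in vals)
```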
                Last edited by John Gleason; 7 Apr 2008, 02:37 PM. Reason: removed save file code



                • #9
                  and then of course you can always change generators as you wish



                  • #10
                    I found the Knuth equation reference from The Laws of Cryptography with Java Code by Neal R. Wagner, p.102, and it shows a subtraction, not addition.

It's almost a certainty that both + and - are correct: given that the randoms are full-range LONGs, they are split 50-50 positive-negative (well, maybe one extra negative in there), and it doesn't matter whether you add a positive or subtract a negative. This postulate is borne out in the randomness tests, as both Knuth versions pass all tests I have, and that's a lot of tests.

There is still the issue, though, that you need to generate two Knuth randoms to get one full 32-bit random. If you need all 32 bits, this cuts its net speed by more than half. The Marsaglia randoms are 32-bit in one pass, have millions more 2^63 sequences, and also pass all the randomness tests.
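There is also a simple modular-arithmetic angle on the +/- question (my own check, not from the thread): modulo m, subtracting a2*x0 is identical to adding (m - a2)*x0, so the "-" form is itself a legitimate two-term generator, just with a different second multiplier.

```python
# Subtracting a2*x0 (mod m) equals adding (m - a2)*x0 (mod m)
M = 2**31 - 1
A1, A2 = 271828183, 314159269

for x0, x1 in [(1995844334, 1588392233), (1, 2), (123456789, 987654321)]:
    minus = (A1 * x1 - A2 * x0) % M
    plus = (A1 * x1 + (M - A2) * x0) % M
    assert minus == plus
```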



                    • #11
Thank you John. I have adopted the (+) version, because I only want positive numbers, and I also apply a further "shuffling" method as described in Numerical Recipes, section 7.1.
You said you have run a lot of tests for randomness. Do you have "ready to use" routines that you would be willing to share?



                      • #12
>>Do you have "ready to use" routines that you would be willing to share?

                        Sure Aldo, the three I mainly use are Diehard (new version), RaBiGeTe, and a big Chi Square algo I wrote based on the Ent test.

                        here's the link to download RaBiGeTe:
                        http://www.webalice.it/cristiano.pi/rabigete/

I can only find links to the old version of Diehard on the net right now, e.g.
                        http://www.stat.fsu.edu/pub/diehardd/
                        so if you'd like the new version, I can email it to you.

                        Send me a private message with your email if you want me to send you the new Diehard and/or the big Chi^2 program.

                        Added: I found a link to the new version of Diehard:
                        http://i.cs.hku.hk/~diehard/index.html
                        Last edited by John Gleason; 9 Apr 2008, 06:48 AM. Reason: added link



                        • #13
                          Thanks again.
I downloaded the Diehard2 package from the link you provided. Then I generated two huge binary files using the gsl_rng_knuthran2 generator (the + version), with further shuffling of the output, and I submitted the files to the full set of tests.
From the little I understood of the meaning of the "p-values", everything should be OK, but I would like to get a slightly better understanding. Is there a practical way of converting the summary of 269 p-values given at the end of the report into a statement such as: "From A to E, your sequence of 10^11 bits is A-class random"?



                          • #14
                            Originally posted by Aldo Vitagliano View Post
Is there a practical way of converting the summary of 269 p-values given at the end of the report into a statement such as: "From A to E, your sequence of 10^11 bits is A-class random"?
                            If your summary p-values pass their KStest, you can be quite assured any subset of your data will pass, but you can't determine that specifically from the summary p-values. You would need to extract the characters A to E from your 10^11 test file bits to a new file in their original order, then carefully up-convert that file to binary (base 256--all characters) so you can rerun Diehard on that specific subset. However, correctly up-converting base5 data to base256 data will likely be tricky.
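One possible way to do that up-conversion (a sketch of my own, not from the post) is to treat the A-E stream as the digits of one large base-5 integer and re-express it in base 256. Real data would have to be handled in fixed-size chunks, and the boundary handling is exactly where it gets tricky:

```python
# Treat 'A'..'E' as base-5 digits 0..4 and repack into full-range bytes.
def base5_to_bytes(symbols):
    n = 0
    for ch in symbols:
        n = n * 5 + (ord(ch) - ord('A'))      # 'A'..'E' -> 0..4
    length = max((n.bit_length() + 7) // 8, 1)
    return n.to_bytes(length, 'big')

assert base5_to_bytes("B") == b'\x01'
assert all(0 <= b <= 255 for b in base5_to_bytes("EDCBAEDCBA"))
```

Note that leading 'A's (digit 0) vanish with this naive packing -- "AB" and "B" produce the same bytes -- which is one source of the bias that makes a correct conversion tricky.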

One other point: Diehard only tests the first ~275MB of your files, so you can split your 10^11 bits into many files of, say, 290MB to be safe, then test some of those to be sure you have a good algo. If you test, say, 10 files, the 10 overall p-values (each obtained by applying the KStest to that file's 269 p-value summary) should be roughly evenly distributed from 0 to 1.
                            Last edited by John Gleason; 10 Apr 2008, 08:34 AM. Reason: I noticed file size is in bits, not bytes



                            • #15
Ooops ... I definitely counted too many bits: it was a sequence of 272 Mbytes, so slightly more than 2e9 bits.

So, if I understand correctly, the ultimate test is based on the uniform distribution of the p-values of the KS test made on the sets of 269 p-values obtained from the individual tests. So the Diehard test should be repeated many times on different sequences of data produced by the same generator, and the final p-values collected and examined for uniform distribution.
Is this correct?



                              • #16
>>the final p-values collected and examined for uniform distribution. Is this correct?

                                Yes, that would be among the most stringent proofs for quality using the Diehard tests.

A note about the A-E distribution you mentioned: using your 272MB file for example, 5 of every 256 bytes will be "A" thru "E" (CHR$(&h41) to CHR$(&h45)), or about 2%, i.e. 5.3MB of your random bytes. But you can get a much higher return rate of those 5 characters by, for example, using the two nybbles of each byte to represent A thru E whenever (nybble AND 7) has a value of 1 thru 5.

                                Here is an illustration showing this to get many times the output (86% of bytes succeed in producing at least one, 39% produce two A-E results):
                                Code:
                                #COMPILE EXE
                                #DIM ALL
                                
                                FUNCTION PBMAIN () AS LONG
                                    LOCAL ii, x, a2e1, a2e2 AS LONG, rndbytes AS STRING
                                
                                    FOR ii = 21 TO 40
                                       rndBytes = "gv´œÍHµ„¾à‰vá°FLþä#n®~ë[PÃyV‘%¥º9‘÷á|vf:þUóÁav¬à[email protected]üIR”Wô°„Edé‡"
                                       x = ASC(rndBytes, ii)     'next byte from rand string
                                       a2e1 = x AND &b000000111  'low nybble AND 7
                                       a2e2 = x AND &b001110000  'hi  nybble AND 7
                                       SHIFT RIGHT a2e2, 4       'move to low nybble (0-7)
                                
                                       SELECT CASE a2e1
                                          CASE 1: ? "low A"
                                          CASE 2: ? "low B"
                                          CASE 3: ? "low C"
                                          CASE 4: ? "low D"
                                          CASE 5: ? "low E"
                                          CASE ELSE: ? "No low match this byte"
                                       END SELECT
                                
                                       SELECT CASE a2e2
                                          CASE 1: ? "hi A"
                                          CASE 2: ? "hi B"
                                          CASE 3: ? "hi C"
                                          CASE 4: ? "hi D"
                                          CASE 5: ? "hi E"
                                          CASE ELSE: ? "No hi match this byte"
                                       END SELECT
                                    NEXT
                                END FUNCTION



                                • #17
                                  Originally posted by John Gleason View Post
                                  >>the final p-values collected and examined for uniform distribution. Is this correct ?
                                  Yes, that would be among the most stringent proofs for quality using the Diehard tests.
                                  OK. Fine.
                                  A note about the A-E distribution you mentioned: using your 272MB file for example, 5 of every 256 bytes will be "A" thru "E" (CHR$(&h41) to CHR$(&h45)) or about 2% or 5.3MB of your random bytes...
AhAh... Sorry, I didn't mean generating random characters A ... E.
I had in mind a response assigning a mark, or grade (as after an exam), to the "quality of randomness" of the sequence: A = very good ... E = very bad.
                                  Regards
                                  Aldo



                                  • #18
                                    >>or grade (as after an exam) to the "quality of randomness"

                                    I see the idea now. Well, you have some code there as a starting point if you ever do need to limit the output characters.

As far as the grade goes, the final KStest is mostly pass-fail. It is very unforgiving and usually gives an E (.000000 or 1.000000) if there's even a tiny flaw in the algo. An overall result of e.g. .001045 is highly suspect (a D- maybe), but will occur by chance ~1 in 1000 times (.998955 being its opposite). But e.g. .990842 or .010322 occur with seeming regularity and can easily be part of an A algo. It's not a "failure at the 99% confidence level" as you might hear in statistics reporting. In fact, a good algo must produce those values along with all the others in the linear distribution.

Sort and graph your 269 summary p-values and they should form a straight line from 0 to 1. So should e.g. 100 overall p-values from 100 separate 269-p-value summaries. It's like a recursion: no matter how deep you go, the p-values should always be uniformly distributed.
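The "straight line" check described above is essentially a one-sample KS statistic against the uniform distribution. Here is a small Python sketch of that comparison (mine, for illustration only):

```python
# KS distance between sorted p-values and the uniform CDF (the straight line)
def ks_uniform(pvals):
    xs = sorted(pvals)
    n = len(xs)
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))

uniform_ish = [i / 270 for i in range(1, 270)]   # 269 evenly spread p-values
clumped = [0.001] * 269                          # what a flawed algo produces
assert ks_uniform(uniform_ish) < 0.01            # hugs the straight line
assert ks_uniform(clumped) > 0.9                 # far from it
```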

If, however, you run the test 3 times in a row and get for the overall: .010332, .009233, .012322... D- or E. It has a virtually guaranteed flaw. Though the algo must be extremely random to achieve even those D-/E values, check that code carefully, because some glitch will likely turn up.
                                    Last edited by John Gleason; 10 Apr 2008, 02:31 PM.



                                    • #19
                                      I'd like to chip in here.

                                      All the tests above are based upon our assuming that the data we are considering is random data.

                                      It is the job of the tests to disprove our assumption, not prove it.

                                      If the DieHard routines, for example, fail to disprove our assumption then we may conclude that no evidence has been found to cast doubt upon our assumption.

                                      Given that a set of data is random then we may be able to say that it possesses, for example, characteristics x, y and z.

                                      However, given that characteristics x, y and z exist does not prove randomness. If one or more of the characteristics does not exist then this will cast doubt on the randomness of the data.

                                      If we fall into the body of a probability distribution then we are not able to draw any conclusions because that is where we expected to be. On the other hand, if we fall into a tail of a probability distribution then we may draw conclusions because that is where we are not expected to be.

                                      If we have a process where there is some evidence that it is not random we could have another process where we have stronger evidence that it, in turn, is not random. What we cannot say is that the first process is more random than the second when we doubt that the first process is indeed random.

                                      We can grade the doubt of randomness but we cannot grade randomness itself.
                                      Last edited by David Roberts; 10 Apr 2008, 11:50 PM.



                                      • #20
                                        >>... We can grade the doubt of randomness but we cannot grade randomness itself.

That's an interesting point, because it raises the question: once you have an algorithm that tests without statistical flaws, can any other algorithm surpass it? It seems you could reasonably base a further criterion on period length, saying that a flawless algo with a period of 2^256 is "better" than one with a period of 2^64. But other than that, once you've hit the ceiling, there's apparently no place higher to go.

If I had to place various generators on a linear scale of randomness doubt--especially if we include mechanical "true randomness" generators--with E being "I am virtually certain that data is not random" and A being "That data looks random to me," here is my take:
                                        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
                                        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
                                        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
                                        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEDCBA.
The data is a statistical disaster 99+% of the time. If you doubt this, create a few generators yourself and then test them. Prepare to be epically humbled.

There is one other characteristic, called divergence, that can distinguish PRNGs. Some generators--the Mersenne Twister in particular--can get into regions of very low change where they become statistically flawed. But if you avoid those areas (an easy task given a period of over 2^19000), that flaw won't appear.
                                        Last edited by John Gleason; 11 Apr 2008, 09:04 AM.

