Xorshift128+ PRNG

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • David Roberts
    Member
    • Apr 2003
    • 5723

    Xorshift128+ PRNG

    The following is from Further scramblings of Marsaglia's xorshift generators published in April 2014.. If you click on that you'll download a pdf. It is the last entry in Wikipedia's Xorshift references.

    Code:
    #include <stdint.h>
    uint64_t s[2];
    uint64_t next(void) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23; // a
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return s[1] + s0;
    }
    Not being a 'C' guy I found 's[1] = ..." a bit intimidating but found the following interpretation here a little more readable.

    Code:
    uint64_t a = s[0];
    uint64_t b = s[1];
    
    s[0] = b;
    a ^= a << 23;
    a ^= a >> 18;
    a ^= b;
    a ^= b >>  5;
    s[1] = a;
    
    return a + b;
    Checking the second code line for line it does the same job as the first code.

    PB has no idea what uint64_t is; being a 32 bit compiler. However, it does know what the mmx registers are. These are the 64 bit registers which overlay the FPU registers in a similar way that the 16 bit CPU registers overlay the 32 bit CPU registers. PB may know but I knew nothing of them so had a spot of reading to do if I wanted a PB version of the above code.

    I have some working code, based upon the second code, and a 20MB dump tested as follows.

    Code:
    xorshift128+
     
    Chi square distribution for 20971520 samples is 259.74, and randomly
    would exceed this value 40.59 percent of the times.
     
    Arithmetic mean value of data bytes is 127.5088 (127.5 = random).
    Monte Carlo value for Pi is 3.141918482 (error 0.01 percent).
    Serial correlation coefficient is 0.000092 (totally uncorrelated = 0.0).
     
    Intel RdRand
     
    Chi square distribution for 20971520 samples is 286.86, and randomly
    would exceed this value 8.30 percent of the times.
     
    Arithmetic mean value of data bytes is 127.5151 (127.5 = random).
    Monte Carlo value for Pi is 3.141483606 (error 0.00 percent).
    Serial correlation coefficient is 0.000111 (totally uncorrelated = 0.0).
    On the face of it it seems that the working code is doing as intended.

    I am still very much at the experimental stage and would appreciate one or more of you to check my 'engine' for any errors and suggest improvements.

    I use 'Dim sState(0 to 3) as Dword' with s[0], 64 bit, as sState(0) + sState(1) and s[1], 64 bit, as sState(2) + sState(3).

    I then use
    Code:
    BigS0Ptr = Varptr( sState(0) )
    BigS1Ptr = Varptr( sState(2) )
    and here is my 'engine'.

    Code:
    !mov eax, BigS0Ptr
    !movq mm6, [eax] ' a = s[0]
    !mov edx, BigS1Ptr
    !movq mm7, [edx] ' b = s[1]
    !movq [eax], mm7 ' s[0] = b
    !movq mm0, mm6 ' mm0 = a
    !psllq mm0, 23 ' a << 23
    !movq mm1, mm6 ' mm1 = a
    !pxor mm1, mm0 ' mm1 = a ^= a << 23
    !movq mm0, mm1 ' copy mm1
    !psrlq mm1, 18 ' a >> 18
    !pxor mm0, mm1 ' a = ^= a >> 18
    !pxor mm0, mm7 ' a = ^= b
    !movq mm6, mm7 ' copy mm7 ie b
    !psrlq mm7, 5 ' b >> 5
    !pxor mm0, mm7 ' a ^= b >> 5
    !mov eax, BigS1Ptr
    !movq [eax], mm0 ' s[1] = a
    !movq2dq xmm0, mm0 ' a
    !movq2dq xmm1, mm6 ' b
    !emms
    !paddq xmm0, xmm1 ' a + b; xmm0 is now our 64 bit random number
    I have other code to generate single precision values. The first request generates a 64 bit random number of which one dword is used. The next request uses the other dword. What we have then is a 'Big crunch' pass, a 'Little crunch' pass, a 'Big crunch' pass and so on.

    Is this worth the effort? It looks like it. My CMWC256 is 1.77 times faster than RND. Xorshift128+ is coming in at 2.85 times faster than RND.

    This is what my machine knocks out.

    RND: 83 million singles per second.
    CMWC256: 147 million singles per second
    Xorshift128+: 237 million singles per second.

    In addition, xorshift128+ knocks out double precision values a litlle more than twice as fast as RND knocks out singles.

    I haven't tested a long range yet but expect that to go through the roof.

    I'd like to get that 237 to 250 so that I can say 1GB (decimal definition) singles in four seconds.

    Some time this year Sebastiano Vigna introduced xoroshiro128+ which is even faster but since the ink isn't dry on that yet I'd rather leave it for a while. The JavaScript engines of Chrome, Firefox and Safari are based on xorshift128+.

    So, gentlemen knock the stuffing out of my 'engine'. It is the first mmx code that I have written so, assuming no cardinal errors, may lend itself to some improvement.

    I have posted in the Programming forum, as opposed to the Source Code Library, because I don't want anyone to rely on it.

    My ultimate aim is to produce a SLL along the lines of CMWC256. It will not be intended to replace CMWC256 as Xorshift128+ has a period of 2^128-1 compared with CMWC256's period of 2^8208. 2^128-1 is not small - 2^128 = (2^32)^4 = RND^4.
    Last edited by David Roberts; 23 May 2016, 07:20 PM.
  • David Roberts
    Member
    • Apr 2003
    • 5723

    #2
    Long Range

    CMWC256: 3.84 x RND(a,b)
    Xorshift128+: 6.24 x RND(a,b)

    On my machine

    RND(a,b): 36 million per second
    Xorshift128+: 226 million per second

    assuming no flaws in my engine.

    Comment

    • Stuart McLachlan
      Member
      • Mar 2000
      • 9468

      #3
      > PB has no idea what uint64_t is; being a 32 bit compiler.

      How about using a QUAD?
      =========================
      https://camcopng.com
      =========================

      Comment

      • Michael Mattias
        Member
        • Aug 1998
        • 43447

        #4
        >Is this worth the effort? It looks like it. My CMWC256 is 1.77 times faster than RND. Xorshift128+ is coming in at 2.85 times faster than RND.

        How many times do you have to do it, how often? How many "minutes saved per day" or "seconds saved per average run" does that represent? can you put a dollar figure on it? Or maybe "user satisfaction points?"

        That will tell you if it is "worth it" or not...and depending on what measure you use for "worth."

        e.g., if this runs ten times per day obtaining one value each time, any work over and above "simplest code" is probably not worth it. OTOH, if this has to generate 100 million numbers while a user is waiting for it, now a speed improvement probably is worth it, as nothing bothers users as much as an hourglass cursor and the software apparently "not doing anything."

        But 100 million numbers needed, run automatically, unattended, after business hours? Who really cares if it takes an extra minute or two?

        I think I "may" have posited from time to time, "all optimization is application-specific;" this would be no exception.
        Michael Mattias
        Tal Systems (retired)
        Port Washington WI USA
        [email protected]
        http://www.talsystems.com

        Comment

        • Eric Pearson
          Member
          • Oct 1987
          • 10347

          #5
          I have hidden two posts, pending the resolution of a conflict between two members.
          "Not my circus, not my monkeys."

          Comment

          • David Roberts
            Member
            • Apr 2003
            • 5723

            #6
            My removed post had two responses.

            My response to Stuart was:

            Hi Stuart

            How about using a QUAD?
            uint64_t is unsigned, QUAD is signed.

            The response to Mr Mattias was a little longer.

            Comment

            • John Gleason
              Member
              • Aug 2002
              • 2192

              #7
              Hey David:
              The code looks good and matches output exactly with the all-PB test code below. But as I noted below, my code is perhaps 20x slower or more than your code, so just use it if you want to verify your code.

              This also kind of addresses Stuart's thought to use QUAD's. Yes, it can be done, but is very slow. You could maybe even do it as an unsigned QUAD in an EXT data type, but all the shifts would need to be done by multiplication and division, and the XOR's have to be outsourced so that would probably be even slower still.

              And as a matter of fact, your code generates the fastest pseudo random bytes of any (decent) generator I've ever seen, even faster by maybe 40 or 50% than the previous fastest asm mult-with-carry 2^63 LONG's period Marsaglia generator we've used before. Congrats!

              Code:
              'check version for David's code above.
              'This is done using regular signed quads and is, what, 20 or more times slower than David's code.
              #COMPILE EXE
              #DIM ALL
              
              FUNCTION PBMAIN () AS LONG
                 LOCAL ii, bigS0ptr, bigS1ptr, htxt, qPtr AS LONG
                 LOCAL s0, s1, a, b AS QUAD
                 LOCAL shlA23, shrA18, shrB5, randQ, randQ2 AS QUAD
                 TXT.WINDOW("XorShift+ Rand Generator - hit ESC to exit", 150, 100) TO hTxt
                 DIM qArr(&h100000-1) AS QUAD
              
              '  OPEN "xorShiftRandCheck.dat" FOR BINARY LOCK WRITE AS #1   'uncomment to write rands to a file
              
                 s0 = &h63F62BE4D75E5D7B  'two random quad seeds. They will be handled tho as unsigned, not signed as PB does normally
                 s1 = &hE1DA37F3B33B6258  '        "                           "                                     "
                 bigS0ptr = VARPTR(s0)
                 bigS1ptr = VARPTR(s1)
              
                 DO
                    a = s0                                      'do the assigns, SHIFT's, and XOR's ...
                    b = s1
                    s0 = b
              
                    shlA23 = a
                    SHIFT LEFT shlA23, 23
                    a XOR= shlA23
              
                    shrA18 = a                                  '             "              "      ...
                    SHIFT RIGHT shrA18, 18
                    a XOR= shrA18
              
                    a XOR= b
              
                    shrB5 = b
                    SHIFT RIGHT shrB5, 5
                    a XOR= shrB5
              
                    s1 = a                                      '             "              "
              
                    randQ = LO(DWORD, a) + LO(DWORD, b)         'and finally add the two quads; ignore the possible bit-64 carry
                    randQ2 = HI(DWORD, a) + HI(DWORD, b)
                    IF randQ >= &h100000000 THEN INCR randQ2    'but here can't ignore the bit-32 carry, which must be added if present
                    qPtr = VARPTR(qArr(ii))
                    POKE DWORD, qPtr, LO(DWORD, randQ)          'and save the resultant DWORD totals to the output random quad: qArr(ii)
                    POKE DWORD, qPtr+4, LO(DWORD, randQ2)       'which is the same hex value as if it were an unsigned quad
              
                    INCR ii
                    IF ii = &h100000 THEN
                       TXT.PRINT "o";
                       ii = 0
                       PUT #1,, qArr()
                       IF TXT.INKEY$ = $ESC THEN EXIT FUNCTION
                    END IF
                 LOOP
              
              
              END FUNCTION

              Comment

              • David Roberts
                Member
                • Apr 2003
                • 5723

                #8
                Thank you, John.

                I will now knock out xorshift128+.sll. All I have to do is lift the CMWC256 1.6L engine out of CWMC256.sll and drop in the xorshift128+ 2.0L engine.

                After I have done that I do not think that I will be able to resist putting in a xoroshiro128+ 2.5L engine. From what I have read xoroshiro128+ may be 20% faster than xorshift128+. 280 million plus singles per second, perhaps? If I can do that we will sail past 1Gig in 4 seconds.

                Monte Carlo methods will love that.

                Mr Mattias: Check out Monte Carlo method.

                Uses of Monte Carlo methods require large amounts of random numbers, and it was their use that spurred the development of pseudorandom number generators, which were far quicker to use than the tables of random numbers that had been previously used for statistical sampling.

                Comment

                • David Roberts
                  Member
                  • Apr 2003
                  • 5723

                  #9
                  Ran your prog John and told it to pull out after printing 25 o's - 200MB.

                  ------
                  Chi square distribution for 209715200 samples is 274.57, and randomly
                  would exceed this value 19.09 percent of the times.

                  Arithmetic mean value of data bytes is 127.5027 (127.5 = random).
                  Monte Carlo value for Pi is 3.141237861 (error 0.01 percent).
                  Serial correlation coefficient is 0.000004 (totally uncorrelated = 0.0).
                  ------

                  No complaints there - the serial correlation coefficient has nearly vanished.

                  Comment

                  • Michael Mattias
                    Member
                    • Aug 1998
                    • 43447

                    #10
                    I interpreted ....
                    Is this worth the effort?
                    ... as a non-rhetorical question soliciting opinion.

                    I have been answering that question for 40+ years on a professional basis. I felt obligated to 'give something back.'
                    Michael Mattias
                    Tal Systems (retired)
                    Port Washington WI USA
                    [email protected]
                    http://www.talsystems.com

                    Comment

                    • David Roberts
                      Member
                      • Apr 2003
                      • 5723

                      #11
                      It was a rhetorical question.

                      If you do not need a fast generator then you do not need a fast generator.
                      If you need a fast generator then you need a fast generator.

                      In the first case Xorshift128+ is worthless.
                      In the second case Xorshift128+ is worth a great deal.

                      A glass of water is worthless in a bar but worth a great deal in the Kalahari Desert.

                      I don't target - I just code. If it takes your fancy then it takes your fancy - if it doesn't, it doesn't.

                      Comment

                      • David Roberts
                        Member
                        • Apr 2003
                        • 5723

                        #12
                        After I have done that I do not think that I will be able to resist putting in a xoroshiro128+ 2.5L engine.
                        That is not going to happen.

                        Note that the generator uses a simulated rotate operation, which most C compilers will turn into a single instruction. In Java, you can use Long.rotateLeft(). In languages that do not make low-level rotation instructions accessible xorshift128+ could be faster.
                        In our case the simulated rotate instruction remains simulated resulting in xoroshiro128+ being 6% slower than xorshift128+.

                        Comment

                        • Gene Morgan
                          Member
                          • Jan 2016
                          • 2

                          #13
                          David,

                          Thanks so much for posting and sharing your work. Seeing how you deconstructed the C code was really helpful. I hope folks won't mind a some questions.

                          Q. Do we know what algorithm is used in PowerBASIC for it's RND function? I'm going to assume that for randon number sequences of a couple of dozen numbers it is fine and generates suitable values for card game simulations and dice roles. Maybe not so for Monty Carlo simulations or AI specific applications?

                          Q. Is the xorshift128+ algorithem considered the current state of the art in PRNG. I know things are constantly evolving so maybe not a valid question but I will ask anyway.

                          Q. How do you suspose it compares with the PRNG in Microsoft Excel? I've done a bit of research into how Excel does random numbers and Microsoft won't reveal their secrets so all I've been able to find is speculation and no real evaluation as to the quality of Excel's PRNG.

                          Gene

                          Comment

                          • David Roberts
                            Member
                            • Apr 2003
                            • 5723

                            #14
                            Hi Gene

                            Do we know what algorithm is used in PowerBASIC for it's RND function?
                            No, but I suspect that it is a Linear Congruential Generator (LCG).

                            I'm going to assume that for randon number sequences of a couple of dozen numbers it is fine and generates suitable values for card game simulations and dice roles
                            Yes, it is. However, I take the view that since Xorshift128+ is so easy to add to our applications then whether or not RND is suitable is rendered academic.

                            Is the xorshift128+ algorithem considered the current state of the art in PRNG.
                            Xorshift128+ has been superceded by Xoroshiro128+ according to the author. However, this seems to be the case for very large requests of random numbers ie Terabytes. For more down to earth requests I reckon that Xorshift128+ is the better with PB implementations.

                            How do you suspose it compares with the PRNG in Microsoft Excel?
                            Microsoft do not hurry in taking on board new ideas. AES and SHA2 were a long time coming. Having said that their cryptographic generator has kept very good pace with NIST recommendations over the years. I suspect then that the PRNG in Excel is far from state of the art.

                            Check out the follow up to this thread here and its related comments thread.

                            Comment

                            • David Roberts
                              Member
                              • Apr 2003
                              • 5723

                              #15
                              suitable values for card game simulations and dice roles
                              In such cases, where speed of generation is not a requirement, then I would favour a CPRNG over a PRNG.

                              Here are three:

                              XP SP3 and above: RtlGenRandom: Singles, Doubles and Range post #9
                              Vista and above: BCryptGenRandom: Singles, Doubles and Range post #2
                              XP SP3 and above & Intel's RdRand: Random numbers par excellence on steroids post #1 with revision at post #4

                              All use two buffers: Whilst one is being stripped the other is being filled. The other's first half is filled by one thread of execution and the other's second half is filled by another thread of execution.

                              Whilst not as fast as RND for single precision but still a respectable speed all three are actually faster with range ie the equivalent of RND's RND(a,b).

                              Comment

                              • John Gleason
                                Member
                                • Aug 2002
                                • 2192

                                #16
                                I stumbled across this test program called Practically Random then tried it out, and surprisingly ran into an apparent problem. It seems to fail a test called B-rank consistently. I've tried it several times and don't think it's a fluke. Maybe you could check it out too. http://pracrand.sourceforge.net/

                                Its usage syntax is demo'd below in the lines beginning with C:\... (you would use your own path), as is its output which follows. It runs in a console window.
                                Code:
                                C:\msvc12_32bit>type xorshift.dat | rng_test.exe stdin
                                RNG = RNG_stdin, PractRand version 0.92, seed = 0xd250a6e9
                                test set = normal, folding = standard(unknown format)
                                
                                rng=RNG_stdin, seed=0xd250a6e9
                                length= 32 megabytes (2^25 bytes), time= 2.0 seconds
                                  no anomalies in 130 test result(s)
                                
                                rng=RNG_stdin, seed=0xd250a6e9
                                length= 64 megabytes (2^26 bytes), time= 5.0 seconds
                                  Test Name                         Raw       Processed     Evaluation
                                  [Low4/32]BCFN(2+1,13-5,T)         R= +10.0  p =  4.1e-4   unusual
                                  [Low1/32]BRank(12):384(0)         R= +1272  p~=  5.4e-384   FAIL !!!!!!!
                                  ...and 137 test result(s) without anomalies
                                
                                C:\msvc12_32bit>type xorshift+rands.dat | rng_test.exe stdin
                                RNG = RNG_stdin, PractRand version 0.92, seed = 0x51b6ab5d
                                test set = normal, folding = standard(unknown format)
                                
                                rng=RNG_stdin, seed=0x51b6ab5d
                                length= 32 megabytes (2^25 bytes), time= 2.0 seconds
                                  no anomalies in 130 test result(s)
                                
                                rng=RNG_stdin, seed=0x51b6ab5d
                                length= 64 megabytes (2^26 bytes), time= 5.0 seconds
                                  Test Name                         Raw       Processed     Evaluation
                                  [Low1/32]BRank(12):384(0)         R= +1272  p~=  5.4e-384   FAIL !!!!!!!
                                  ...and 138 test result(s) without anomalies
                                
                                rng=RNG_stdin, seed=0x1134e0a2
                                length= 512 megabytes (2^29 bytes), time= 36.0 seconds
                                  no anomalies in 171 test result(s)
                                ...and in another test where I cut out the first 100MB to see if it was some kind of seeding problem,
                                it made it to one GB, but then:
                                Code:
                                rng=RNG_stdin, seed=0x1134e0a2
                                length= 1 gigabyte (2^30 bytes), time= 69.1 seconds
                                  Test Name                         Raw       Processed     Evaluation
                                  [Low1/8]BRank(12):1536(0)         R= +1272  p~=  5.4e-384   FAIL !!!!!!!
                                  ...and 182 test result(s) without anomalies

                                Comment

                                • John Gleason
                                  Member
                                  • Aug 2002
                                  • 2192

                                  #17
                                  >Q. How do you suppose it compares with the PRNG in Microsoft Excel?

                                  Gene, a while back I ran some stat tests on the output of Excel's rand numbers and it passed them all. However, I didn't know about the Practically Random tests, so am unsure how they would fare there. These relatively new tests look to be the strongest and most difficult to pass I've seen.

                                  Comment

                                  • David Roberts
                                    Member
                                    • Apr 2003
                                    • 5723

                                    #18
                                    F:\msvc12_32bit>type dump.txt | rng_test.exe stdin
                                    RNG = RNG_stdin, PractRand version 0.92, seed = 0x39b2362f
                                    test set = normal, folding = standard(unknown format)

                                    That is as far as I got then after a few seconds RNG_test stopped working. The 64bit version also stopped working.

                                    At the end of the BRank explanation we have: "Due to the coarse-grained nature of the results it produces, precise p-values are impossible for most of its subtests."

                                    If that is all that it failed and consistently I would question that particular test.

                                    We are supplying a dump, so what does seeding have to do with it?

                                    Comment

                                    • David Roberts
                                      Member
                                      • Apr 2003
                                      • 5723

                                      #19
                                      Downloaded version 0.91 which is 32 bit only and it worked.

                                      I am getting BRank failures for Xorshift128+ and Xoroshiro128+. I am not happy at getting p~= 5.4e-384 all the time when they do fail.

                                      With regard my CMWC256, a 128MB dump gave this:

                                      F:\msvc12_32bit>type dump.dat | RNG_test stdin
                                      RNG = RNG_stdin, PractRand version 0.91, seed = 0xe32ccc41
                                      test set = normal, folding = standard(unknown format)

                                      rng=RNG_stdin, seed=0xe32ccc41
                                      length= 32 megabytes (2^25 bytes), time= 2.1 seconds
                                      Test Name Raw Processed Evaluation
                                      [Low1/8]DC6-9x1Bytes-1 R= +5.5 p = 7.4e-3 unusual
                                      ...and 129 test result(s) without anomalies

                                      rng=RNG_stdin, seed=0xe32ccc41
                                      length= 64 megabytes (2^26 bytes), time= 4.7 seconds
                                      no anomalies in 139 test result(s)

                                      rng=RNG_stdin, seed=0xe32ccc41
                                      length= 128 megabytes (2^27 bytes), time= 9.3 seconds
                                      no anomalies in 151 test result(s)

                                      That confirms that CMWC256 always was a winner.

                                      I downloaded some quantum random numbers from QRNG Service

                                      F:\msvc12_32bit>type 100MB.dat | RNG_test stdin
                                      RNG = RNG_stdin, PractRand version 0.91, seed = 0x6bb344
                                      test set = normal, folding = standard(unknown format)

                                      rng=RNG_stdin, seed=0x6bb344
                                      length= 32 megabytes (2^25 bytes), time= 2.1 seconds
                                      no anomalies in 130 test result(s)

                                      rng=RNG_stdin, seed=0x6bb344
                                      length= 64 megabytes (2^26 bytes), time= 4.6 seconds
                                      no anomalies in 139 test result(s)

                                      I was hoping to get a BRank failure and then I could knock that test.

                                      Interesting. Thanks, John.

                                      Comment

                                      • David Roberts
                                        Member
                                        • Apr 2003
                                        • 5723

                                        #20
                                        I have been dumping DWORDS. The analysis is, of course, on bytes

                                        If I dump 128MB of Rand.R(0,255), ie bytes, I don't get any BRank failures. A few tests were done and one 64MB test saw a '[Low1/32]DC6-9x1Bytes-1 R= -5.3 p =1-1.5e-3 unusual', as did the single test on CMWC256, but, other than that no other anomalies.

                                        These relatively new tests look to be the strongest and most difficult to pass I've seen.
                                        A single test on PB's RND(0,255) with 32MB of data saw 15 failures, 4 very suspicious, 3 suspicious and 2 unusual on 130 tests.







                                        Comment

                                        Working...
                                        X