Announcement

Collapse
No announcement yet.

Has anyone ported the PCG PRNG to PowerBASIC

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Has anyone ported the PCG PRNG to PowerBASIC

    NOTE: This is duplicate of the request I've posted on the FreeBASIC forums. If someone there has done it, THAT I can port that to PowerBASIC. But the original C code is not something I'm good with.

    Has anyone ported the PCG PRNG (simple or complete version) from C99 to a version for PowerBASIC that they would like to share?

    PCG, A Family of Better Random Number Generators
    http://www.pcg-random.org/

    Not really a C guy ... so wanted to know before I give it a try.

    Melissa O'Neill has created an amazing family of PRNG's and CPRNG's, written in pure C99 and/or C++ (no ASM involved), that are faster and have longer periods of non-repeating sequences than any other RNG out there. These include PRNGs that can provide multiple non-collision streams from a single instance.

    She has not created an academic exercise or just published a couple of pseudo-code algorithms. These are multiple RNG algorithms that are coded for the real world, have multiple examples, and work as advertised - and it's all very well documented. [grin]

    She has a long whitepaper (47 pages) on her site that is a must read ... a lot of incredible information concerning her family of RNGs - as well as analysis, tests and statistics comparing her RNGs against all other ones that are known.

    She has a simple version that is incredible in itself, but her others versions are beyond compare. I've compiled them and her demo/test applications to see them in operation - but I've never ported C code to BASIC code - (but I'm getting ready to try unless someone else has already done so).

    (In case no one's noticed - I'm really getting interested in this random number generation stuff - [grin])

    Thanks,
    BEH

  • #2
    Try this

    https://forum.powerbasic.com/forum/u...cussion-csprng

    Comment


    • #3
      I have done a fair amount with random number generation on this forum over the years including Marsaglia's Complimentary-Multiply-With-Carry (CMWC) base 2^32-1, following John Gleason's introduction to MWC, and more recently Xorshift128+ and Xoroshiro128+. I have shied away from the latter two as they fail PractRand at 64GB and I have not been able to correct it. However, I have not looked at PCG and probably won't - I have too much on at the moment. If time permits I will have a look though.

      I have done some work over at the FreeBASIC form. As you probably know their implementation of the Mersenne Twister is written in FreeBASIC. It has a glaring weakness, as most implemetations, by using only a 32 bit seed. I changed that to 624 32 bit seeds to saturate the state vector and replaced the 'engine' with assembler.

      I ported my PowerBASIC CryptoRndBufferCNG to FreeBASIC. This uses Microsoft's BcryptGenRandom and two buffers. Whilst one buffer is being read the other is filled in a separate thread of execution. This has recently been updated to thread pooling as opposed to thread creation.

      Here is a comparison with FreeBASIC's bulit in generators and my contributions.
      Code:
      Random number generation
      Millions per second
      
      CRT       1      69.410 C runtime library's rand() function
      Fast      2     107.592 Multiply-with carry
      Mersenne  3      87.402 Mersenne Twister
      Qbasic    4      99.558 Similar to QBasic's method
      RndMTS          123.970 Mersenne Twister ASM engine
      CryptoRnd       219.453 BCryptGenRandom Thread creation
      CryptoRndII     507.119 BCryptGenRandom Thread pooling
      As you can see CryptoRndII has cleared the half billion per second barrier. It passes the PractRand suite - tested to one terabyte. It generates Dwords, Singles [0,1), Singles [-1,1], Doubles [0,1), Doubles [-1,1], Range and Gauss (Normal distribution). Being cryptographic no seeding is required.

      I see that you have been a member of FreeBASIC since 2007 so you probably know a lot more about the language than I do. That and "I'm really getting interested in this random number generation stuff - [grin])" makes you an ideal candidate to Convert CryptoRndII to PowerBASIC. I am a bit 'strapped' for time to do it.

      CryptoRndII is in the 'A fast CPRNG' thread beginning at this post with the source code at the end of the post. The prior posts are about CryptoRnd, CryptoRndII's predesessor, based upon CryptoRndBufferCNG.

      Comment


      • #4
        Hi David,

        I appreciate the links. Although one of the reasons that I was looking at PCG was that there was no ASM involved. The last assembler I worked with was IBM 360 back in 1982. And the last C coding I did was back in 1984. Things have changed since then - huh.

        But - newsflash! ... counting_pine just put up a port of the basic version of PCG to FB on that other forum.

        http://www.freebasic.net/forum/viewt...=25718#p232846

        Now I can port it to PowerBASIC and play where I'm comfortable.

        NOTE:
        The ported RND() function [pcg32_random_r()] is 9 lines of code in FB.
        He didn't include the [seed] function in his port - I'll tackle it first.

        By the way, counting_pine's comments:

        From just looking at the minimal code and reading a couple of pages, I'm amazed.

        The principle is ridiculously simple. I would describe it like this:
        1. use a simple, well-understood pseudo-random generator (e.g. linear congruential generator as used in the minimal code)
        2. return a secure hash of the output
        3. there is no step 3!

        The periodicity of the random numbers is the same as that of the simple PRNG, but the secure hashing means that without a massive lookup table you can't find out the internal state based on its output.
        (In the example code's case, it returns 32 bits of output for a 64-bit state anyway, so it's definitely a one-way hash.)

        I don't really understand the hashing method, although it's clearly pretty simple. But from the Party Tricks page it evidently has the property of being able to calculate the input needed to contrive any output. This itself isn't particularly useful when generating random numbers, but it means that all outputs are possible, which is a good property for a random number generator.
        And, from her whitepaper, the simple description of the Ms. O'Neill's algorithm:

        The key idea is to pass the output of a fast well-understood “medium quality” random number generator to an efficient permutation function (a.k.a. hash function), built from composable primitives, that enhances the quality of the output. The algorithm can be applied at variety of bit sizes, including 64 and 128 bits (which provide 32- and 64-bit outputs, with periods of 2 64 and 2 128 ). Optionally, we can provide each b -bit generator with a b - 1 bit stream-selection constant, thereby providing 2 b - 1 random streams, which are full period and entirely distinct. An extension adds up to 2 b -dimensional equidistribution for a total period of 2 b 2 b . The construction of the permutation function and the period-extension technique are both founded on the idea of permutation functions on tuples.

        In its standard variants, b -bit generators use a 2 b /2 -to-1 function to produce b /2 bits of output. These functions can be designed to make it difficult for an adversary to discover the generator’s internal state by examining its output, and thus make it challenging to predict. This property, coupled with the ability to easily switch between random streams, provides many of the benefits provided by cryptographically secure generators without the overheads usually associated with those generators.

        Comment


        • #5
          Hi Bruce

          Have looked at counting_pines's post. l look forward to your full version.

          Comment


          • #6
            In order to get a thread-safe RND function, I have attempted a PB ASM port of the minimal C implementation (https://www.pcg-random.org/download.html).
            It appears to work correctly but I hardly tested it at all, so THERE MAY BE BUGS.

            Speed appears to be comparable to the PCG32R function in the DLL kindly provided by David (https://forum.powerbasic.com/forum/u...921-pcg32-prng).
            David's DLL provides a lot more capability and functionality, but does not seem to be thread-safe.
            Also, to have some PB-usable source code available (even though asm) may allow some further customization for a particular application (example in the next post following this one).

            Code:
            'PBCC 6
            
            #COMPILE EXE
            #DIM ALL
            
            
            TYPE PCG32varsType
                qState AS QUAD
                qSequence AS QUAD
                dwLowerBound AS DWORD
                dwRange AS DWORD
                qOldState AS QUAD
                PRNresult AS LONG
            END TYPE
            
            
            
            FASTPROC GetPCG32_RangedRnd(BYVAL ptrPCG32vars AS LONG)
                'copy qOldState = qState
                ! mov ebx, ptrPCG32vars
                ! mov eax, [ebx]
                ! mov [ebx + 24], eax
                ! mov edx, [ebx + 4]
                ! mov [ebx + 28], edx
                'edx:eax = qState (& ebx = ptrPCG32vars)
            
            
                'Advance internal State 1): qState = qOldState * 636136223846793005&& (discard the upper quad).
                ' (64b * 64b = 128b can be decomposed into three 32b * 32b & two 32b adds since the upper 64b product is discarded.)
                ! mov eax, &H73257F2D
                ! mul edx ;(multiply by eax & put result in edx:eax.)
                ! mov ecx, eax
            
                ! mov edx, [ebx]
                ! mov eax, &H08D40286
                ! mul edx ;(multiply by eax & put result in edx:eax.)
                ! add ecx, eax
            
                ! mov edx, [ebx]
                ! mov eax, &H73257F2D
                ! mul edx ;(multiply by eax & put result in edx:eax.) keep eax & carry edx.
                ! add edx, ecx ;accumulate carry.
                'edx:eax = new qState (& ebx = ptrPCG32vars)
            
            
                'Advance internal State 2): qState = qState + qSequence (discard overflow).
                ! add eax, [ebx + 8]
                ! adc edx, [ebx + 12]
                ! mov [ebx], eax
                ! mov [ebx + 4], edx
                'edx:eax = (final) new qState (& ebx = ptrPCG32vars)
            
            
                'qXorShifted = qOldState : SHIFT RIGHT qXorShifted, 18
                ! mov edx, [ebx + 28]
                ! mov eax, [ebx + 24]
                ! shrd eax, edx, 18
                ! shr edx, 18
                'edx:eax = qXorShifted (& ebx = ptrPCG32vars)
            
            
                'qXorShifted = qXorShifted XOR qOldState
                ! xor edx, [ebx + 28]
                ! xor eax, [ebx + 24]
            
            
                'SHIFT RIGHT qXorShifted, 27 '(Now XorShifted is effectively a DWORD.)
                ! shrd eax, edx, 27 ;(edx is now discarded.)
                'eax = XorShifted_dwLo (& ebx = ptrPCG32vars)
            
            
                'qNumBitsToRotate = qOldState : SHIFT RIGHT qNumBitsToRotate, 59 '(Now NumBitsToRotate is limited to 5 bits and so can be a BYTE.)
                'ROTATE RIGHT XorShifted_dwLo, NumBitsToRotateDW 'Rnd32 = XorShifted_dwLo
                ! mov ecx, [ebx + 28]
                ! shr cl, 59 - 56
                ! ror eax, cl
                'eax = XorShifted_dwLo, which is the full-32b PRN result. (Also ebx = ptrPCG32vars)
            
            
                'RangedRnd = CLNG(CSNG(@ptrXorShiftedDW) * (CSNG(dwRange - 1) / 4294967296!)) + dwLowerBound
                ! mov edx, [ebx + 20] ;dwRange
                ! mul edx ;(multiply by eax & put result in edx:eax.)
                ! mov ecx, [ebx + 16] ;dwLowerBound
                ! add edx, ecx
                'edx = PRNresult (& ebx = ptrPCG32vars)
            
                ! mov [ebx + 32], edx ;copy edx to PRNresult.
            END FASTPROC
            'GetPCG32_RangedRnd()
            
            
            
            %NumResamplingTrials = 20000
            
            FUNCTION PBMAIN () AS LONG
                'Initialize PCG32.
                LOCAL PCG32vars AS PCG32varsType : LOCAL ptrPCG32vars AS PCG32varsType POINTER : ptrPCG32vars = VARPTR(PCG32vars)
                PCG32vars.qState = &H9 : PCG32vars.qSequence = &HBB : PCG32vars.dwLowerBound = 0 : PCG32vars.dwRange = 301 'dwRange = (UpperBound - LowerBound + 1)
                PCG32vars.qSequence = PCG32vars.qSequence OR 1 '"OR 1" to guarantee qSequence is odd (and/or non-zero).
            
                'Use PCG32.
                LOCAL TrialNum AS LONG
                FOR TrialNum = 1 TO %NumResamplingTrials
                    CALL GetPCG32_RangedRnd(ptrPCG32vars) : STDOUT STR$(PCG32vars.PRNresult)
                NEXT TrialNum
            
                WAITKEY$
            END FUNCTION  'PBMAIN

            The minimal C implementation didn't look too challenging but this port took more time than I expected. In hindsight I should have instead just asked David to expose the State variable in his DLL function so that it could be used in a thread-safe manner.

            ie. I wanted a function in this form:
            FUNCTION GetPCG32_RangedRnd(BYVAL dwLowerBound AS DWORD, BYVAL dwRange AS DWORD, BYVAL qSequence AS QUAD, qState AS QUAD) AS LONG
            where qState is created and intially set with a seed in the calling procedure (which may be threaded), instead of being a static or global.

            (Because I wanted to use a FASTPROC for speed savings, and it only accept two BYVAL LONGs as arguments, I created a TYPE to hold the relevant variables as shown above. Which includes the working variable qOldState because FASTPROC does not allow LOCALs.)

            Obviously, this would be faster with use of the 64b registers & functions. A PB implementation of that using "db" byte sequences to directly instantiate the instruction opcodes is outside my capacity at the moment. But only 5 instructions would actually be needed: movq, mulq, addq, shrq, xorq (and those last 2 would only be used once each so shrq & xorq could likely be dropped from that list with little impact).
            For comparison, a couple of 64b disassemblies for the minimal c code are here: https://www.pcg-random.org/code-size.html

            Comment


            • #7
              One application of a bounded PRN (denoted RangedRnd in the code above) might be for Monte Carlo sampling.
              In this case there is some population of values and RangedRnd is desired as the index for that matrix to get a random sample from it.
              So for MC use a pointer to PopValues() can be provided to the function and the actual random sample returned instead of the RangedRnd.
              It requires only two more asm statements and provides another small speed advantage when measured at the calling level. The changes below from the previous are marked with "<<<".


              Code:
              'PBCC 6
              
              #COMPILE EXE
              #DIM ALL
              
              
              TYPE PCG32varsType
                  qState AS QUAD
                  qSequence AS QUAD
                  dwLowerBound AS DWORD
                  dwRange AS DWORD
                  qOldState AS QUAD
                  PRNresult AS LONG
                  ptrPopMatrix AS LONG POINTER  '<<< ADDED
              END TYPE
              
              
              
              FASTPROC GetPCG32_RndSample(BYVAL ptrPCG32vars AS LONG)
                  'copy qOldState = qState
                  ! mov ebx, ptrPCG32vars
                  ! mov eax, [ebx]
                  ! mov [ebx + 24], eax
                  ! mov edx, [ebx + 4]
                  ! mov [ebx + 28], edx
                  'edx:eax = qState (& ebx = ptrPCG32vars)
              
              
                  'Advance internal State 1): qState = qOldState * 636136223846793005&& (discard the upper quad).
                  ' (64b * 64b = 128b can be decomposed into three 32b * 32b & two 32b adds since the upper 64b product is discarded.)
                  ! mov eax, &H73257F2D
                  ! mul edx ;(multiply by eax & put result in edx:eax.)
                  ! mov ecx, eax
              
                  ! mov edx, [ebx]
                  ! mov eax, &H08D40286
                  ! mul edx ;(multiply by eax & put result in edx:eax.)
                  ! add ecx, eax
              
                  ! mov edx, [ebx]
                  ! mov eax, &H73257F2D
                  ! mul edx ;(multiply by eax & put result in edx:eax.) keep eax & carry edx.
                  ! add edx, ecx ;accumulate carry.
                  'edx:eax = new qState (& ebx = ptrPCG32vars)
              
              
                  'Advance internal State 2): qState = qState + qSequence (discard overflow).
                  ! add eax, [ebx + 8]
                  ! adc edx, [ebx + 12]
                  ! mov [ebx], eax
                  ! mov [ebx + 4], edx
                  'edx:eax = (final) new qState (& ebx = ptrPCG32vars)
              
              
                  'qXorShifted = qOldState : SHIFT RIGHT qXorShifted, 18
                  ! mov edx, [ebx + 28]
                  ! mov eax, [ebx + 24]
                  ! shrd eax, edx, 18
                  ! shr edx, 18
                  'edx:eax = qXorShifted (& ebx = ptrPCG32vars)
              
              
                  'qXorShifted = qXorShifted XOR qOldState
                  ! xor edx, [ebx + 28]
                  ! xor eax, [ebx + 24]
              
              
                  'SHIFT RIGHT qXorShifted, 27 '(Now XorShifted is effectively a DWORD.)
                  ! shrd eax, edx, 27 ;(edx is now discarded.)
                  'eax = XorShifted_dwLo (& ebx = ptrPCG32vars)
              
              
                  'qNumBitsToRotate = qOldState : SHIFT RIGHT qNumBitsToRotate, 59 '(Now NumBitsToRotate is limited to 5 bits and so can be a BYTE.)
                  'ROTATE RIGHT XorShifted_dwLo, NumBitsToRotateDW 'Rnd32 = XorShifted_dwLo
                  ! mov ecx, [ebx + 28]
                  ! shr cl, 59 - 56
                  ! ror eax, cl
                  'eax = XorShifted_dwLo, which is the full-32b PRN result. (Also ebx = ptrPCG32vars)
              
              
                  'RangedRnd = CLNG(CSNG(@ptrXorShiftedDW) * (CSNG(dwRange - 1) / 4294967296!)) + dwLowerBound
                  ! mov edx, [ebx + 20] ;dwRange
                  ! mul edx ;(multiply by eax & put result in edx:eax.)
                  ! mov ecx, [ebx + 16] ;dwLowerBound
                  ! add edx, ecx
                  'edx = PRNresult (& ebx = ptrPCG32vars)
              
                  ! mov ecx, [ebx + 36] ;load ptrPopMatrix into ebx.                                                                       ;<<< ADDED
                  ! mov edx, [ecx + 4*edx] ;(can include up to 2 of the 32b registers with one pre-multipled by 2,4,8 + 32b constant.)     ;<<< ADDED
              
                  ! mov [ebx + 32], edx ;copy edx to PRNresult.  <<<Now a random selection from the Pop matrix.
              END FASTPROC
              'GetPCG32_RndSample()
              
              
              
              
              %NumResamplingTrials = 20000
              
              FUNCTION PBMAIN () AS LONG
                  LOCAL PopValues() AS LONG : DIM PopValues(5) AS LONG
                  PopValues(1) = 45 : PopValues(2) = 87 : PopValues(3) = 123 : PopValues(4) = 120 : PopValues(5) = 70 'Measurements of some kind, such as rainfall inches x10.
              
                  'Initialize PCG32 of PopValues().
                  LOCAL PCG32vars AS PCG32varsType : LOCAL ptrPCG32vars AS PCG32varsType POINTER : ptrPCG32vars = VARPTR(PCG32vars) : PCG32vars.ptrPopMatrix = VARPTR(PopValues(0))  '<<< ADDED
                  PCG32vars.qState = &H9 : PCG32vars.qSequence = &HBB : PCG32vars.dwLowerBound = 1 : PCG32vars.dwRange = UBOUND(PopValues(1)) - PCG32vars.dwLowerBound + 1
                  PCG32vars.qSequence = PCG32vars.qSequence OR 1 '"OR 1" to guarantee qSequence is odd (and/or non-zero).
              
                  'Use PCG32 for MC sampling of PopValues().
                  LOCAL TrialNum, SampleNum, SampleSize AS LONG : SampleSize = 3
                  FOR TrialNum = 1 TO %NumResamplingTrials
                      FOR SampleNum = SampleSize TO 1 STEP -1 : CALL GetPCG32_RndSample(ptrPCG32vars) : STDOUT STR$(PCG32vars.PRNresult); : NEXT SampleNum
                      STDOUT ""
                  NEXT TrialNum
              
                  WAITKEY$
              END FUNCTION  'PBMAIN
              Lastly,note that for bounded values (RangedRnd) there is some small bias introduced to the Pseudo-Random distribution due to rounding. This can be removed by discards; I estimate the time hit to be between 30%-40% for that feature.

              Comment


              • #8
                Originally posted by Daniel
                David's DLL provides a lot more capability and functionality, but does not seem to be thread-safe.
                It isn't thread safe. I have looked at that but haven't found a way to maintain the high level of throughputs. The FreeBASIC version is thread safe, but it uses a UDT which has procedure members, overloaded functions, and 64-bit unsigned integers all over the place and, whence, a dll.

                You could use CryptoRndII which does not use a state vector. Its range function, (Byval One As Long, Byval Two As Long ) As Long, is coming in at 272MHz on my machine - so is no slouch - more than seven times faster than Rnd(a,b). The dll, recently revised, is only 20KB.

                Comment


                • #9
                  Thanks David. However, I infer that CryptoRndII is not deterministic and that is a requirement for my application. Plus I want the "leader board" speed of PCG32.
                  I appreciate our sharing your work on not only the coding of these functions but also the benchmarking, which is very helpful.

                  Comment


                  • #10
                    Originally posted by Daniel
                    I infer that CryptoRndII is not deterministic
                    No, it isn't. CryptoRndII started life at PB on 1 Aug 2017. A follow-up, CryptoRndIII, started life on 2 may 2019.

                    CryptoRndIII has a few more brain cells than CryptoRndII. Have a look at post #14 and #15

                    Two drawbacks: Firstly, we need to know, a priori, the maximum number of random numbers we want and to ensure that we do not exceed that so that switching to the second buffer will not occur. Secondly BCryptGenRandom has been replaced with Intel's RdRand.

                    Having said that, CryptoRndIII is my default generator for PB. The dual buffer usage is fast enough for my purposes; being very close to PB's RND. However, that is not fast enough for you, so the single buffer mode is the one to go for.

                    CryptoRndIII has two range functions: CryptoR(a,b) ( Output Long discrete range a to b ); CryptoFR(a,b) ( Output Double floating-point range [a,b) ). CryptoR(a,b) in single buffer mode has a throughput equal to PCG32R. Once buffers are filled random number requests are satisfied from RAM and no algorithm can beat that. This shows how fast PCG32 is. The full list of procedures is at post #17.

                    I got so little feedback from CryptoRndII I did not bother to go back and add the extra functionality that CryptoRndIII has.

                    Of course, if you do not have Intel RdRand then CryptoRndIII is a non-starter. Intel introduced it in April 2012 and AMD got their finger out in June 2015.

                    Comment


                    • #11
                      ***************** BUG FIX **************
                      I just read the statement from Steve Hutchesson that FASTPROC does not automatically preserve the volatile registers EBX, ESI, EDI, so "! push ebx" & "! pop ebx" need to be added to both functions posted above.(https://forum.powerbasic.com/forum/u...631#post799631)
                      Eventually, after time for any other corrections, I will put the updated code in the source code forum.

                      David, the double buffer in CryptoRndIII sounds like a good technique. I considered something similar when I first discovered (the hard way instead of from the manual) that RND was not thread-safe: creating a global DIM of randoms using PB RND at the start of my program and then using it in all threads. But it seemed too inefficient for what I was trying to do.

                      Further bugs aside, the 32b asm port of PCG32 seems to working well for my app so far.

                      Comment


                      • #12
                        Discussion for a faster and perhaps acceptable alternative to PCG32:
                        https://forum.powerbasic.com/forum/u...cement-for-rnd

                        Comment

                        Working...
                        X