The following is from "Further scramblings of Marsaglia's xorshift generators", published in April 2014. If you click on that you'll download a PDF. It is the last entry in Wikipedia's Xorshift references.
Not being a 'C' guy, I found 's[1] = ...' a bit intimidating, but found the following interpretation here a little more readable.
Checking the second code line by line, it does the same job as the first code.
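The line-by-line check can also be done mechanically. What follows is only a sketch: the function names next_v1, next_v2 and versions_agree are my own, and the state is passed in as a parameter rather than held in a global so that the two versions can run side by side.

```c
#include <stdint.h>

/* First listing: the form given in the paper (state passed in explicitly). */
static uint64_t next_v1(uint64_t s[2]) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23;                           /* a */
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);  /* b, c */
    return s[1] + s0;
}

/* Second listing: the step-by-step interpretation. */
static uint64_t next_v2(uint64_t s[2]) {
    uint64_t a = s[0];
    uint64_t b = s[1];
    s[0] = b;
    a ^= a << 23;
    a ^= a >> 18;
    a ^= b;
    a ^= b >> 5;
    s[1] = a;
    return a + b;
}

/* Returns 1 if both versions give the same output and the same state
   for `rounds` consecutive steps from the given seed, otherwise 0. */
static int versions_agree(uint64_t seed0, uint64_t seed1, int rounds) {
    uint64_t p[2] = { seed0, seed1 };
    uint64_t q[2] = { seed0, seed1 };
    for (int i = 0; i < rounds; i++) {
        if (next_v1(p) != next_v2(q)) return 0;
        if (p[0] != q[0] || p[1] != q[1]) return 0;
    }
    return 1;
}
```

Calling versions_agree with any non-zero seed and a few thousand rounds should return 1; it is a sanity check, not a proof.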
PB, being a 32-bit compiler, has no idea what uint64_t is. However, it does know what the MMX registers are. These are the 64-bit registers which overlay the FPU registers in much the same way that the 16-bit CPU registers overlay the 32-bit CPU registers. PB may know about them, but I knew nothing of them, so I had a spot of reading to do if I wanted a PB version of that code.
I have some working code, based upon the second code, and a 20MB dump tested as follows.
On the face of it, it seems that the working code is doing as intended.
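For anyone wanting to repeat the test from C, a dump could be produced along these lines. This is a sketch under assumptions: the seed is an arbitrary non-zero value picked for illustration, and dump_state, dump_next and dump_bytes are names I made up; it is not the PB code used for the figures quoted.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical non-zero seed, chosen only for illustration. */
static uint64_t dump_state[2] = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL };

/* xorshift128+ step, as in the second C listing. */
static uint64_t dump_next(void) {
    uint64_t a = dump_state[0];
    uint64_t b = dump_state[1];
    dump_state[0] = b;
    a ^= a << 23;
    a ^= a >> 18;
    a ^= b;
    a ^= b >> 5;
    dump_state[1] = a;
    return a + b;
}

/* Writes nbytes of generator output to `out`, eight bytes per step.
   Returns the number of bytes actually written. */
static size_t dump_bytes(FILE *out, size_t nbytes) {
    size_t written = 0;
    while (written < nbytes) {
        uint64_t x = dump_next();
        size_t remaining = nbytes - written;
        size_t chunk = remaining < sizeof x ? remaining : sizeof x;
        if (fwrite(&x, 1, chunk, out) != chunk) break;  /* stop on I/O error */
        written += chunk;
    }
    return written;
}
```

Opening a file in "wb" mode and calling dump_bytes(f, 20971520) gives a 20MB dump of the size tested here, which can then be fed to whatever randomness test program is to hand.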
I am still very much at the experimental stage and would appreciate one or more of you checking my 'engine' for any errors and suggesting improvements.
I use 'Dim sState(0 to 3) as Dword', with the 64-bit s[0] held in sState(0) and sState(1), and the 64-bit s[1] held in sState(2) and sState(3).
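In C terms, the layout amounts to something like the sketch below. It assumes a little-endian machine such as x86, where the lower-indexed dword is the low half of the 64-bit value; load_u64 and combine_u64 are illustrative names of mine.

```c
#include <stdint.h>
#include <string.h>

/* Reads the 64-bit value occupying dwords[index] and dwords[index + 1],
   i.e. the same eight bytes a 64-bit load from that address would fetch. */
static uint64_t load_u64(const uint32_t *dwords, int index) {
    uint64_t value;
    memcpy(&value, &dwords[index], sizeof value);
    return value;
}

/* Builds the 64-bit value arithmetically from its low and high dwords. */
static uint64_t combine_u64(uint32_t low, uint32_t high) {
    return ((uint64_t)high << 32) | low;
}
```

On a little-endian machine, load_u64(state, 0) equals combine_u64(state[0], state[1]) and load_u64(state, 2) equals combine_u64(state[2], state[3]), which is the s[0] and s[1] split described above.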
I then use
and here is my 'engine'.
I have other code to generate single precision values. The first request generates a 64 bit random number of which one dword is used. The next request uses the other dword. What we have then is a 'Big crunch' pass, a 'Little crunch' pass, a 'Big crunch' pass and so on.
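The big/little alternation could look like this in C. It is a sketch only: the seed is arbitrary, which dword is handed out first is my assumption, and the conversion to a single in [0, 1) using the top 24 bits is one common choice, not necessarily what my PB code does. All names are illustrative.

```c
#include <stdint.h>

/* Hypothetical non-zero seed, chosen only for illustration. */
static uint64_t rng_state[2] = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL };

/* xorshift128+ step, as in the second C listing. */
static uint64_t rng_next(void) {
    uint64_t a = rng_state[0];
    uint64_t b = rng_state[1];
    rng_state[0] = b;
    a ^= a << 23;
    a ^= a >> 18;
    a ^= b;
    a ^= b >> 5;
    rng_state[1] = a;
    return a + b;
}

static uint32_t spare_dword;   /* the half not yet handed out */
static int have_spare = 0;     /* 1 when spare_dword is valid */

/* 'Big crunch': generate 64 bits, return the low dword, keep the high one.
   'Little crunch': return the dword kept from the previous call. */
static uint32_t next_dword(void) {
    if (have_spare) {
        have_spare = 0;
        return spare_dword;
    }
    uint64_t x = rng_next();
    spare_dword = (uint32_t)(x >> 32);
    have_spare = 1;
    return (uint32_t)x;
}

/* Maps a dword to a single in [0, 1) using its top 24 bits (assumed method). */
static float next_single(void) {
    return (float)(next_dword() >> 8) * (1.0f / 16777216.0f);
}
```

Every second call to next_single costs only a copy and a conversion, which is where the speed gain for singles comes from.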
Is this worth the effort? It looks like it. My CMWC256 is 1.77 times faster than RND. Xorshift128+ is coming in at 2.85 times faster than RND.
This is what my machine knocks out.
RND: 83 million singles per second.
CMWC256: 147 million singles per second.
Xorshift128+: 237 million singles per second.
In addition, xorshift128+ knocks out double precision values a little more than twice as fast as RND knocks out singles.
I haven't tested a long range yet but expect that to go through the roof.
I'd like to get that 237 up to 250 so that I can say one billion (decimal definition of giga) singles in four seconds.
Some time this year Sebastiano Vigna introduced xoroshiro128+, which is even faster, but since the ink isn't dry on that yet I'd rather leave it for a while. Math.random in the JavaScript engines of Chrome, Firefox and Safari is based on xorshift128+.
So, gentlemen, knock the stuffing out of my 'engine'. It is the first MMX code that I have written so, assuming no cardinal errors, it may lend itself to some improvement.
I have posted in the Programming forum, as opposed to the Source Code Library, because I don't want anyone to rely on it.
My ultimate aim is to produce an SLL along the lines of CMWC256. It will not be intended to replace CMWC256, as Xorshift128+ has a period of 2^128-1 compared with CMWC256's period of 2^8208. 2^128-1 is not small: 2^128 = (2^32)^4 = RND^4.
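To put a number on "not small", here is a rough calculation. It assumes the 237 million singles per second quoted above and two singles per 64-bit step, so about 1.185 x 10^8 steps per second:

```latex
\[
2^{128} = \left(2^{32}\right)^{4} \approx 3.4 \times 10^{38},
\qquad
\frac{2^{128}}{1.185 \times 10^{8}\ \text{steps/s}}
\approx 2.9 \times 10^{30}\ \text{s}
\approx 9 \times 10^{22}\ \text{years}.
\]
```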
Code:
#include <stdint.h>

uint64_t s[2];

uint64_t next(void) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23; // a
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return s[1] + s0;
}
Code:
uint64_t a = s[0];
uint64_t b = s[1];
s[0] = b;
a ^= a << 23;
a ^= a >> 18;
a ^= b;
a ^= b >> 5;
s[1] = a;
return a + b;
Code:
xorshift128+

Chi square distribution for 20971520 samples is 259.74, and randomly
would exceed this value 40.59 percent of the times.
Arithmetic mean value of data bytes is 127.5088 (127.5 = random).
Monte Carlo value for Pi is 3.141918482 (error 0.01 percent).
Serial correlation coefficient is 0.000092 (totally uncorrelated = 0.0).

Intel RdRand

Chi square distribution for 20971520 samples is 286.86, and randomly
would exceed this value 8.30 percent of the times.
Arithmetic mean value of data bytes is 127.5151 (127.5 = random).
Monte Carlo value for Pi is 3.141483606 (error 0.00 percent).
Serial correlation coefficient is 0.000111 (totally uncorrelated = 0.0).
Code:
BigS0Ptr = Varptr( sState(0) )
BigS1Ptr = Varptr( sState(2) )
Code:
!mov eax, BigS0Ptr
!movq mm6, [eax]     ' a = s[0]
!mov edx, BigS1Ptr
!movq mm7, [edx]     ' b = s[1]
!movq [eax], mm7     ' s[0] = b
!movq mm0, mm6       ' mm0 = a
!psllq mm0, 23       ' a << 23
!movq mm1, mm6       ' mm1 = a
!pxor mm1, mm0       ' mm1 = a ^= a << 23
!movq mm0, mm1       ' copy mm1
!psrlq mm1, 18       ' a >> 18
!pxor mm0, mm1       ' a ^= a >> 18
!pxor mm0, mm7       ' a ^= b
!movq mm6, mm7       ' copy mm7, i.e. b
!psrlq mm7, 5        ' b >> 5
!pxor mm0, mm7       ' a ^= b >> 5
!mov eax, BigS1Ptr
!movq [eax], mm0     ' s[1] = a
!movq2dq xmm0, mm0   ' a
!movq2dq xmm1, mm6   ' b
!emms
!paddq xmm0, xmm1    ' a + b; xmm0 is now our 64 bit random number