Announcement

Collapse
No announcement yet.

Fast multiplication of complex number arrays

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #21
    one thing to be aware of is that the C++ compiler compiles for the linux platform, so the registers used are different than in Windows
    as the salmon fish is compelled to go back to it's birthplace to spawn, so comes a day when man is compelled to go back to it's source.. GOD

    Comment


    • #22
      one thing to be aware of is that the C++ compiler compiles for the linux platform, so the registers used are different than in Windows
      The C++ compiler you are using may be for Linux; there are also Windows and Mac C++ compilers.

      ????? Register names other than the Intel names???
      Dale

      Comment


      • #23
        what I meant was that the register used for parameter are different, also the non-volatile registers differ
        as the salmon fish is compelled to go back to it's birthplace to spawn, so comes a day when man is compelled to go back to it's source.. GOD

        Comment


        • #24
          the x86 msvc v19.0 (WINE) compiler with options /arch:SSE2 /O2 does a decent job, an excellent feature is that if you hover the mouse pointer over an instruction it will explain what that instruction does
          as the salmon fish is compelled to go back to it's birthplace to spawn, so comes a day when man is compelled to go back to it's source.. GOD

          Comment


          • #25
            Here is an SSE version.
            There are issues with it so it might not be suitable.

            It runs about twice as fast as the ASM/FPU version, 6x as fast as the original.
            BUT..
            1) the FPU actually calculates in EXTENDED precision then rounds results. SSE calculates in SINGLE precision.
            This means there are small rounding differences so results are not identical.

            2) SSE requires that the data in the arrays is aligned on a 16 byte memory boundary and this is hit and miss with the compiler.
            Sometimes it aligns properly and sometimes it's 8 bytes out.

            3) I've changed the arrays to be zero based.


            If you can manipulate your arrays to always be zero based, aligned on a 16 byte boundary and you can tolerate the rounding differences then it might be worth using.

            One way to handle the array alignment if you can't fix it is to check the alignment first and if it's out then do the first and last values in BASIC and then the aligned bulk of the array with SSE then add the results at the end.

            Code:
            'PBCC6 program
            
            TYPE Complex
              r AS SINGLE
              i AS SINGLE
            END TYPE
            
            
            %ArraySize = 10000000
            %SizeToProcess = 10000000
            
            FUNCTION PBMAIN
            
            LOCAL r AS LONG
            LOCAL k AS QUAD
            LOCAL Result AS Complex
            
            LOCAL junk() AS BYTE
            DIM junk(700)
            
            LOCAL A1(),A2() AS Complex
            PRINT
            
            
            DIM A1(%ArraySize)
            DIM A2(%ArraySize)
            
            
            PRINT HEX$(VARPTR(A1(0)))
            PRINT HEX$(VARPTR(A2(0)))
            
            'fill arrays with test data
            RANDOMIZE(1)
            
            FOR r = 0 TO %ArraySize
                a1(r).r = (RND -0.5)*1000
                a1(r).i = (RND -0.5)*1000
                a2(r).r = (RND -0.5)*1000
                a2(r).i = (RND -0.5)*1000
            
            NEXT
            
            
            
            PRINT
            PRINT "BASIC-#######################################################"
            TIX k
            CMultiply(%SizeToProcess, a1(), a2(), Result)
            TIX END K
            
            PRINT "Result original code= ";Result.r, Result.i
            PRINT "tix=";FORMAT$(k,"###,###,###,###,###")
            PRINT
            
            
            
            PRINT "Asm-###############################################"
            
            TIX k
            CMultiplyASM(%SizeToProcess, a1(), a2(), Result)
            TIX END k
            
            PRINT "     ResultASM code = ";Result.r, Result.i
            PRINT "tix=";FORMAT$(k,"###,###,###,###,###")
            PRINT
            
            
            PRINT "SSE-#######################################################################"
            
            TIX k
            CMultiplySSE3(%SizeToProcess, a1(), a2(), Result)
            TIX END k
            PRINT
            
            
            PRINT "     ResultSSE code = ";Result.r, Result.i
            PRINT "tix=";FORMAT$(k,"###,###,###,###,###")
            
            WAITKEY$
            
            END FUNCTION
            
            
            
            
            
            
            
            SUB CMultiply (BYVAL n AS LONG, C1() AS Complex, C2() AS Complex, BYREF CAvg AS Complex)
            REGISTER x AS LONG, SumR AS EXT, SumI AS EXT
              SumR = 0
              SumI = 0
              FOR x = 0 TO n-1
                SumR = SumR + C1(x).r * C2(x).r - C1(x).i * C2(x).i
                SumI = SumI + C1(x).r * C2(x).i + C1(x).i * C2(x).r
              NEXT x
              CAvg.r = SumR / n
              CAvg.i = SumI / n
            END SUB
            
            
            
            
            
            
            
            SUB CMultiplyASM (BYVAL n AS LONG, C1() AS Complex, C2() AS Complex, BYREF CAvg AS Complex)
            #REGISTER NONE
            LOCAL SumR AS EXT, SumI AS EXT
            LOCAL pC1, pC2, Limit AS LONG
            
              SumR = 0
              SumI = 0
            
            
              Limit = n -1
            
              pC1 = VARPTR(C1(0))
              pC2 = VARPTR(C2(0))
            
            
              !mov eax,pC1
              !mov edx,pC2
                                              'ST0        ST1        ST2        ST3        ST4        ST5        ST6        ST7
              !fld SumI                       'SumI
              !fld SumR                       'SumR       SumI
            
              !mov ecx,0
            #ALIGN 16
              lp1:
              !fld dword ptr [eax+ecx*8]      'C1R        SumR       SumI
              !fld dword ptr [eax+ecx*8+4]    'C1I        C1R        SumR       SumI
              !fld dword ptr [edx+ecx*8]      'C2R        C1I        C1R        SumR       SumI
              !fld st(0)                      'C2R        C2R        C1I        C1R        SumR       SumI
              !fmul st(0),st(3)               'C2R*C1R    C2R        C1I        C1R        SumR       SumI
              !faddp st(4),st(0)              'C2R        C1I        C1R        SumR       SumI
              !fmul st(0),st(1)               'C2R*C1I    C1I        C1R        SumR       SumI
              !faddp st(4),st(0)              'C1I        C1R        SumR       SumI
            
              !fld dword ptr [edx+ecx*8+4]    'C2I        C1I        C1R        SumR       SumI
              !fld st(0)                      'C2I        C2I        C1I        C1R        SumR       SumI
              !fmul st(0),st(2)               'C2I*C1I    C2I        C1I        C1R        SumR       SumI
              !fsubp st(4),st(0)              'C2I        C1I        C1R        SumR       SumI
              !fmul st(0),st(2)               'C2I*C1R    C1I        C1R        SumR       SumI
              !faddp st(4),st(0)              'C1I        C1R        SumR       SumI
              !fcompp                         'SumR       SumI
            
              !inc ecx
              !cmp ecx,limit
              !jle short lp1
            
              !fstp SumR                      'SumI
              !fstp SumI                      'empty
            
            
              CAvg.r = SumR / n
              CAvg.i = SumI / n
            
            END SUB
            
            
            
            
            
            SUB CMultiplySSE3(BYVAL n AS LONG, C1() AS Complex, C2() AS Complex, BYREF CAvg AS Complex)
            #REGISTER NONE
            LOCAL pC1, pC2, pSums, Limit AS LONG
            
            'local mxcsr as long
            
            LOCAL Sums() AS Complex
            DIM Sums(1)
            
              Limit = (n-1)\2
            
              pC1 = VARPTR(C1(0))
              pC2 = VARPTR(C2(0))
              pSums = VARPTR(Sums(0))
            
              !mov eax,pC1
              !mov edx,pC2
            
              !mov ecx,0
              !mov esi,0
            
              !pxor xmm3, xmm3              'clear xmm3 and use to accumulate the results
            
            #ALIGN 16
              lp1:
            
              !prefetchnta [eax+esi+1024]  'prefetxh the next block of memory so there's no delay when it's needed
              !movsldup xmm0, [eax+esi]   ; load real parts into the destination,   ; a1, a1, a0, a0
              !movaps xmm1, [edx+esi]     ; load the 2nd pair of complex values,    ; i.e. d1, c1, d0, c0
              !mulps xmm0, xmm1           ; temporary results, a1d1, a1c1, a0d0,    ; a0c0
              !shufps xmm1, xmm1, &hb1    ; reorder the real and imaginary          ; parts, c1, d1, c0, d0
              !movshdup xmm2, [eax+esi]   ; load the imaginary parts into the       ; destination, b1, b1, b0, b0
              !mulps xmm2, xmm1           ; temporary results, b1c1, b1d1, b0c0,    ; b0d0
              !addsubps xmm0, xmm2        ; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0,       ; a0c0-b0d0
            
              !addps xmm3,xmm0            'sum the results
            
            
              !add esi,16
            
              !inc ecx
              !cmp ecx,limit
              !jle short lp1
            
              !mov eax,pSums
              !movups [eax],xmm3
            
            
              CAvg.r = (Sums(0).r + Sums(1).r) / n
              CAvg.i = (Sums(0).i + Sums(1).i) / n
            
            
            END SUB

            Comment


            • #26
              Amazing! I didn't know that SSE could provide such a substantial boost in speed.

              "If you can manipulate your arrays to always be zero based, aligned on a 16 byte boundary and you can tolerate the rounding differences then it might be worth using."

              All of my arrays can be zero based and all of the computations provide SINGLE precision results. SSE is foreign to me but is the code after #ALIGN 16 intended to align the arrays on a 16-byte boundary? I have some other questions I'll send in a private message. Thanks.

              Comment


              • #27
                The #ALIGN 16 is there to align the code not the data.
                In short programming loops it's quicker to have branch targets on a 16 byte/cache line boundary.
                This stops the first instruction of the loop from straddling 2 cache lines which would slow execution.

                I thought the SSE code would be much faster but it's being limited by memory speed. It can't get the data from memory fast enough to keep the CPU busy.
                A computer with faster memory would see a bigger improvement in speed.

                In that code I align the arrays by trial and error!
                When I DIM the arrays to 10000000 in that position in the code they happen to be aligned correctly, but you shouldn't rely on this as it depends on where in the program they are dimmed and how big they are.

                One way to align arrays is to set aside a block of memory big enough to hold the array then, within that block, use DIM MyArray() AT SomeAlignedAddress.. .
                That way you can make sure the data is aligned at the chosen address.
                Or, do as I first suggested and calculate the unaligned values at either end of the data in BASIC and do the aligned bulk of the array with SSE.

                Comment


                • #28
                  Looks like it can be also used to speed up Neural Networks.
                  See Delta-Learning Rule:
                  https://www.youtube.com/watch?v=Ilg3gGewQ5U
                  --Theo Gottwald
                  ------------------------------------------------
                  76706 Dettenheim * Germany * [email protected]
                  ------------------------------------------------
                  Joses Forum * Theo's Link Site * IT-Berater.org

                  Comment

                  Working...
                  X