Announcement

Collapse
No announcement yet.

How fast can we compute the dot product?

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How fast can we compute the dot product?

    Let's assume that we've two arrays (vectors) of single numbers X and Y
    of the dimension n. The dot product or scalar product is defined as:
    <X,Y> = X(1)*Y(1) + X(2)*Y(2) + ... + X(n)*Y(n)

    Let's further assume that both vectors are stored in consecutive
    memory locations. A typical BASIC solution would be:
    Code:
    FUNCTION DotProduct(X() AS SINGLE, Y() AS SINGLE,n AS INTEGER) AS SINGLE
      DIM sum AS SINGLE                
      DIM i AS INTEGER                   
      sum = 0
      FOR i = 1 TO n
        sum = sum + X(i)*Y(i)
      NEXT i
      DotProduct = sum
    END FUNCTION
    This function should work with every PB compiler, including PB/DOS.
    Is that the fastest inner loop pattern?

    Philipp

  • #2
    Code:
    '*************************************************************************
    ' Purpose:
    '
    ' Function to compute the DOT (inner) product of two 2D vector
    ' components.
    '*************************************************************************
    FUNCTION vec2d_Dot ALIAS "vec2d_Dot" (BYVAL vI AS DOUBLE, _
                                          BYVAL vJ AS DOUBLE, _
                                          BYVAL uI AS DOUBLE, _
                                          BYVAL uJ AS DOUBLE) EXPORT AS DOUBLE
    
      #DEBUG PRINT "==========[" & FUNCNAME$ & "]=========="
    
      FUNCTION = (vI * uI) + (vJ * uJ)
    
    END FUNCTION
    This is what I use to calculate the dot product between two 2D vectors. Could be expanded to include arrays.
    Later...

    JR

    "When governments fear the people there is liberty. When people fear the government there is tyranny." - Thomas Jefferson

    Comment


    • #3
      I don't know what is your target platform, but I suspect using SSE instructions, which PBWIN supports since 9.0, could provide extreme boost as it performs 4 operations at once.
      Of course all of this at cost of requirement of modern CPU.

      The solution below should run on any CPU.

      What is different:
      - using REGISTER variables
      - LONG instead of INTEGER
      - arrays indexed from 0 are said to be faster, value of this element is discared to maintain compatibility with code assuming 1..n
      - array pointers

      It provided 30% more speed on single core 64bit Sempron running on 32 bit OS

      Code:
      #COMPILE EXE
      #DIM ALL
      
      DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD
      
      FUNCTION DotProduct_Original(X() AS SINGLE, Y() AS SINGLE,n AS LONG) AS SINGLE
        DIM sum AS SINGLE
        DIM i AS LONG
        sum = 0
        FOR i = 1 TO n
          sum = sum + X(i)*Y(i)
        NEXT i
        DotProduct_Original = sum
      END FUNCTION
      
      FUNCTION DotProduct_Tuned(X() AS SINGLE, Y() AS SINGLE,n AS LONG) AS SINGLE
        REGISTER i AS LONG
        DIM sum AS SINGLE
      
        DIM px AS SINGLE PTR
        DIM py AS SINGLE PTR
        
        px = VARPTR(x(0))
        py = VARPTR(y(0))
        
        FOR i = 1 TO n
          sum += @px[i]*@py[i]
        NEXT i
        DotProduct_Tuned = sum
      END FUNCTION
      
      %MAXDIM = 40000000
      FUNCTION PBMAIN () AS LONG
        REGISTER i AS LONG
        DIM X(%MAXDIM) AS SINGLE, Y(%MAXDIM) AS SINGLE
        DIM t1 AS LONG, t2 AS LONG
        DIM result AS SINGLE
        
        DIM TextResults AS STRING
      
      
        RANDOMIZE 1
        FOR i = 1 TO %MAXDIM
          X(i) = RND(1, 10)
          Y(i) = RND(1, 10)
        NEXT
      
        MSGBOX "Data generated"
      
      
        t1 = GetTickCount
          result = DotProduct_Original( X(), Y(), %MAXDIM)
        t2 = GetTickCount
        TextResults = "Original: Result="+FORMAT$(result, "0")+" Time="+STR$(t2-t1)+"ms"
        
        t1 = GetTickCount
          result = DotProduct_Tuned( X(), Y(), %MAXDIM)
        t2 = GetTickCount
        TextResults += $CRLF +"Tuned: Result="+FORMAT$(result, "0")+" Time="+STR$(t2-t1)+"ms"
        
        MSGBOX TextResults
      
      END FUNCTION
      Last edited by Petr Schreiber jr; 5 Jun 2009, 04:19 AM.
      [email protected]

      Comment


      • #4
        Try changing:
        Code:
        DIM sum AS SINGLE
        to
        Code:
        REGISTER sum AS EXT
        Paul.

        Comment


        • #5
          Philipp,

          no, it's not the fastest pattern. When it comes to performance, there's always room for improvement. Your procedure is a special case of the so called SAXPY function, a common operation in the BLAS (Basic Linear Algebra Subprograms).

          As Peter Schreiber jr. wrote, your target platform is an important question. With PB/Win or PB/CC you should use register variables. But not only the loop counter i, also the single value for sum should be a register value. Holding that critical variable inside the floating point unit will save a lot of stack operations and lead to better performance. Paul Dixon made that point. That's not possible with PB/DOS; that compiler don't "know" register values at the BASIC level. You could realize register variables with PB/DOS only in assembly language procedures.

          The code you've presented, will run with 3 cycles per vector component. You can improve the calculation time by stretching and partial unrolling the loop, and seperating the even and odd components in the sum. It breaks the dependency chain and the CPU has a better chance to schedule the instructions for pipelining and out of order execution. In a code snippet, that looks like:

          Code:
          s0, s1 (both single, may be register variables)
          FOR i = 1 TO n STEP 2
            s0 = s0 + X(i)*Y(i)
            s1 = s1 + X(i+1)*Y(i+1)
          NEXT i
          FUNCTION = s0 + s1
          This code should run with 2 cycles per vector component. Further unrolling the loop to 4 components doesn't change the run time on my Athlon X2. It seems that you can't do more in BASIC or any other high level language.

          In nearly every instance, the reason for resorting to assembly programming is to increase processing performance. But before you even open your processor's opcode manual, the first question to ask is simple: How fast does the code need to execute? Again, it is worth noting that assembly-level programming should be the last resort for increasing performance. Although the optimization benefits can be quite large, the overall flexibility of the source will be compromised because architecture-specific elements tend to decrease the future flexibility of the code. From this point forward, I assume that you are aware of this and mostly interested in as fast as possible.

          If so, you could reach 1 cycle per vector component with the simplest thinkable SSE2 pattern. If both arrays are not to long and fit into the L1 cache you could reach up to 0.5 cycle per component.

          Gunther
          Encoding Team

          Comment


          • #6
            Thank you for your replies.

            Petr:
            but I suspect using SSE instructions, which PBWIN supports since 9.0, could provide extreme boost as it performs 4 operations at once.
            Yes, I know that; do you have some experience with SSE?

            Gunther:
            From this point forward, I assume that you are aware of this and mostly interested in as fast as possible.

            If so, you could reach 1 cycle per vector component with the simplest thinkable SSE2 pattern. If both arrays are not to long and fit into the L1 cache you could reach up to 0.5 cycle per component.
            Yes, of course. But 0.5 cycle per component? Could you explain that or give an example?

            Regards
            Philipp

            Comment


            • #7
              Philipp,
              how big are the vectors that you wish to compute the dot product of and how many dot products do you need to compute?
              The size makes a difference to the best way to calculate them.
              This function should work with every PB compiler, including PB/DOS.
              If you use SSE instruction then it won't work with PB/DOS.

              Paul.

              Comment


              • #8
                Paul,

                If you use SSE instruction then it won't work with PB/DOS.
                It seems to me that's not right. Please check http://www.powerbasic.com/support/do.../assembler.htm or http://www.powerbasic.com/support/downloads/dos.htm. You'll find there SSE code for PB/DOS.

                Anke

                Comment


                • #9
                  Hi Philipp,

                  all I know about SSE I know from Charles Pegge, he is real guru on this topic.

                  Here is little quick example of multiplying 2 sets of 4 numbers.
                  Code:
                    TYPE t4Values
                       a AS SINGLE
                       b AS SINGLE
                       c AS SINGLE
                       d AS SINGLE
                    END TYPE
                  
                    SUB Multiply2Vectors(a AS t4Values, b AS t4Values, c AS t4Values)
                      LOCAL pa,pb,pc AS LONG
                      pa=VARPTR(a)
                      pb=VARPTR(b)
                      pc=VARPTR(c)
                      ! mov eax, pa
                      ! MOVUPS XMM0, [eax]        ' -- Load first set
                      ! mov eax, pb
                      ! MOVUPS XMM7, [eax]        ' -- Load second set
                      ! MULPS  XMM0, XMM7         ' -- Multiply
                      ! mov eax, pc
                      ! MOVUPS [eax], XMM0        ' -- Return to c
                    END SUB
                    
                    FUNCTION PBMAIN
                      LOCAL v1, v2, v3 AS t4Values
                      v1.a = 80
                      v1.b = 70
                      v1.c = 60
                      v1.d = 50
                  
                      v2.a = 10
                      v2.b = 20
                      v2.c = 30
                      v2.d = 40
                  
                      Multiply2Vectors(v1,v2,v3)
                  
                      MSGBOX USING$("# # # #", v3.a, v3.b, v3.c, v3.d )
                    END FUNCTION
                  Charles provides some consulting on famous Josés forum


                  Petr

                  Originally posted by Philipp Kaempfner View Post
                  Thank you for your replies.

                  Petr:
                  Yes, I know that; do you have some experience with SSE?

                  Gunther:
                  Yes, of course. But 0.5 cycle per component? Could you explain that or give an example?

                  Regards
                  Philipp
                  [email protected]

                  Comment


                  • #10
                    Anke,
                    PBDOS does not recognise opcodes beyond the 80386. It cannot assemble SSE opcodes.
                    PBDOS runs in a 16 bit environment which does not have the opcodes available so even if you were to hand assemble the SSE opcodes (which would be a complicated job!) they would not run in a DOS program as the CPU is in the wrong mode and will not recognise the opcodes.

                    There are dozens of programs on those links you provided. Can you provide a link to a single program which claims to use SSE instruction in PB/DOS and I'll have a look a it.

                    Paul.

                    Comment


                    • #11
                      Here's a program calculating the dot product of 2 vectors each of 4 elements. Make sure you CPU has SSE3 or it'll not work.

                      Code:
                      'PBCC5.01 program
                      #REGISTER NONE
                      #DIM ALL
                      
                      FUNCTION PBMAIN () AS LONG
                      
                      LOCAL a(),b(),c(),suma, sumb AS SINGLE
                      LOCAL aptr,bptr,cptr AS SINGLE PTR
                      LOCAL r AS LONG
                      
                      'DIM the vectors and get pointers to the start of each one
                      DIM a(3),b(3)
                      aptr=VARPTR(a(0))
                      bptr=VARPTR(b(0))
                      
                      a(0)=21: a(1)= 5: a(2)=-8: a(3)=4
                      b(0)=12: b(1)=-3: b(2)=-2: b(3)=6
                      
                      'do the dot product in BASIC
                      FOR r = 0 TO 3
                          suma=suma+a(r&)*b(r)
                      NEXT
                      PRINT "BASIC result";suma
                      
                      
                      'do the dot product in ASM using SSE3
                      !mov eax,aptr       'pointer to first vector
                      !mov ecx,bptr       'pointer to second vector
                      !movdqu xmm0,[eax]  'get first vector
                      !movdqu xmm1, [ecx] 'get second vector
                      !mulps xmm0,xmm1    'do all 4 SINGLE multiplies
                      !haddps xmm0,xmm0   'sum pairs of results
                      !haddps xmm0,xmm0   'sum the sum of pairs of results
                      !movd sumb,xmm0     'store the dot product
                      !emms               'empty MMX state
                      
                      PRINT "ASM result=";sumb
                      
                      WAITKEY$
                      
                      END FUNCTION

                      Comment


                      • #12
                        PBDOS does not recognise opcodes beyond the 80386. It cannot assemble SSE opcodes
                        So create an OBJ module with your assembler (or with DEBUG) and $LINK it ; or, load the opcodes into memory and CALL ABSOLUTE (which is CALL DWORD in PB/DOS).
                        Michael Mattias
                        Tal Systems Inc. (retired)
                        Racine WI USA
                        [email protected]
                        http://www.talsystems.com

                        Comment


                        • #13
                          So create an OBJ module with your assembler (or with DEBUG) and $LINK it ; or, load the opcodes into memory and CALL ABSOLUTE (which is CALL DWORD in PB/DOS).
                          Yes, Michael is right and Anke is right, too.

                          PBDOS runs in a 16 bit environment which does not have the opcodes available so even if you were to hand assemble the SSE opcodes (which would be a complicated job!) they would not run in a DOS program as the CPU is in the wrong mode and will not recognise the opcodes.
                          Paul, please check Vectorization with SIMD instructions under the links, which Anke provided. You'll find there SSE example code for PB/DOS.

                          SSE opcodes have nothing to do with the CPU mode. We've to seperate a few things:
                          • The first SIMD extension was MMX. It uses the same register space as the FPU, so one cannot use MMX and floating point operations at the same time.
                          • The next was SSE. It adds a separate register space to the microprocessor.

                          That's also true for SSE2, SSE3 etc. These new registers XMM0 to XMM7 in a 32 bit environment are using the floating point units (loader, adder and multiplier) and not the CPU units. It's possible to use those units in the Protected Mode and in the Real Mode. There are no restrictions.

                          But you made a good point:
                          how big are the vectors that you wish to compute the dot product of and how many dot products do you need to compute?
                          The size makes a difference to the best way to calculate them.
                          At the moment I'm writing some example code for Philipp. I had to make some assumptions for that: Let's say we have 32 KB L1 cache. So we can have a size of n = 2048 elements for each single array, using 16 KB for both vectors (leaving some cache space free). I'll upload the code in the next days and hope the arrays will be large enough for Philipp.

                          Gunther
                          Encoding Team

                          Comment


                          • #14
                            Gunther,
                            SSE opcodes have nothing to do with the CPU mode
                            Yes, I got that bit wrong, they can run in 16bit mode with the usual 16bit restictions.

                            Paul.

                            Comment


                            • #15
                              Gunther:

                              I had to make some assumptions for that: Let's say we have 32 KB L1 cache. So we can have a size of n = 2048 elements for each single array, using 16 KB for both vectors (leaving some cache space free). I'll upload the code in the next days and hope the arrays will be large enough for Philipp.
                              There are 2 arrays of 256 elements each in my application. 2048 elements seems to be large enough. Thank you for your effort.

                              Philipp

                              Comment


                              • #16
                                Here's a first go at the problem.
                                Performance differs wildly on different CPUs.
                                On a modern AMD the SSE version can complete in under 230 clks, that's less than 1 CPU clk per element.
                                Code:
                                'PBCC5.01 program
                                '%sse3=1
                                TYPE dquad
                                    a AS SINGLE
                                    b AS SINGLE
                                    c AS SINGLE
                                    d AS SINGLE
                                END TYPE
                                
                                FUNCTION PBMAIN () AS LONG
                                
                                LOCAL a(),b() AS SINGLE
                                LOCAL aptr,bptr AS SINGLE PTR
                                LOCAL r,l AS LONG
                                LOCAL z AS dquad
                                LOCAL res() AS LONG
                                LOCAL t AS QUAD
                                LOCAL mydata AS STRING*100000    'a block of memory within which I can store the data and ensure it's aligned
                                
                                DIM a(255) AT VARPTR(mydata)+2048 AND &hffffff00   'allign a() on a 256 byte boundary
                                DIM b(255) AT VARPTR(mydata)+4096 AND &hffffff00   'allign b() on a 256 byte boundary
                                
                                
                                'z.a=133:z.b=2:z.c=3:z.d=4
                                
                                aptr=VARPTR(a(0))
                                bptr=VARPTR(b(0))
                                
                                DIM res(20)
                                
                                'fill arrays with dummy data
                                FOR r = 0 TO 255
                                    a(r)=r
                                    b(r)=3*r
                                NEXT
                                
                                FOR l = 1 TO 10
                                TIX t
                                sum!=0
                                
                                'do it in BASIC
                                FOR r& = 0 TO 255
                                    sum!=sum!+a(r&)*b(r&)
                                NEXT
                                
                                
                                TIX END t
                                res(l)=t
                                NEXT
                                
                                FOR r = 1 TO 10
                                    PRINT r,res(r)
                                NEXT
                                
                                PRINT "BASIC result";sum!
                                
                                sum!=0
                                FOR l = 1 TO 10
                                TIX t
                                
                                'do it using the FPU in ASM
                                !mov eax,aptr                 'get pointer to start of first array
                                !mov edx,bptr                 'get pointer to start of second array
                                
                                !mov ecx,255
                                !fldz                         'set sum=0
                                #ALIGN 16
                                lp2:
                                
                                !fld dword ptr [eax+ecx*4]    'get a value from first vectopr
                                !fmul  dword ptr [edx+ecx*4]  'multiply by coresponding value in second vector
                                !faddp st(1),st(0)            'add result to sum
                                
                                !dec ecx                      'next
                                !jns lp2
                                !fstp sum!
                                
                                TIX END t
                                res(l)=t
                                NEXT
                                
                                FOR r = 1 TO 10
                                    PRINT r,res(r)
                                NEXT
                                PRINT "  FPU result";sum!
                                
                                sum!=0
                                
                                'get data into cache
                                'for r&=0 to 255 step 16
                                '    a(r)=a(r)
                                '    b(r)=b(r)
                                'next
                                
                                FOR l = 1 TO 10
                                
                                '!cpuid
                                TIX t
                                #IF 1
                                'do it in SSE ASM
                                !mov ecx,0                    'loop counter for elements to get
                                
                                !movdqu xmm4,z                'zero the sum registers
                                !movdqa xmm5,xmm4
                                !movdqa xmm6,xmm4
                                !movdqa xmm7,xmm4
                                
                                !mov eax,aptr                 'get pointer to start of first array
                                !mov edx,bptr                 'get pointer to start of second array
                                
                                #ALIGN 16
                                lp:
                                
                                !movdqa xmm0,[eax+ecx*4]      'get 16 elements of first vector
                                !movdqa xmm1,[eax+ecx*4+16]
                                !movdqa xmm2,[eax+ecx*4+32]
                                !movdqa xmm3,[eax+ecx*4+48]
                                
                                
                                !mulps xmm0,[edx+ecx*4]        'multiply by 16 elements of second vector
                                !mulps xmm1,[edx+ecx*4+16]
                                !mulps xmm2,[edx+ecx*4+32]
                                !mulps xmm3,[edx+ecx*4+48]
                                
                                
                                !addps xmm4,xmm0               'add the products
                                !addps xmm5,xmm1
                                !addps xmm6,xmm2
                                !addps xmm7,xmm3
                                
                                
                                !add ecx,16                     'next block
                                !cmp ecx,256
                                !jne lp
                                
                                !addps xmm4,xmm5                'add the partial sums
                                !addps xmm6,xmm7
                                !addps xmm4,xmm6
                                
                                #IF %DEF (%sse3)
                                !haddps xmm4,xmm4   'sum pairs of results
                                !haddps xmm4,xmm4   'sum the sum of pairs of results
                                #ELSE
                                !movaps xmm1, xmm4
                                !shufps xmm4, xmm1, &hb1
                                !addps xmm4, xmm1
                                !movaps xmm1, xmm4
                                !shufps xmm4, xmm4, &h0a
                                !addps xmm4, xmm1
                                #ENDIF
                                
                                
                                !movd sum!,xmm4     'store the dot product
                                !emms
                                #ENDIF
                                
                                TIX END t
                                res(l)=t
                                NEXT
                                
                                FOR r = 1 TO 10
                                    PRINT r,res(r)
                                NEXT
                                
                                PRINT "  SSE result";sum!
                                
                                WAITKEY$
                                
                                END FUNCTION

                                Comment


                                • #17
                                  Gunther,
                                  you sent me a PM but I can't reply. I get this error:
                                  Gunther Ilzig has chosen not to receive private messages or may not be allowed to receive private messages. Therefore you may not send your message to him/her.
                                  Paul.

                                  Comment


                                  • #18
                                    Gunther Ilzig has chosen not to receive private messages or may not be allowed to receive private messages. Therefore you may not send your message to him/her.
                                    Gunther updated his email address in his forum account settings and needs to validate this new email address by clicking the link in the email sent to his new email address. Until this is done, he cannot PM, post, etc.

                                    Gunther if you are having problems with this, email me at [email protected] and I will do what I can to help.
                                    Sincerely,

                                    Steve Rossell
                                    PowerBASIC Staff

                                    Comment


                                    • #19
                                      Steve,

                                      I've validated my new mail address and it worked well. Thank you.

                                      The title of this thread is How fast can we compute the dot product? Here is my answer: So fast can we compute the dot product!

                                      I've uploaded a test program in the PB files area under: http://www.powerbasic.com/support/downloads/dos.htm. The title is Computing the Dot Product with PB/DOS. It can be downloaded directly at: http://www.powerbasic.com/support/do...DotProduct.zip.

                                      The package shows the efficient calculating with SSE instructions. It's easy to rewrite it for PB/CC or PB/Win. All source code and documentation is included.

                                      Gunther
                                      Encoding Team

                                      Comment


                                      • #20
                                        Gunther,
                                        The title of this thread is How fast can we compute the dot product? Here is my answer: So fast can we compute the dot product!
                                        My answer, on my ageing Atlhon XP 2600+ CPU (1.9GHz), if I repeat the calculation 1,000,000 times, as in your tests, then I avearage under 200ns for a dot product on two 256 element vectors.

                                        That's a total run time of 0.2s for the 1 million dot product calculations, 4 times quicker than the 0.88s reported by you.
                                        Some of the difference might be because you runs in DOS. I ran mine in Windows.

                                        Paul.

                                        Comment

                                        Working...
                                        X