Announcement

Collapse
No announcement yet.

Dwords are bad m'kay?...

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dwords are bad m'kay?...

    ... well dwords seem to be quite fast when used with SHIFT LEFT and SHIFT RIGHt. But otherwise, they seem to be very slow.

    I've been spending quite a few hours today and yesterday trying to optimize a piece of code which I had translated over from C source. The C source was about roughly 6x as fast as mine and I couldnt figure out why. I was using DWords in every case where they were using "unsigned longs" When compiled, my app was taking ~40 seconds to go from start to finish, the C app was taking ~7 seconds. After converting all my Dwords to Longs, my code is now ~11 seconds which is much better. Im going to make a few more changes and hopefully I can catch or even beat 7 seconds.


    Anyone know why the C code wasnt getting hammered when using the "unsigned long" variable (the PB equiv of dword right?) I compiled a version of the C code using "longs" instead of "unsigned longs" and it seemed to actually get slower by about 1.5 seconds... why could that be?

    Here are the results to a few tests I was running. If you can give any insight into why some of the numbers are the way they are (some of which seem counter intuitive) please do!

    Code:
    Test Machine = Pentium II Celeron 266 w/ 128megs PC100 RAM, running Windows98 original release.
    Data types tested = Integer, Long, Word, DWord, Single, Double
    Declarations Tested = LOCAL, REGISTER, EXT
    
    ---------------------------------------
    ---------------------------------------
    Test 1: XOR, AND, OR
    50 million itterations 
    --------------------------------------
    #COMPILE EXE
    #REGISTER NONE
    #DIM ALL
    
    FUNCTION PBMAIN()
        
        REGISTER TestVariable1 AS LONG
        REGISTER TestVariable2 AS LONG
        
        DIM Counter AS LONG
        DIM lngTime AS LONG
        DIM Itterations AS LONG
    
        Itterations = 50000000
        
        TestVariable1 = 1
        TestVariable1 = 2
    
        lngTime = TIMER
        FOR Counter = 1 TO Itterations
            TestVariable1 = TestVariable1 XOR TestVariable2
            TestVariable2 = TestVariable2 AND TestVariable1
            TestVariable1 = TestVariable2 OR TestVariable1
        NEXT Counter
    
        MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."
    
    END FUNCTION   
    --------
    Results |
    --------
    Registered Longs            1.37 seconds
    Registered Words            1.87 seconds 
    Local Words                 3.33 seconds
    Local Longs                 3.84 seconds
    Local Integers              4.12 seconds
    Registered Integers         4.81 seconds  (Note these were consistantly slower than LOCAL integers
    EXT Precision Floats       18.49 seconds
    Local Singles              19.98 seconds
    Local Doubles              25.14 seconds
    Registered Dwords          39.25 seconds
    Local Dwords               43.55 seconds
    
    
    ---------------------------------------
    ---------------------------------------
    Test 2: SHIFT LEFT , SHIFT RIGHT
    50 million itterations 
    --------------------------------------
    #COMPILE EXE
    #REGISTER NONE
    #DIM ALL
    
    FUNCTION PBMAIN()
        
        REGISTER TestVariable1 AS LONG
        REGISTER TestVariable2 AS LONG
        
        DIM Counter AS LONG
        DIM lngTime AS LONG
        DIM Itterations AS LONG
    
        Itterations = 50000000
        
        TestVariable1 = 16
        TestVariable1 = 32
    
        lngTime = TIMER
        FOR Counter = 1 TO Itterations
             SHIFT LEFT TestVariable1, 8&
            SHIFT LEFT TestVariable2, 8&
            SHIFT RIGHT TestVariable1, 8&
            SHIFT RIGHT TestVariable2, 8&        
        NEXT Counter
    
        MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."
    
    END FUNCTION   
    --------
    Results |
    --------
    Local Words                 6.89 seconds  (~2x faster than registered word)
    Local Integers              7.07 seconds  (locals seem to be faster than their registered counterparts overall)
    Local Dwords               7.89 seconds
    Local Longs                 7.90 seconds  
    Registered Dwords          8.20 seconds
    Registered Longs            8.47 seconds
    Registered Integers         12.20 seconds 
    Registered Words            12.23 seconds 
    EXT Precision Floats        N/A
    Local Singles               N/A
    Local Doubles               N/A
    
    
    ---------------------------------------------
    ---------------------------------------------
    TEST 3:  MULIPLY, DIVIDE
    5 million itterations
    ---------------------------------------------
    #COMPILE EXE
    #REGISTER NONE
    #DIM ALL
    
    FUNCTION PBMAIN()
        
        REGISTER TestVariable1 AS LONG
        REGISTER TestVariable2 AS LONG
        
        DIM Counter AS LONG
        DIM lngTime AS LONG
        DIM Itterations AS LONG
    
        Itterations = 5000000
        
        TestVariable1 = 100
        TestVariable2 = 200
    
        lngTime = TIMER
        FOR Counter = 1 TO Itterations
            TestVariable1 = TestVariable1 * TestVariable2
            TestVariable2 = TestVariable2 * TestVariable1
            TestVariable1 = TestVariable1 / TestVariable2
            TestVariable2 = TestVariable2 / TestVariable1
        NEXT Counter
    
        MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."
    
    END FUNCTION
    --------
    Results |
    --------
    Local Longs                 5.26 seconds
    Local Integers              5.29 seconds
    Local Singles               9.06 seconds
    Registered Longs            9.44 seconds  (why are these slower than Local declared longs?)
    EXT Precision Floats        9.66 seconds
    Registered Words            9.88 seconds
    Registered Integers         9.88 seconds  (same as registered Words) 
    Local Doubles              10.65 seconds
    Local Words                10.71 seconds
    Registered DWords          12.13 seconds
    Local DWords               13.22 seconds
    
    
    ---------------------------------------------
    ---------------------------------------------
    TEST 4:  ADDITIONS, SUBTRACTION, INCR
    50 million itterations
    ---------------------------------------------
    #COMPILE EXE
    #REGISTER NONE
    #DIM ALL
    
    FUNCTION PBMAIN()
        
        REGISTER TestVariable1 AS LONG
        REGISTER TestVariable2 AS LONG
        
        DIM Counter AS LONG
        DIM lngTime AS LONG
        DIM Itterations AS LONG
    
        Itterations = 50000000
        
        TestVariable1 = 999
        TestVariable2 = 111
    
        lngTime = TIMER
        FOR Counter = 1 TO Itterations
            TestVariable1 = TestVariable1 + TestVariable1
            TestVariable2 = TestVariable2 + TestVariable2
            TestVariable1 = TestVariable1 - TestVariable2
            TestVariable2 = TestVariable1 - TestVariable2
        NEXT Counter
    
        MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."
    
    END FUNCTION
    --------
    Results |
    --------
    Registered Longs            1.34 seconds 
    Registered Words            2.74 seconds
    Registered DWordS           2.96 seconds
    Local Integers              3.74 seconds
    Local Words                 3.92 seconds
    Local Longs                 3.96 seconds
    Local DWords                4.17 seconds
    Registered Integers         6.18 seconds  (slower than Local Integers, and significantly slower than registered Words) 
    EXT Precision Floats       91.99 seconds
    Local Singles              92.34 seconds
    Local Doubles             101.05 seconds
    
    ---------------------------------------------
    ---------------------------------------------
    TEST 5:  SQUARE ROOTS and EXPONENTS
    50 million itterations
    ---------------------------------------------
    #COMPILE EXE
    #REGISTER NONE
    #DIM ALL
    
    FUNCTION PBMAIN()
        
        REGISTER TestVariable1 AS LONG
        REGISTER TestVariable2 AS LONG
        
        DIM Counter AS LONG
        DIM lngTime AS LONG
        DIM Itterations AS LONG
    
        Itterations = 50000000
        
        TestVariable1 = 2
        TestVariable2 = 3
    
        lngTime = TIMER
        FOR Counter = 1 TO Itterations
            TestVariable1 = TestVariable1^1
            TestVariable2 = TestVariable2^1
            TestVariable1 = sqr(TestVariable2)
            TestVariable2 = sqr(TestVariable1) 
        NEXT Counter
    
        MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."
    
    END FUNCTION
    --------
    Results |
    --------
    EXT Precision Floats       38.94 seconds EXT registered precision floating points top in this scenario
    Local Singles              39.91 seconds
    Local Longs                43.85 seconds
    Local Integers             43.75 seconds
    Local Doubles             45.27 seconds
    Registered Words           46.20 seconds
    Registered Integers        46.40 seconds
    Registered Longs           46.41 seconds 
    Registered DWORDS          53.37 seconds
    Local Words                55.92 seconds
    Local DWORDS               60.02 seconds
      
    
    -----------------------------------------------------
    CONCLUSIONS
    What I learned (i think)
    - Avoid using Singles, Doubles and ESPECIALLY DWords and Words with bitwise operations like AND, OR, and XOR
    - Floating points are extremely slow when it comes to addition and substraction
    - When bitshifting, Local variables are faster than registered variables. (Anyone know why?)
    - EXT Floating points, and Doubles and Singles are slightly faster than other types when it comes to square roots and raising power.
    - During multiplication and divisions, register integers, longs and words are slower than local declared integers, longs and words. Anyone know why?

    Thanks!
    -Mike



    [This message has been edited by Mike Joseph (edited April 01, 2000).]

  • #2
    What I learned (i think)
    - Avoid using Singles, Doubles and ESPECIALLY DWords and Words with bitwise operations like AND, OR, and XOR
    - Floating points are extremely slow when it comes to addition and substraction
    - When bitshifting, Local variables are faster than registered variables. (Anyone know why?)
    - EXT Floating points, and Doubles and Singles are slightly faster than other types when it comes to square roots and raising power.
    - During multiplication and divisions, register integers, longs and words are slower than local declared integers, longs and words. Anyone know why?
    1. It's bad for performance of your app and other apps running.
    2. Yep
    3. To put it simply, 2 extra copy operations per SHIFT call. SHIFT is a function and wants a variable pointer in Ebx. So the compiler creates a hidden LOCAL LONG, copies ESI/EDI into the long, calls SHIFT, copies the long back into ESI/EDI.
    4. sqr is a floating. Power uses a function. (see #3)

    I'm not too good with ASM, but I think PB thinks of DWORDs as floating point numbers trapped in an integer body.
    Code:
    ' Integer divide with DWORD REGISTER
    mov         esi,20h                          ' var1 = 32
    mov         edi,10h                          ' var2 = 16
    mov         dword ptr [ebp-54h],edi          ' edi into stack var
    mov         dword ptr [ebp-50h],0            ' 0 into stack var
    fild        qword ptr [ebp-54h]              ' convert stack var to 80-bit float on float stack
    mov         dword ptr [ebp-54h],esi 
    mov         dword ptr [ebp-50h],0
    fild        qword ptr [ebp-54h]              ' convert stack var to 80-bit float on float stack
    call        00401827                         ' integer floating point divide function
    00401827   frndint                           ' round float stack var to integer
    00401829   fxch        st(1)                 ' swap float stack
    0040182B   frndint                           ' round float stack var to integer
    0040182D   fdivp       st(1),st              ' floating divide
    0040182F   fldcw       word ptr ds:[401516h] ' load float control register with constant?
    00401835   frndint                           ' round float stack var to integer
    00401837   fldcw       word ptr ds:[401510h] ' load float control register with constant?
    0040183D   ret                               ' return
    fistp       qword ptr [ebp-54h]              ' convert float stack var to 64-bit integer
    mov         esi,dword ptr [ebp-54h]          ' copy stack var back to esi
    ' Integer divide with LONG REGISTER
    mov         esi,20h
    mov         edi,10h
    mov         eax,esi
    cdq                                   ' convert eax to 64-bit int
    idiv        eax,edi 
    mov         esi,eax
    ' Multiply with DWORD REGISTER
    mov         esi,20h                    
    mov         edi,10h
    mov         dword ptr [ebp-54h],edi   ' edi into stack var
    mov         dword ptr [ebp-50h],0     ' 0 into stack var  
    fild        qword ptr [ebp-54h]       ' convert stack var to 80-bit float on float stack
    mov         dword ptr [ebp-54h],esi   
    mov         dword ptr [ebp-50h],0
    fild        qword ptr [ebp-54h]       ' convert stack var to 80-bit float on float stack
    fmulp       st(1),st                  ' floating point multiply
    fistp       qword ptr [ebp-54h]       ' convert float stack var to 64-bit integer
    mov         esi,dword ptr [ebp-54h]   ' copy stack var back to esi
    ' Multiply with LONG REGISTER
    mov         esi,20h
    mov         edi,10h
    mov         eax,edi
    imul        esi
    mov         esi,eax
    SUMMARY: DWORDs are slow for your app and other apps on the system! After thinking it over more, I would hope this is a compiler bug and not something the compiler does on purpose.


    [This message has been edited by Enoch S Ceshkovsky (edited April 02, 2000).]

    Comment


    • #3
      Thanks for the feeback. I think you are right about DWord performance being a bug in the compiler. I ran the same 5 tests below against QUAD integers and even QUAD integers are significantly faster than DWORDS in 3 of the 5 tests.

      Test1: XOR, AND, OR
      Quad Integers = 29.41 seconds (faster than DWORDS by about ~10 secs)

      Test2: SHIFT LEFT, SHIFT RIGHT
      Quad Integers = 70.32 (quads are the slowest integer type for SHIFTing bits)

      Test3: MULTIPLY, DIVIDE
      Quad Integers = 5.77 seconds ( over twice as fast as DWORDS)

      TEST4: ADDITIONS, SUBTRACTION
      Quad Integers: 55.20 seconds (slowest of all integer types, but not as slow as floats)

      TEST5: SQUARE ROOTS AND EXPONENTS
      Quad Integers: 50.39 seconds (faster than DWORDS by ~9 seconds)

      So if you need integers larger than LONG type and you dont plan on using SHIFT and are mostly doing multiplies/divides, bitwise operations with very few additions/subtractions, then you are better off going with QUAD integers over DWORDs.

      -Mike

      ------------------


      [This message has been edited by Mike Joseph (edited April 03, 2000).]

      Comment


      • #4
        It appears PTR's are affected by this also, but worse they have to call to PB runtime functions that do the dword->quad->float, float->quad->dword. Quad should be faster because it doesnt have the extra step of dword->quad.
        Regardless, PB shouldn't be using floating math with a quad, dword, or a pointer.


        [This message has been edited by Enoch S Ceshkovsky (edited April 03, 2000).]

        Comment


        • #5
          Mike, try using inline asm to shift bits. I found a while ago that there was
          something about PowerBASIC's Shift staement that needed optimizing. Hopefully
          they'll change the Shift statement to a function which would permit more efficient
          coding.

          'SHIFT LEFT TestVariable1, 8&
          'SHIFT LEFT TestVariable2, 8&
          !shl TestVariable1, 8&
          !shl TestVariable2, 8&
          'SHIFT RIGHT TestVariable1, 8&
          'SHIFT RIGHT TestVariable2, 8&
          !shr TestVariable1, 8&
          !shr TestVariable2, 8&


          Ron

          Comment


          • #6
            Awesome, thank you! You made my day. The code ive been working on i was able improve the performance of processing a simulated 4 meg file to ~6.3 seconds. Switching to the inline assembly for shifting just dropped it down to ~4.8 seconds! I guess you really can squeeze blood out of a turnip

            The VC6 compiled code I'm using for comparison is churning out ~2.9seconds (originally i thought it was ~7seconds, but i was seriously disappointed to find out i was compiling a debug version.. when i switched it to Release configuration, it improved from 7 to 2.9)

            The LCC-Win32 compiler produces an exe that can process the file in ~4.9 seconds which makes the PB version faster by about .1 seconds.

            Whats interesting is, i used the _asm shr x,8; in place of the x >>= 8; in both the VC compiled version and the LCC-Win32 versions and the performance stayed pretty much the same.

            Guess now i'm going to have to start learnging asm in my spare time
            Thanks again.
            -Mike


            ------------------

            Comment


            • #7
              Mike, declare the same register variables in C and see if the results are different.
              Were you able to do inline asm in LCC?

              Ron

              Comment


              • #8
                Another comment to make here.
                Code:
                MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds
                Should be:
                Results = Timer-lngTime
                MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(Results,"####.##") & " seconds
                The first isnt very fair to PB if you want to compare it against VC6.


                ------------------

                Comment


                • #9
                  Enoch, i did update the position of the timer as you suggested. Results seem more or less the same. Unless the change is at least .1 second worth its hard to notice since even during seperate time tests on the same exact .exe results in variations of +/- 0.2 seconds (seems OS timing related)

                  Ron, in my PB code im already using register variables but I wasnt in the C version. I did a couple of tests on the C version compiled with VC and i was able to improve it from 2.9 to just about 2.09.

                  Using register variables with the LCC-win32 compiled version didnt seem to help at all. Regarding the use of _asm in the LCC version, i was in error. I in fact i had to comment those out to get it to compile.

                  I've added !xor assembly operators in place of some of the basic XOR operator and now my PB version is ~4.3 seconds which is better than the LCC version by .7 seconds. I have a few other XOR location's id like to do this to, but Im not sure how to get it to work with a Pointer. Any ideas? Here is the line I'd like to convert to asm.

                  X = X XOR @data.P(i)

                  Is there a faster way to do this then the following?

                  temp = @data.P(i)
                  !xor x,temp

                  It also seems that !xor in PB only works if either the source or the destination is a register variable... the AND operator seems to work with both. Also, I have used asm on the AND operator and have replaced the following line:

                  d = x AND 255 REM this is the original line

                  with an asm version which looks as follows

                  !and x,255 REM this version requires 3 lines of code
                  d = x
                  x= xOrig

                  This gives me a speed boost, but because the results are stored in the x variable, it needs to be restored since i must retain the value of x. Is there a way to avoid this?

                  -Mike



                  ------------------

                  Comment


                  • #10
                    Valid for XOR,OR,AND:
                    Code:
                    DIM A AS LONG, B AS LONG, pB as LONG PTR
                    REGISTER R as LONG
                    pB = VARPTR(B)
                    !XOR R,A   ' R has result
                    !XOR R,255
                    !XOR A,R
                    !XOR A,255
                    !XOR R, OFFSET pB  ' R = R XOR @pB
                    !XOR OFFSET pB, R  ' @pB = @pB XOR R
                    ' or this
                    !MOV R, B
                    !XOR A, R  ' A has result
                    ' or this
                    !MOV R, A
                    !XOR B, R  ' B has result
                    Here's a website you might find useful:
                    http://www.ece.uiuc.edu/ece291/books/artofasm.html
                    There are also ASM tutorials for PB programmers, if you search the Programming Forum.


                    [This message has been edited by Enoch S Ceshkovsky (edited April 03, 2000).]

                    Comment


                    • #11
                      Cool, thanks Enoch.

                      -Mike

                      ------------------

                      Comment


                      • #12
                        I have got much the same type of results as Ron Pierce with the difference
                        between using inline asm instructions and the intrinsic PB functions. The
                        times below are specific to my own computer which is a 600 PIII so if you
                        have either a faster or slower box, just alter the loop count so that the
                        minimum duration is over a half a second and you will get results within a
                        percent or so.

                        I have constructed the loop in asm to remove any interaction with
                        different loop types and variables and what I have found is that there is
                        no difference between DWORD and LONG when using a variable of that size.

                        In the 3 comparisons, the shl/shr instructions using a register are a lot
                        faster that using a memory variable and the intrinsic functions are slower
                        again.

                        The analysis by Enoch makes this slower operation clear as it is doing
                        more than the inline functions per loop.

                        The speed difference between a global variable and a local variable has
                        been documented by Intel with stack variables being generally faster. I
                        think it has some to do with the relocations in the PE header but even if
                        you need to use a global variable, copy it to a stack variable if it is
                        possible to try and get additional speed out of the loop.

                        Another factor from the Intel documentation is the order and size of stack
                        variables. It is generally preferred to put the DWORD/LONG size variables
                        first as they align on a 4 byte boundary and are accessed in one read.
                        put WORD, BYTE and others after it.

                        LOCAL var1 as LONG
                        LOCAL var2 as DWORD
                        LOCAL var3 as WORD
                        LOCAL var4 as BYTE

                        If you need to control the alignment, there is a trick in PowerBASIC where
                        you can use dummy structure (UDT) which is DWORD aligned so if you have
                        a long mess of different size data types on the stack, you can use the
                        structure as a method of aligning the following piece of data.

                        LOCAL dummy as My_DWORD_Aligned_Dummy_UDT

                        I tested the speed diference with Do loops and For loops and it seems that
                        the exit condition in a Do loop is what slows it down if it is done using
                        If <condition> Then Exit Do. A Do loop is faster if the exit is done with
                        a ! cmp REG, NUMBER.

                        If the loop is truly speed critical, construct the loop using a label and
                        a conditional jump at the end of the loop and use a register as a counter
                        if it is possible.

                        Regards,

                        [email protected]

                        Code:
                          
                          SUB LoopTest
                          
                              #REGISTER NONE
                          
                              LOCAL lvar1 as LONG     'DWORD
                          
                              lvar1 = 10000
                          
                              tc& = GetTickCount()
                          
                              ! xor edx, edx      ; zero counter
                              ! mov eax, 10000
                          
                            ltSt:
                              ! shl eax, 8        ; 500 ms
                              ! shr eax, 8
                           '     ! shl lvar1, 8      ; 2005 ms
                           '     ! shr lvar1, 8
                           '     SHIFT LEFT  lvar1, 8    ' 3860 ms
                           '     SHIFT RIGHT lvar1, 8
                              ! inc edx
                              ! cmp edx, 100000000
                              ! jne ltSt
                          
                              MessageBox hWnd,ByCopy str$(GetTickCount() - tc&), _
                                         "clock",%MB_OK or %MB_ICONINFORMATION
                          
                          END SUB
                        ------------------
                        hutch at movsd dot com
                        The MASM Forum

                        www.masm32.com

                        Comment


                        • #13
                          I wonder if any of the PB staff have a comment about why DWORDs & QUADs are using floating math...


                          ------------------

                          Comment


                          • #14
                            I don't know why DWORD's are using floating point, but the reason that QUADs (64-bit) using floating point is because the last time I looked, Intel wasn't shipping a 64-bit CPU yet.

                            --Dave


                            ------------------
                            PowerBASIC Support
                            mailto:[email protected][email protected]</A>
                            Home of the BASIC Gurus
                            www.basicguru.com

                            Comment


                            • #15
                              the reason that QUADs (64-bit) using floating point is because the last time I looked, Intel wasn't shipping a 64-bit CPU yet.
                              Cute
                              Quads can be manipulated faster with integer math, but it's easier for the compiler to use floats. Same idea as multiplying 32-bit variables in 16-bit. Something to add to the bottom of the list of "PB Optimizations".


                              ------------------

                              Comment

                              Working...
                              X