FPU precision speed


    I just found out that a piece of petroleum industry standard economic software (written for Windows) that we use at work is programmed in SINGLE precision. Bigger numbers were changing slightly on us and I was trying to figure out why. When I talked to a person on the help desk and found this out, I went from disbelief to anger. He defended it by saying that single precision is so much faster than double precision. I can't imagine that there is much difference once this goes through the FPU. Has anyone had experience with this issue? I will test it myself when I have the time. I know graphics APIs use singles over doubles for speed reasons.
    Last edited by james klutho; 8 Apr 2008, 03:51 PM.
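
The drift in large values is what a 24-bit significand would predict. As a minimal sketch (in Python rather than PowerBASIC, so it runs anywhere), the standard `struct` module can round a value to IEEE 754 single precision. The helper name `to_single` and the sample amount are illustrative assumptions, not taken from the software in question:

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

if __name__ == "__main__":
    amount = 123456789.12  # hypothetical large monetary value
    print(f"double: {amount:.2f}")
    print(f"single: {to_single(amount):.2f}")
```

At this magnitude adjacent single-precision values are 8 apart, so the stored value becomes 123456792.00 while a double keeps the cents.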

    #2
    Here is some test code; I didn't see much difference. You may want to try operations other than addition too.

    Note: added later--code to set FPU precision

    Code:
    #COMPILE EXE
    #DIM ALL
    #REGISTER NONE
    DECLARE FUNCTION QueryPerformanceCounter LIB "KERNEL32.DLL" ALIAS "QueryPerformanceCounter" (lpPerformanceCount AS QUAD) AS LONG
    DECLARE FUNCTION QueryPerformanceFrequency LIB "KERNEL32.DLL" ALIAS "QueryPerformanceFrequency" (lpFrequency AS QUAD) AS LONG
    
    '~~~~~~~~~~~A Variation of Dave Roberts' MACRO Timer~~~~~~~~~~~~~~~~~~~~~~~
    MACRO onTimer
      LOCAL qFreq, qOverhead, qStart, qStop AS QUAD
      LOCAL f AS STRING
      f = "#.###"
      QueryPerformanceFrequency qFreq
      QueryPerformanceCounter qStart ' Intel suggestion. First use may be suspect
      QueryPerformanceCounter qStart ' So, wack it twice <smile>
      QueryPerformanceCounter qStop
      qOverhead = qStop - qStart     ' Relatively small
    END MACRO
    
    MACRO goTimer = QueryPerformanceCounter qStart
    MACRO stopTimer = QueryPerformanceCounter qStop
    
    MACRO showTimer = USING$(f,(qStop - qStart - qOverhead)*1000000/qFreq /1000) + " milliseconds"
    '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    FUNCTION PBMAIN () AS LONG
    
        LOCAL singVar AS SINGLE, doubVar AS DOUBLE
        LOCAL ctrlWord AS WORD
        LOCAL ii AS LONG
        ontimer
        gotimer
        !fstcw ctrlWord
        !and ctrlWord, &b1111110011111111 ;set to single precision
        !fldcw ctrlWord
        FOR ii = 1 TO 100000000
           singVar = singVar + 1.1
        NEXT
        stoptimer
        ? showtimer
    
        gotimer
        !or  ctrlWord, &b0000001000000000 ;set to double precision
        !fldcw ctrlWord
        FOR ii = 1 TO 100000000
           doubVar = doubVar + 1.1
        NEXT
        stoptimer
        !or  ctrlWord, &b0000001100000000 ;set back to ext precision
        !fldcw ctrlWord
        ? showtimer
    
    END FUNCTION
    Last edited by John Gleason; 8 Apr 2008, 06:12 PM. Reason: added code to set FPU precision
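
The three masks in the listing act on the precision-control field, bits 8 and 9 of the x87 control word (00 = single, 10 = double, 11 = extended). The following Python sketch reproduces only the bit arithmetic and does not touch the FPU. The constant and function names are hypothetical, and 0x037F is used as the starting value because it is the control word the x87 holds after FINIT:

```python
PC_MASK = 0x0300      # bits 8-9: precision-control field
PC_SINGLE = 0x0000    # 24-bit significand
PC_DOUBLE = 0x0200    # 53-bit significand
PC_EXTENDED = 0x0300  # 64-bit significand

def set_precision(ctrl_word: int, pc: int) -> int:
    """Return the 16-bit control word with its precision field replaced by pc."""
    return (ctrl_word & ~PC_MASK & 0xFFFF) | pc

def precision_name(ctrl_word: int) -> str:
    """Decode the precision-control field of a control word."""
    names = {PC_SINGLE: "single", 0x0100: "reserved",
             PC_DOUBLE: "double", PC_EXTENDED: "extended"}
    return names[ctrl_word & PC_MASK]

if __name__ == "__main__":
    cw = 0x037F  # x87 default after FINIT: extended precision
    for pc in (PC_SINGLE, PC_DOUBLE, PC_EXTENDED):
        cw = set_precision(cw, pc)
        print(f"{cw:016b} -> {precision_name(cw)}")
```

Clearing the field before OR-ing the new value makes the helper independent of the previous setting. The listing instead relies on stepping upward from single to double to extended, where a plain OR happens to be enough.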


      #3
      Twenty-some years ago there was a difference. 8088's and 8087's. Now? Hrmmmmmmph. Negligible.

      Best regards,

      Bob Zale
      PowerBASIC Inc.


        #4
        James,
        The 8087 Floating Point co-processor (the original Intel FPU) lists single precision multiply as taking 11.9 usec and extended precision multiply as taking 16.9 usec, which is 42% slower. Add and subtract were not iterative, so they took the same time at every precision. More complex instructions were iterative and took longer for more bits.


        These days the FPU is a lot faster and does the multiply of any size at equal speed but the more complex instructions are still done iteratively and, if set to a lower precision mode, the FPU will run faster.
        Examples from the Athlon Optimization Guide:
        FMUL (any size) latency=4 cycles

        FDIV single latency= 16 cycles
        FDIV double latency= 20 cycles
        FDIV extended latency= 24 cycles (50% slower than SINGLE)

        FSQRT single latency= 19 cycles
        FSQRT double latency= 27 cycles
        FSQRT extended latency= 35 cycles
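
The relative cost of each precision follows directly from the cycle counts quoted above. A small Python sketch of that arithmetic, using only the figures in this post (the dictionary layout and function name are illustrative):

```python
# Latencies in cycles, as quoted above from the Athlon Optimization Guide.
LATENCY_CYCLES = {
    "FDIV":  {"single": 16, "double": 20, "extended": 24},
    "FSQRT": {"single": 19, "double": 27, "extended": 35},
}

def percent_slower(op: str, precision: str, baseline: str = "single") -> float:
    """How much longer op takes at precision than at baseline, in percent."""
    cycles = LATENCY_CYCLES[op]
    return 100.0 * (cycles[precision] - cycles[baseline]) / cycles[baseline]

if __name__ == "__main__":
    for op in LATENCY_CYCLES:
        for precision in ("double", "extended"):
            print(f"{op} {precision}: {percent_slower(op, precision):.1f}% slower than single")
```

By these numbers FSQRT is hit harder than FDIV: extended precision costs about 84% more than single for a square root, against 50% for a divide.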

        Also, for large arrays of data, the memory bandwidth can become a problem. A DOUBLE takes twice the memory of a SINGLE, so a large array of DOUBLEs can take twice as long to fetch from and write to memory as an array of SINGLEs with the same number of elements, even if the FPU calculates at the same speed.

        EXTs are worse. At 10 bytes each they are larger, and they are either packed at 10-byte spacing, which leaves them misaligned in memory (instead of on 8- or 16-byte boundaries) and can make them considerably slower, or padded out to 16 bytes, which aligns them but means twice as much memory must be fetched compared to a DOUBLE.
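
To put rough numbers on the size and alignment argument, here is a Python sketch. It assumes 64-byte cache lines and an array that starts on a line boundary; both are assumptions for illustration, as are the names `array_bytes` and `straddle_fraction`:

```python
# Element sizes in bytes for the cases discussed above.
ELEMENT_BYTES = {"SINGLE": 4, "DOUBLE": 8, "EXT packed": 10, "EXT padded": 16}

def array_bytes(count: int, kind: str) -> int:
    """Memory needed for count elements of the given kind."""
    return count * ELEMENT_BYTES[kind]

def straddle_fraction(element_bytes: int, line_bytes: int = 64, sample: int = 3200) -> float:
    """Fraction of back-to-back elements that cross a cache-line boundary.

    Assumes the array starts at offset 0 of a cache line.
    """
    crossing = 0
    for i in range(sample):
        first = i * element_bytes
        last = first + element_bytes - 1
        if first // line_bytes != last // line_bytes:
            crossing += 1
    return crossing / sample

if __name__ == "__main__":
    for kind, size in ELEMENT_BYTES.items():
        print(f"{kind:10s} {array_bytes(1_000_000, kind):>10d} bytes, "
              f"{100 * straddle_fraction(size):.1f}% of elements straddle a line")
```

Under those assumptions one packed EXT in eight spans two cache lines, while SINGLE, DOUBLE and padded EXT elements never do.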

        Paul.


          #5
          Thanks to everyone for their comments.

          Jim


            #6
            When comparing real-world performance between single and double precision, the real difference in timing usually comes from the doubled bandwidth needs: half as much data transferred in the same time, half as much data fitting in the same cache, and so on.

            Edit:
            Paul's post obviously came first!

            Bye!
            -- The universe tends toward maximum irony. Don't push it.

            File Extension Seeker - Metasearch engine for file extensions / file types
            Online TrID file identifier | TrIDLib - Identify thousands of file formats


              #7
              John,
              be careful with your testing.
              When the FPU is set to EXTENDED precision, as it is with PB, ALL calculations are done to EXTENDED precision, even those with SINGLEs and DOUBLEs. The EXTENDED precision result is then converted to the required precision after the calculation is complete, when the result is stored.

              The FPU has bits in the control register to tell it what precision to use. If it's told to work in SINGLE precision via those bits, it will stop calculating once that precision is reached, and so it finishes sooner.

              If the FPU is set to work in EXTENDED precision then it's not enough to just use a SINGLE variable as that variable will be converted by the FPU to EXTENDED when it's loaded, the calculation will be performed in EXTENDED precision and the result will then be converted back to SINGLE for storing back in memory.
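
The difference between rounding after every operation and rounding once at the store can be shown numerically. In this Python sketch the native double stands in for the FPU's wider internal format (an assumption: real EXTENDED has a 64-bit significand, a double only 53), and `to_single` emulates the conversion done when a SINGLE is stored. All names are illustrative:

```python
import struct

def to_single(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_rounding_each_step(values):
    """Emulates an FPU in SINGLE mode: every intermediate result is rounded."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total

def sum_rounding_at_store(values):
    """Emulates a wide FPU: SINGLE operands, wide arithmetic, one rounding at the end."""
    total = 0.0
    for v in values:
        total += to_single(v)
    return to_single(total)

if __name__ == "__main__":
    terms = [16777216.0, 1.0, 1.0]  # 2^24 + 1 + 1
    print(sum_rounding_each_step(terms))
    print(sum_rounding_at_store(terms))
```

With these terms the step-by-step version stays at 16777216 because each added 1 is rounded away, while the wide version reaches 16777218 before the single store. A timing test that leaves the FPU in EXTENDED mode is therefore measuring the second path, whatever type the variable has.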

              Paul.


                #8
                >>the calculation will be performed in EXTENDED precision and the result will then be converted back to SINGLE for storing back in memory.

                "IC," said the newbie-fpu man, as he picked up his HADDPD and SAHF.


                  #9
                  James--

                  Obviously, you can measure the difference in speed between calculating a single, a double and an extended float. However, in the context of a complete application, float opcodes represent just a part of the total. Typically, a small part. I'd be very surprised to see any application where the measurable total difference was more than 1-2%.
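
This is the usual Amdahl-style argument: a slowdown confined to a fraction of the run time is diluted by everything else. A Python sketch of the arithmetic follows; the 5% float share and 25% local slowdown are made-up inputs chosen only to show the scale, not measurements of any application:

```python
def overall_slowdown_percent(fp_fraction: float, fp_slowdown: float) -> float:
    """Whole-program slowdown in percent.

    fp_fraction: share of run time spent in float opcodes (0..1).
    fp_slowdown: factor by which that share gets slower (1.25 = 25% slower).
    """
    relative_runtime = (1.0 - fp_fraction) + fp_fraction * fp_slowdown
    return (relative_runtime - 1.0) * 100.0

if __name__ == "__main__":
    # Hypothetical: 5% of the time in float opcodes, each 25% slower.
    print(f"{overall_slowdown_percent(0.05, 1.25):.2f}% slower overall")
```

With those inputs the whole program slows by about 1.25%, which is in the range quoted above. The figure grows in proportion to the float share, so a number-crunching inner loop would behave differently.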

                  Bob Zale
                  PowerBASIC Inc.


                    #10
                    I have accumulated some experience with intensive floating point calculations, and I always set all float variables to DOUBLE, with negligible loss of speed but much gain in "safety".
                    However, I am with Marco: there is a situation in which it can be convenient to switch to SINGLE (and I have done it a couple of times).
                    This happens when precision is not an issue at all and, at the same time, a very large amount of data needs to undergo a very simple manipulation, such as a single addition, subtraction or multiplication.
                    Since the CPU is usually much faster than the pipeline moving data to and from memory, under these circumstances the bottleneck of the computation is the data transfer and not the arithmetic, and having half-sized data will speed it up by a factor of 2.
                    Regards
                    Aldo Vitagliano
                    alvitagl at unina it
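
The factor of 2 falls out of a very simple cost model in which a pass over the data takes as long as the slower of its memory traffic and its arithmetic. The Python sketch below is such a model, not a benchmark; the bandwidth and operation rates are invented round numbers and the function name is hypothetical:

```python
def estimated_seconds(count: int, element_bytes: int,
                      bytes_per_second: float, ops_per_second: float) -> float:
    """Time for one simple operation per element, limited by memory or by arithmetic."""
    memory_time = 2 * count * element_bytes / bytes_per_second  # one read plus one write
    compute_time = count / ops_per_second
    return max(memory_time, compute_time)

if __name__ == "__main__":
    n = 100_000_000
    bandwidth = 4e9  # assumed bytes per second to and from memory
    fast_cpu = 1e9   # assumed simple float operations per second
    single = estimated_seconds(n, 4, bandwidth, fast_cpu)
    double = estimated_seconds(n, 8, bandwidth, fast_cpu)
    print(f"SINGLE {single:.2f} s, DOUBLE {double:.2f} s, ratio {double / single:.1f}")
```

In this model the ratio is 2 whenever memory traffic dominates and drops to 1 once the arithmetic per element is heavy enough to become the limit.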


                      #11
                      >>I have accumulated some experience with intensive floating point calculations, and I always set all float variable to DOUBLE, with negligible loss of speed but much gain in "safety".

                      I think I remember reading somewhere in the help file the PB compilers are optimized for EXT floats.

                      Which suggests ..... when using these compilers one should use EXT for floats and convert to SINGLE/DOUBLE if necessary/wanted only after completion of calculations .... that is , in the absence of a compelling reason to use SINGLE/DOUBLE for the actual calculations.

                      i.e., for integers use LONG, for floats use EXT to take advantage of the compiler's optimizations.

                      If I remembered this EXT thing incorrectly, never mind.
                      Last edited by Michael Mattias; 9 Apr 2008, 09:02 AM.
                      Michael Mattias
                      Tal Systems (retired)
                      Port Washington WI USA
                      [email protected]
                      http://www.talsystems.com


                        #12
                        Michael,
                        I think you remember incorrectly. Never mind.
                        Individual EXTs can use register variables, which speeds them up considerably and makes them faster than SINGLEs and DOUBLEs, but for arrays of variables there aren't enough registers to hold the array.

                        Paul.


                          #13
                          One for two ain't bad. (I KNOW "LONG" is the 'datatype of choice' for integers).

                          I do next to nothing with floats. About 99% of my numeric needs are covered by integers and money, for which I use LONG and CUR/CUX respectively.
                          Michael Mattias
                          Tal Systems (retired)
                          Port Washington WI USA
                          [email protected]
                          http://www.talsystems.com
