Announcement

Collapse
No announcement yet.

Optimization slows down my code

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Optimization slows down my code

    I am trying to speed up some code, but I'm baffled with the compiler's response. It seems like sometimes when I remove unnecessary code, instead of it running faster, my routines sometimes run slower. And changing code which would appear to be irrelevent tends to affect the program's speed as well.

    I have a DLL function that takes two arguments. I figured that by rearranging the code so that it takes only one argument would make it faster, but just the opposite seems to happen. I don't know if this is just on my machine (PIII 600Mhz). So hopefully others will test out the sample code below.

    Here are the tests to try. First compile and run the program as given below. Then comment out the line that says: Answer = TestA(1, 1) , and un-comment the one that says: Answer = TestB(1). And compare the speeds. On my machine the one with one argument is remarkably slower than the one with two.

    Here's another test. Comment out the first Dim statement, and un-comment the line below it to see the speed difference. Now the 1 arg function runs faster, but why? No code was changed within the loop. I thought that maybe register variables might have something to do with it, as I've had problems with it before, so I added "$Register None", but that apparently wasn't it.

    I'm trying to optimize my code, but the above phenomena makes it difficult. Hopefully, PB tech support can explain why this happens, so that I might have an easier time optimizing my code. For now, it seems like I'm kind of poking in the dark in some of my attempts.

    Code:
    $Compile Exe "test.exe"
    DefLng A-Z
    
    Function TestA(ByVal Arg1, ByVal Arg2) As Ext
       TestA = 1 + 1
    End Function
    
    Function TestB(ByVal Arg1) As Ext
       TestB = 1 + 1
    End Function
    
    Function WinMain (ByVal hCurInstance  As Long, _
                      ByVal hPrevInstance As Long, _
                      lpszCmdLine         As Asciiz Ptr, _
                      ByVal nCmdShow      As Long) Export As Long
    
        Dim x As Long, y As Ext, z As Long, Answer As Ext, q As Long
        'Dim x As Long, Answer As Ext
    
        t! = Timer
        For x = 0 To 100000000
            Answer = TestA(1, 1)
        '    Answer = TestB(1)
        Next
        tt! = Timer-t!
        MsgBox Str$(tt!)
    
    End Function
    One other strange thing. If I remove the "q As Long" from the Dim statement
    (q is not even used anywhere), TestA runs much slower.


    ------------------
    Daniel Corbier
    UCalc Fast Math Parser
    http://www.ucalc.com
    Daniel Corbier
    uCalc Fast Math Parser
    uCalc Language Builder
    sigpic

  • #2
    daniel,
    i confirm your results (on a piii,750mhz, nt4). i've had some discussions about this strange
    behaviour some time ago (without real explanations) see:

    (if you add another unused variable to your dim statement after 'q as long',
    the code becomes much more slowly, and adding a second one speeds it up again...)
    (seems the compilers optimizing needs some optimizing... )

    ------------------
    peter.
    mailto[email protected][email protected]</a>

    [this message has been edited by peter lameijn (edited january 07, 2001).]
    Regards,
    Peter

    Comment


    • #3
      Even weirder - move q to be declared first and use z for the loop,
      gave me a bit faster execution of the code. Probably has something
      with registers to do. Another thing:

      Calling functions in loops always takes extra time. If the function
      is short enough and/or you only call it from a few places, you'll
      get much better speed by putting it all inside the loop.

      "FOR x = 0 TO 100000000 : Answer = 1 + 1 : NEXT" is much faster,
      than repeatedly calling a separate function to do the work. For
      time critical loops. it's worth including even very long routines.


      ------------------

      Comment


      • #4
        Daniel;

        I'll take a guess at a possible cause of the problem.

        The PB compiler will attempt to use Register variables to optimize
        the code. Every time you change the code, you may be making the compiler
        pick a different variable to use as a Register variable.

        Try your tests with #REGISTER NONE first. This will elimate any
        odd results caused by the compiler using Register variables.

        IMO, I prefer to choose which variables are register variables,
        rather than let the compiler do it.

        Your code is using Extended real variables which can be turned into
        Register variables (up to 4 according to PB docs). While I can't
        guarantee this is the cause, it is worth investigating.




        ------------------
        Chris Boss
        Computer Workshop
        Developer of "EZGUI"
        http://cwsof.com
        http://twitter.com/EZGUIProGuy

        Comment


        • #5
          Daniel;

          I did some testing and it turns out the Register variables generated
          by the compiler are NOT the cause.

          It turns out that if you compare functions with 4 and 2 parameters
          the more parameters the function has the slower it is, as expected.
          . The one exception seems to be when functions with 2 and 1 are compared.

          For some reason the function with 1 parameter is slower than the one
          with 2. Likely this has something to do with how the compiler
          attempts to store the variable when there is only one parameter.

          IMO, the difference between the two functions is not significant
          as far as speed. The results are quite close. It can be assumed that
          once you have more than two parameters, the greater the number
          of parameters the slower the function is. Since the 1 parameter function
          is slightly slower than the one with 2, is not a big deal as
          long as the difference isn't very large.

          Likely the compiler has a good reason for what it is doing. Maybe
          there is a positive result in some other area. Likely only
          an assembler programmer could tell you why there is a difference in
          speed. Also another consideration is how different CPUs handle
          different assembler codes. Not all CPUs are equal. PB tend to target
          the lower end CPU rather than higher end. It would be interesting to
          see how the results would compare on a 386 or 486 ! I am using
          a Cyrix 686 - 200 MHZ.

          I wouldn't worry about the results. Most optimizations made according
          to the PB recommendations will likely result in a speed increase.

          Your loop is a hundred million iterations. I changed the code (I tested)
          to only 10 million iterations and the difference was only a few tenths
          of a second. Thats not bad !



          ------------------
          Chris Boss
          Computer Workshop
          Developer of "EZGUI"
          http://cwsof.com
          http://twitter.com/EZGUIProGuy

          Comment


          • #6
            This should stir up the pot a little...

            The results of these kind of "simple" tests will do more to highlight performance variations that depend on the brand and internal design of the processor used for the benchmark (Note: I use that term loosely ).

            Chris commented on compiler optimizations... From past discussions with R&D, it transpires that the compiler makes specific optimizations in order to get the best possible results from the widest range of processors. Depending on the exact type of benchmark being conducted, an AMD or CYRIX could outperform a PII/PIII/PIV, and another test could yield totally opposite results. No one ever said AMD and INTEL produced identical processors - they are designed to be "compatible" with each other.

            In summary, simple "non-real-world" tests like the one above will tend to be suceptable to the effects of processor optimization and design.

            Of course, for those of us that want the utmost performance from their applications can always use PowerBASIC's inline-assembler where you can have complete control over what goes on. As a side benefit, you'll probably learn quite a lot about writing processor-specific optimized code.

            I hope this helps...

            ------------------
            Lance
            PowerBASIC Support
            mailto:[email protected][email protected]</A>
            Lance
            mailto:[email protected]

            Comment


            • #7
              My actual source code is quite intricate. However, I took the time to reduce it to it's simplest terms, and then I modified it so that I could post something concise and clear enough for anyone to follow. This way others don’t have to wade through a lot of irrelevant lines to see what’s happening. Speed problems with my "real-world" code are what caused me to post this sample program. The actual code I'm having difficulty optimizing does not resemble what I've posted (it's not even an .Exe program). The lines above simply demonstrate the essence of the problem I’m facing.

              Anyway, for my product, every measurable fraction of a second does make a big difference in the time critical section. By the way, on my system, the above code takes around 8.5 seconds with two arguments, and around 12 seconds with one. That's not a negligible difference. Although the speed difference is less dramatic in my “real-world” actual code, it still makes enough of a difference for me to wonder what’s going on.

              This seemingly random compiler behavior makes optimizing my code time consuming and not very productive. Lance, if you can explain just what kind of optimizations the compiler does, this may help me channel my efforts in the right direction. You guys have explained how long integers, pointers, and registers can affect code speed. Now if you can explain how the compiler “optimizations” might affect our code, it would be greatly helpful. Is there some kind of rule where functions run faster when arguments come by two s? Is it possible that although a second argument makes it run a little faster on the PIII, it might make it twice as slow on another processor? I can’t afford to guess on these things. This is not light issue for my product. See the testimonial on my website of how a major airline typically does millions of calculations per aircraft with my product.

              PB prides itself in the speed of its compiler. I hope that it will take this issue seriously. One would assume that trimming down code (especially dead code) would speed up a routine. We should be informed about possible exceptions to this, especially if the exception is not obvious.

              For the record, my actual does not consist of adding 1+1 in a loop. The above example is just to illustrate the unexplainable compiler behavior that affects my routines, but in a different kind of setting. If the problem is solved at this simple level, I can apply it in the actual setting.

              By the way, I’d be interested in beta testing (I sent e-mail to support but I didn’t hear from anyone). I mostly want to make sure that my code will compile. When I moved from VB 5 to 6, I had to remove the Register metastatement so that my code would work. I want to see if I can use it now.


              ------------------
              Daniel Corbier
              UCalc Fast Math Parser
              http://www.ucalc.com
              Daniel Corbier
              uCalc Fast Math Parser
              uCalc Language Builder
              sigpic

              Comment


              • #8
                Daniel, the level of information you request has never been released by R&D, so I'm afraid that Tech Support is not able to provide the answers you seek.

                Of course, I'll ask R&D about these questions on your behalf, but I'm guessing that publishing information of that level could impose on intellectual property rights, but I will post any information I can gather.

                BTW, I can confirm that your request to become a beta tester was received and forwarded to R&D. However, your request provided R&D with no Resume (background information) on your experience. You may wish to consider resubmitting a more comprehensive application.

                Please note that beta team membership is tightly controlled and candidates are vetted carefully by R&D. Suitable candidates must be experienced (PowerBASIC) programmers who are willing to dedicate "significant" time to the task, and continue to do so on a daily basis for the duration of the testing phases.

                Even if you are willing to fulfil these requirements, there is still no guarantee that your application will be accepted by R&D.

                I hope this helps!


                ------------------
                Lance
                PowerBASIC Support
                mailto:[email protected][email protected]</A>
                Lance
                mailto:[email protected]

                Comment


                • #9
                  I can understand, that optimizations vary from CPU types, but its strange that declaring a never
                  used variable has (very much) influence on the compiled programs speed. (If I Dim 1 more
                  long in Daniels example, speed drops by 30%!
                  Even if the compiler doesn't strip unused variables, it shouldn't affect the codes
                  execution speed?


                  ------------------
                  Peter.
                  mailto[email protected][email protected]</A>



                  [This message has been edited by Peter Lameijn (edited January 09, 2001).]
                  Regards,
                  Peter

                  Comment


                  • #10
                    Two words: Register Variables.

                    ------------------
                    Lance
                    PowerBASIC Support
                    mailto:[email protected][email protected]</A>
                    Lance
                    mailto:[email protected]

                    Comment


                    • #11
                      Code:
                      Daniel & Peter--
                         
                      Lance's answer might not be the one you're likely to hope for, but 
                      it's the correct one.  The variety of processors in place today is 
                      tremendous, as is their performance.  Strengths and weaknesses vary 
                      greatly.  On my system (with its AMD K6/3-400), the posted code runs 
                      at just about the same speed no matter which lines or variables I rem 
                      out.  There's nothing "random" about what's happening.  And yes, the 
                      potential differences in speed certainly fall within the ranges that 
                      you've identified. 
                         
                      What this implies is that performing lots of benchmarking on a single 
                      system rarely reveals anything whose implications are broad enough to 
                      justify the time spent.  Spend that time reading up on processor 
                      design and caching.  That's what compiler writers do. 
                         
                      Then aim right down the middle.  You'll be able to, even if you can't 
                      see it on your particular system. 
                         
                      --Greg

                      ------------------
                      -- Greg
                      [email protected]

                      Comment


                      • #12
                        There is a coding variable that has not been mentioned here, it is ALIGNMENT.

                        From Daniel's original example,

                        Dim x As Long, y As Ext, z As Long, Answer As Ext, q As Long

                        This may or may not help but put the 32 bit size values first, one after
                        another, then the 80 bit variables next and see if this makes a
                        difference, I have run into my share of code in PB that is alignment
                        sensitive so it is worth a try. The variation range looks like alignment
                        problems so perhaps experiment with a few dummy variables to pad the
                        alignment and see if you get a change in speed, either way.

                        Regards,

                        [email protected]

                        ------------------
                        hutch at movsd dot com
                        The MASM Forum

                        www.masm32.com

                        Comment


                        • #13
                          With #REGISTER DEFAULT (the default setting):
                          Code:
                          Dim x As Long, y As Ext, z As Long, Answer As Ext, q As Long
                          will make X, Y, Z and Answer register variables. Therefore, adding "padding" variables will either remove the register variable status of some variables and/or change the relative position of variable Q within memory.

                          However, I'm 99.9% sure that the compiler will only position memory variables on DWORD boundaries (except for within arrays and UDT's, etc), so I don't believe that padding variables will make much difference per se, but a givin CPU could work more efficiently with certain memory combinations I suppose.

                          AFAIK, the order of declaration of memory variables (and the 1st register variable) is the most important aspect.

                          What we need here is a CPU guru...

                          After-thought: Back in the beginning of this thread, Borje suggested what I consider as to be the best solution for Daniel to optimize his code in terms of speed. That is, move the core function code into the loop, thereby removing a significant overhead of calling the target sub/function (which involves setting up a stack frame each time, allocating LOCAL variable storage, etc).

                          While this may make Daniel's DLL larger, the target here is speed of execution. Smaller code does not always correlate to the fastest code...

                          ------------------
                          Lance
                          PowerBASIC Support
                          mailto:[email protected][email protected]</A>
                          Lance
                          mailto:[email protected]

                          Comment


                          • #14
                            Two words : Register none.
                            Used that and still 30% difference......

                            ------------------
                            Peter.
                            mailto[email protected][email protected]</A>
                            Regards,
                            Peter

                            Comment


                            • #15
                              Two More Words:
                              Inline Functions - The Compiler would handle the inlining...

                              Code:
                               
                              #Register None
                              #Option Version4
                              #Dim All
                              Declare Inline Function TestA (One As Long ) As Long
                              Declare Sub DoTestA_Again(DaVal As Long
                               
                              Inline Function TestA (One As Long ) As Long
                                Function = One + 1&
                              End Function
                              
                              Function PBMain As Long
                                Dim l As Long
                                For l = 1 To 12
                                  StdOut "TestA Val:" & Str$(TestA(l))
                                Next
                                Call DoTestA_Again (l)
                              End Function
                               
                              Sub DoTestA_Again(DaVal As Long
                                StdOut "TestA Val:" & Str$(TestA(DaVal))
                              End Sub
                              ------------------
                              Ron

                              [This message has been edited by Ron Pierce (edited January 09, 2001).]

                              Comment


                              • #16
                                What we need here is a CPU guru...
                                [flame]

                                No, what we need here is non-trivial benchmarks. Simple loops like this really have very little bearing on the performance of an [bold] application. [/bold].

                                I have said before and will say again: most poor performance in software comes from poor design, not the vagaries of complier algorithms.

                                Someone pointed out above the "design" which uses a FUNCTION inside a loop. Instead of a calling a FUUNCTION, unroll the FUNCTION into in-line code inside the loop and eliminate the stack setup overhead.

                                Bottom line: It's not the paintbrush, it's the artist.
                                [/flame]




                                ------------------
                                Michael Mattias
                                Racine WI USA
                                [email protected]
                                Michael Mattias
                                Tal Systems Inc. (retired)
                                Racine WI USA
                                [email protected]
                                http://www.talsystems.com

                                Comment


                                • #17
                                  In the trivial example, inlining the code would be a good optimization.
                                  If, on the other hand, the TestA or TestB functions generated considerable
                                  code and were called upon numerous times by various functions, inlining might
                                  be overruled by code size.

                                  What is relevant is the "benchmark" example was valid. Lost the for..next
                                  and copy/paste the function 100000 times to see...

                                  This isn't art!


                                  ------------------
                                  Ron

                                  Comment


                                  • #18
                                    Daniel;

                                    The best advice given so far was to test your code on multiple
                                    PCs (different CPUs) for a comparison.

                                    Here is a factor that has been over looked :

                                    Advanced CPUs don't simply just increase the speed by increasing
                                    the MHZ speed. Some of the greatest speed improvements are generated
                                    by something called a PipeLine and by Caches.

                                    To get a real gauge of how the code runs on a particular CPU, you
                                    have to turn off "all" BIOS optimizations for the CPU, such as
                                    caches (internal and external) and No PipeLines.

                                    When the CPU is running optimized, "strange" things can happen. Some CPU's
                                    actually "read ahead" and process machine code out of linear order .
                                    It is a thing called "prediction". If the proper byte codes are there
                                    the CPU can actually execute two machines codes at the same time. If
                                    the byte codes can't be run simultaneously, the CPU shifts down
                                    a notch and runs them one at a time.

                                    Steven Pringels posted a URL recently of a deep explanation of how
                                    different CPUs handle things. I'll try to dig it up and post it.

                                    The way CPUs optimize code is amazing.

                                    The main point is that CPUs no longer execute code in a linear fashion
                                    and they do all sorts of tricks to speed things up. Many modern C
                                    compilers offer all sorts of optimizations just to take advantage
                                    of these "tricks". If the code is standard 8088 code that doesn't use
                                    those tricks, then some code that you think should be faster will
                                    actually be slower on the high end CPU.

                                    This is not the fault of the PB compiler. While it may be true that
                                    PB should consider "adding" CPU optimization to the compiler (meaning you
                                    choose the CPU and the compiler optimizes for it), the problem you
                                    are now experiencing is "not" the fault of the PB compiler. The
                                    problem is not some kind of random and strange behavior of the compiler.
                                    The problem is the "strange behavior" of many high end CPUs.

                                    This is definitely a CPU problem and not a compiler problem.


                                    ------------------
                                    Chris Boss
                                    Computer Workshop
                                    Developer of "EZGUI"
                                    http://cwsof.com
                                    http://twitter.com/EZGUIProGuy

                                    Comment


                                    • #19
                                      Daniel;


                                      This link is a "must read" for you :

                                      http://www.emulators.com/pentium4.htm

                                      It took me an hour and I only got 3/4 the way through the document.
                                      It is long, but definitely worth reading !

                                      You will feel like an Intel CPU Engineer by time your done.

                                      If your code must be optimized, then benchmarking on multiple CPUs
                                      is "critical" . Find what works best on each CPU and the make a list
                                      of optimization tricks.

                                      Be forewarned, that the tricks needed on a high end CPU many times
                                      relate to the order of the machine codes (which you can't control
                                      very well with the PB compiler). This means the only real way to
                                      optimize in some situations is to resort to assembler. Even with
                                      high end C compilers you still have to do this.

                                      Also if your code must be run on multiple CPUs, you may want to "test"
                                      for which CPU is being used and then use different sections of code
                                      for each CPU type. As Lance mentioned, forget about the size of your
                                      DLL. Be prepared to add a lot of extra code to compensate for different
                                      CPU's.


                                      ------------------
                                      Chris Boss
                                      Computer Workshop
                                      Developer of "EZGUI"
                                      http://cwsof.com
                                      http://twitter.com/EZGUIProGuy

                                      Comment


                                      • #20
                                        Michael Mattias wrote:
                                        Bottom line: It's not the paintbrush, it's the artist.

                                        Thank you, Mr. Cliche.

                                        Lance Edmonds wrote:
                                        What we need here is a CPU guru...

                                        Well, I'm not exactly a "guru" (but I play one on TV ), but I'll take a stab at one possibility...

                                        In the P6 archetecture (which encompasses the Pentium Pro, Pentium II / III), the eight general-purpose 32-bit registers that we know and love don't actually exist anymore. Instead, the core contains a whole bunch of 32-bit registers, which the CPU itself decides how to reassign as "reflections", if you will, of the G.P. registers during execution time. (Intel refers to this technique as "register renaming".) The reason for this is to increase the CPU's ability to do superscalar out-of-order execution on code where the same register is used repeatedly in the code, but for mutually exclusive operations; the CPU will simply remap two consecutive uses of EAX, for example, into different internal registers while it executes them if it thinks the one isn't dependant on the other. For example:
                                        Code:
                                        SUB   EAX, EBX
                                        MOV   RESULT, EAX
                                        POP   EAX
                                        the SUB/MOV instructions' use of EAX are interdependant, so they have "their" EAX reflected to one internal register and must execute serially; POP, however, obviously doesn't care what the result of the SUB/MOV are, so "its" EAX is reflected to a different internal register. MOV and POP can then be executed simultaneously by the core.

                                        Unfortunately, the P6 is subject to a condition known as "partial register stall". PRS occurs when when a partial register (the AL, AH, and AX parts of the EAX register, for example) get renamed to different internal registers because the processor thinks the uses are mutually exclusive - and then they turn out not to be, and the CPU has to stop in its tracks, throw away the results of its out-of-order execution, and re-do the operations with the correct register mapping. For instance, the following code:
                                        Code:
                                        MOV   AL, TempByte
                                        NEG   AL
                                        SBB   EAX, EAX
                                        sets the carry flag if TempByte <> 0, and then makes EAX equal to either 00000000h or FFFFFFFFh depending on the result of the carry flag. This code results in a stall, because the CPU remaps the AL register and the EAX registers to different internal registers, executes these instructions out-of-order, then discovers its mistake after the fact. The P6 core then flushes the pipeline and decodes the instructions a second time to get it right. (Yes, it seems obvious to us that the three are interdependant, but apparantly not to the P6's instruction decoder.

                                        This problem does not affect the Pentium "Classic" (P5 core) or earlier Intel x86 generations, or the AMD K5 / K6 chips, since they do not use register renaming. It also does not affect the AMD K7 core (Athlon, Duron, Thunderbird), even though K7 does use register renaming, because AMD put design effort into eliminating this particular problem.

                                        While I can't say with certainty how the partial register stall problem might cause the problems Daniel is seeing with his code, since I haven't looked at the assembly-code output from the PB/DLL compiler, it does serve quite handily to refute Michael's assertion that:

                                        I have said before and will say again: most poor performance in software comes from poor design, not the vagaries of complier algorithms.

                                        ...because the Microsoft Visual C++ 6.0 compiler, even with the "P6 family" optimizations turned on, generates code which consistently causes register stalls, while the older VC++ 4.2 with "P5 family" optimizations only generates code which might cause register stalls, and the VC++ 7.0 beta finally gets it right at the expense of larger code size. :P So, depending on which tools the PowerBASIC team is using to create the compiler, and on what code the compiler itself generates...

                                        ------------------


                                        [This message has been edited by Gary Akins (edited January 09, 2001).]

                                        Comment

                                        Working...
                                        X