Announcement

Collapse
No announcement yet.

asm multiply tricks

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • asm multiply tricks

    The old MUL instruction is very slow on modern pentiums so if
    you have to hard code a multiplication into an algorithm, there
    is an alternative way using LEA which is much faster.

    Code:
          ! mov eax, var
      
          ' ! lea eax, [eax+eax]    ; x 2
          ' ! lea eax, [eax*2+eax]  ; x 3
          ' ! lea eax, [eax*4]      ; x 4
          ' ! lea eax, [eax*4+eax]  ; x 5
      
          ' ! lea ecx, [eax*2]
          ' ! lea eax, [eax*4+ecx]  ; x 6
      
          ' ! lea ecx, [eax*2+eax]
          ' ! lea eax, [eax*4+ecx]  ; x 7
      
          ' ! lea eax, [eax*8]      ; x 8
      
          ' ! lea eax, [eax*8+eax]  ; x 9
      
          ' ! lea ecx, [eax*2]
          ' ! lea eax, [eax*8+ecx]  ; x 10
    LEA is often a source of AGI stalls but even so, it is a lot
    faster than MUL.

    Regards,

    [email protected]

    ------------------
    hutch at movsd dot com
    The MASM Forum - SLL Modules and PB Libraries

    http://www.masm32.com/board/index.php?board=69.0

  • #2
    Sorry, just playing... But interesting...

    Code:
    Loop of 200,000,000
    
    	loop overhead
    		3.02
    		3.02
    		3.02
    		3.02
    
    	l = l * 10
    		5.37 - 3.02 = 2.35
    		5.38 - 3.02 = 2.36
    		5.44 - 3.02 = 2.42
    		5.44 - 3.02 = 2.42	
    
    	! mov eax, var
            ! lea ecx, [eax*2]
            ! lea eax, [eax*8+ecx]
    		4.34 - 3.02 = 1.32
    		4.23 - 3.02 = 1.21
    		4.23 - 3.02 = 1.21
    		4.29 - 3.02 = 1.27
    Colin Schmidt

    ------------------
    Colin Schmidt & James Duffy, Praxis Enterprises, Canada
    [email protected]

    Comment


    • #3
      Steve,

      Thanks a bunch for this post. I've been using this trick quite
      a bit!

      Could you elaborate a little on what an "AGI stall" is?

      BTW, it is possible to hard code almost any multiple (other than
      prime #'s > 5) using just one register.

      Code:
      ' ! lea eax, [eax + eax]      
      ' ! lea eax, [eax*2+eax]  ; x 6
      
      
      ' ! lea eax, [eax*4 + eax]      
      ' ! lea eax, [eax+eax]    ; x 10


      ------------------
      Bernard Ertl
      Bernard Ertl
      InterPlan Systems

      Comment


      • #4
        Bern,

        Its not an easy question to answer in a hurry but the general outline is that
        the pipeline in a later Intel processor is a 5 stage production line. While
        any single instruction takes 5 cycles at best to go through the pipeline, because
        instructions are sequenced one after another, the instruction completion rate
        is then about 1 cycle in theory.

        Dual pipelines theoretically allow 2 instruction to pass through in the time
        it takes to process one but there are a very complicated set of rules to get
        this to happen reliably and it cannot always be done. Intel introduced by
        stealth the idea of a preferred subset of instructions that pair better through
        the pipeline and this is more or less a "Reduced Instruction Set Computer"
        concept, RISC.

        Backwards compatibility allows all of the old instructions but some of them
        are very slow in comparison to the small preferred instructions. AGI means
        "Address Grid Interlock" and it means effectively that a register that is within
        the pipeline cannot be accessed by another instruction that is entering the
        pipeline. The instruction that requires the register that is already in the
        pipeline has to wait until the first one is retired so it "STALLS" until
        this happens.

        This very seriously slows code down and while it just does not matter in many
        cases, if you have a very small inner loop that repeatedly uses the same
        register, there will be some delay on each iteration. LEA can be a culprit
        because it can multiply reference a register,
        Code:
            ! lea eax, [eax*eax+1]
        The best around is Agner Fog's optimisation manual which is on my home site at
        http://www.pbq.com.au/home/hutch/agner.htm

        I have produced the winhelp version for him as it is very good information
        to have available when chasing speed where you need it.

        Regards,

        [email protected]


        ------------------
        hutch at movsd dot com
        The MASM Forum - SLL Modules and PB Libraries

        http://www.masm32.com/board/index.php?board=69.0

        Comment


        • #5
          Steve,

          Thanks. Your explanation is clear.

          Even with the stalls, a couple of back to back !lea instructions
          still clock faster than a !mul (or !imul).

          Any idea if there is any (performance) difference between:

          !lea eax, [eax+eax]

          vs.

          !lea eax, [eax*2]



          ------------------
          Bernard Ertl
          Bernard Ertl
          InterPlan Systems

          Comment


          • #6
            Bern,

            From the literature I have seen, LEA EAX, [EAX+EAX] is internally simpler
            than LEA EAX, [EAX*2] but I think that ADD EAX, EAX would be preferred as
            it is generally faster.

            When you get to the stage where you are clocking the main loop after you
            have it all running properly, you can think in terms of spacing back to
            back LEA instructions with some of the increments that may be performed
            in the LEA.
            Code:
                LEA ---
                ADD ---
                INC ---
                LEA ---
            Instruction re-ordering where it is possible can often work for you if you
            have enough other things to do in the loop.

            Another trick if the code design allows it is to "unroll" the loop. This
            means if you have a simple short loop that appears to be suffering from
            loop overhead, if you double what is being done in the loop, you half the
            loop overhead.

            This MASM code is a simple byte scanner to find a zero string terminator.
            Code:
                @@:
                  mov dl, [eax]         ;  1
                  inc eax               ;  1
                  cmp dl, dh            ;  1
                  jne @B                ;  3 when jmp taken
            Because it is such a small loop, the loop overhead is a considerable factor
            in how fast it runs. Unrolling it by a factor of 2 reduces the loop overhead
            and it clocks faster.
            Code:
                @@:
                  mov dl, [eax]         ;  1
                  inc eax               ;  1
                  cmp dl, dh            ;  1
                  je @F                 ;  1 when not taken
                  mov dl, [eax]         ;  1
                  inc eax               ;  1
                  cmp dl, dh            ;  1
                  jne @B                ;  3 when jmp taken
                @@:
            A straight cycle count shows that for every character compared in the first
            loop, it takes theoretically 6 cycles where in the unrolled loop you get 2
            characters compared for 10 cycles which is theoretically 5 cycles for each
            comparison.

            You can actually make this loop shorter but it does not run as fast,
            Code:
                @@:
                  inc eax               ;  1
                  cmp [eax-1], dh       ;
                  je @F                 ;  1 when not taken
                  inc eax               ;  1
                  cmp [eax-1], dh       ;
                  jne @B                ;  3 when jmp taken
                @@:
            Comparing the memory operand [eax-1] is slower than using the RISC approach
            of moving the address into a register first then doing the comparison.

            The advantage when you are doing more complex things than a simple byte scanner
            is that you have a longer loop where you have more options with instruction
            re-ordering.

            Regards,

            [email protected]

            ------------------
            hutch at movsd dot com
            The MASM Forum - SLL Modules and PB Libraries

            http://www.masm32.com/board/index.php?board=69.0

            Comment


            • #7
              Is this discussion applicable to Pentium-class chips in general, or only to specific
              models and, if so, which?

              I've seen a lot of "optimizations" where, say, the new code worked twice as fast on a
              Pentium II, half as fast on a Pentium III, and exactly the same on an AMD K6. This sort
              of thing leaves me with serious questions about the value of trying to optimize at the
              opcode level unless you have a very specific target machine in mind.


              ------------------
              Tom Hanlin
              PowerBASIC Staff

              Comment


              • #8
                Tom,

                I think you have a good point here, I own a K6 550 AMD as a backup box that
                has the fast LOOP instruction that required a patch in win95b so that Windows
                would run.

                Pipeline technology started in PCs with the 486, dual pipelines in early
                pentiums upwards and similar technology is being used by AMD in their later
                processors, Athlon etc ..

                The PII and PIII have very similar architecture so much of what works on one
                works on the other OK. It is when you specifically target a narrow feature of
                one processor that you pay a penalty on others. Writing XMM restricts you to
                PIII and later Intel processors, MMX rules out early pentiums and 486
                processors so if you want you software to run on processors from 486 upwards,
                you are basically stuck with using the mixed integer instructions and
                standard floating point range.

                With the example, LEA had a serious workover in the 486 so most of the usage
                of LEA is valid from 486 up. Unrolling a loop is basic logic, do more in each
                iteration of a loop and you reduce the overhead in many instances.

                The problem is that reverting back to pre 486 code design does not help, the
                technology for pre pipeline processors is well understood and it generally
                meant chosing the shortest set of instructions to do the job, the string
                instructions are a good example of this, they area shorter instructions and
                they run faster on a 386 and lower but "who cares" when you are writing code
                that requires the instruction range from 486 upwards.

                Instruction like LOOP/LOOPNZ JCXZ are leftover 386 technology that are a lot
                slower on a 486 upwards than the RISC style replacements. The exceptions are
                REP MOVSD and REP STOSD which seem to be well optimised on later pentium
                processors.

                You can often get smaller code by the pre 486 indexing techniques and often
                it will run faster but there are some notable exceptions, the example I cited
                above,
                Code:
                    mov dl, [eax]         ;  1
                    inc eax               ;  1
                    cmp dl, dh            ;  1
                almost exclusively outperforms the shorter form,
                Code:
                    inc eax               ;  1
                    cmp [eax-1], dh
                The choices seem to be if you want code to run well on most processors is to
                stick to the most general set of characteristics, everyone is faced with this
                choice from compiler writer to programmers doing specific targetting in a
                program and as the reference material is much the same to all interested
                parties, you are faced with leaving the old DOS stuff behind and modernising
                code unless you want it to run badly on all post 486 processors.

                Regards,

                [email protected]

                ------------------
                hutch at movsd dot com
                The MASM Forum - SLL Modules and PB Libraries

                http://www.masm32.com/board/index.php?board=69.0

                Comment


                • #9
                  Hutch
                  In general I agree with you, however I think the following construct would be even quicker.

                  Code:
                  	comp [eax], dl
                  	inc eax
                  this would make the code for your byte scanner as follows

                  Code:
                  	! mov dl, 0
                  	! mov eax, startaddress
                  Lp:
                  	! cmp [eax], dl
                  	! je Lend
                  	! inc eax
                  	! cmp [eax], dl
                  	! je Lend
                  	! inc eax
                  	! jmp Lp
                  Lend:
                  I tried pointing out some of the speed variables due to pipeline processing about a year ago, but as I suggested that in some cases it could make the PB sacred cow of register variables slower than memory, it was taken poorly. I have tested a range of Intel, AMD and Cyrix processors with wildely varying results. The speed, size and type of the L1 cache being the major factor.


                  ------------------

                  Comment


                  • #10
                    John,

                    I agree with you that the L1 cache is a lot of the reason for
                    variation between processor versions but unless you wish to try
                    processor specific methods like strip mining which will vary
                    considerably from one processor to another, I don't think there
                    is a lot that can be done about it.

                    The question of register variable is one that does not bother me
                    as I write a reasonable amount of code that does not use inline
                    assembler and generally the automatic allocation of registers to
                    variables works OK.

                    Having the capacity to turn register optimisation off does the job
                    fine where you need to code the entire function yourself in assembler.


                    The code you posted looks unusual in that it has an extra unconditional
                    jump but the problem I have found where I have to do byte scanners
                    is that comparing 2 registers is almost exclusively faster that
                    comparing a register to a memory operand.

                    Code:
                        @@:
                          inc eax               ;  1
                          cmp [eax-1], dh       ;
                          je @F                 ;  1 when not taken
                          inc eax               ;  1
                          cmp [eax-1], dh       ;
                          jne @B                ;  3 when jmp taken
                        @@:
                    This code is an instruction shorter than the code you suggested,
                    Code:
                    Lp:
                      ! cmp [eax], dl
                      ! je Lend
                      ! inc eax
                      ! cmp [eax], dl
                      ! je Lend
                      ! inc eax
                      ! jmp Lp
                    Lend:
                    and the difference is trhe unconditional jump at the end of the loop
                    which will generally make the loop slower.

                    I would be interested to see what you tested this loop on.

                    Regards,

                    [email protected]


                    ------------------
                    hutch at movsd dot com
                    The MASM Forum - SLL Modules and PB Libraries

                    http://www.masm32.com/board/index.php?board=69.0

                    Comment


                    • #11
                      Hutch
                      finally had some time to do some testing on an AMD Duron 650.
                      I loaded 99999999 bytes of a 100000000 byte array with non zero values and zero in the last byte. I then tested 4 different methods. 2 you described ie (1) cmp [eax - 1], dh (2) cmp dl, dh (3) my prior post with an unconditional jump and (4) repne scasb. I looped through each test 100 times
                      The results are surprising.
                      (1) 58.88 secs
                      (2) 65.72 secs
                      (3) 61.21 secs
                      (4) 63.22 secs
                      The code for the repne scasb follows, whilst not the fastest it has the added safety factor of always terminating.
                      Code:
                      T3! = TIMER
                      ! mov ebx, 100
                      r4:
                      ! MOV eax, 0
                      ! mov ecx, 100000001
                      ! MOV edi, Sa???
                      ! CLD
                      ! repne scasb
                      ! dec ebx
                      ! jnz r4
                      T4! = TIMER

                      ------------------

                      Comment


                      • #12
                        John,

                        Thanks for posting these results, it shows some hardware difference to
                        the PIII 600 I am working on. They are interesting results for a number
                        of reasons, the timings look very much like the type of spacing that
                        you would get on a 486 but much faster with the 650 Duron.

                        the reg, [mem] compare is in line with 486 optimisation, the extra step
                        of loading the register before doing the comparison appears to be taking
                        the extra time of another instruction.

                        Your 1st algo with the extra jump is slightly slower but surprisingly
                        enough the SCAS instruction has performed a lot better than it would on
                        a later Intel processor.

                        Rough guess is that the Duron has a shorter pipeline and faster fetch on
                        memory operands but is slower on small instructions so its optimised code
                        type would favour shorter instruction paths and less expansion into small
                        instructions.

                        Regards,

                        [email protected]

                        ------------------
                        hutch at movsd dot com
                        The MASM Forum - SLL Modules and PB Libraries

                        http://www.masm32.com/board/index.php?board=69.0

                        Comment


                        • #13
                          >but I think that ADD EAX, EAX would be preferred as
                          it is generally faster.
                          I found this thread whilst searching for the fastest way to X by 2.
                          Is add faster than shl too?

                          Comment

                          Working...
                          X