asm multiply tricks


  • Dean Gwilliam
    replied
    >but I think that ADD EAX, EAX would be preferred as
    it is generally faster.
    I found this thread whilst searching for the fastest way to multiply by 2.
    Is ADD faster than SHL too?



  • Steve Hutchesson
    replied
    John,

    Thanks for posting these results; they show some hardware differences from
    the PIII 600 I am working on. They are interesting results for a number
    of reasons: the timings look very much like the type of spacing that
    you would get on a 486, but much faster on the 650 Duron.

    The reg, [mem] compare is in line with 486 optimisation; the extra step
    of loading the register before doing the comparison appears to cost
    the extra time of another instruction.

    Your 1st algo with the extra jump is slightly slower but surprisingly
    enough the SCAS instruction has performed a lot better than it would on
    a later Intel processor.

    My rough guess is that the Duron has a shorter pipeline and faster fetch on
    memory operands but is slower on small instructions, so its optimised code
    would favour shorter instruction paths with less expansion into small
    instructions.

    Regards,

    [email protected]

    ------------------



  • John Petty
    replied
    Hutch
    finally had some time to do some testing on an AMD Duron 650.
    I loaded 99999999 bytes of a 100000000 byte array with non-zero values and zero in the last byte. I then tested 4 different methods, 2 of which you described:
    (1) cmp [eax - 1], dh
    (2) cmp dl, dh
    (3) my prior post with an unconditional jump
    (4) repne scasb
    I looped through each test 100 times.
    The results are surprising.
    (1) 58.88 secs
    (2) 65.72 secs
    (3) 61.21 secs
    (4) 63.22 secs
    The code for the repne scasb follows; whilst not the fastest, it has the added safety of always terminating.
    Code:
    T3! = TIMER
    ! mov ebx, 100
    r4:
    ! MOV eax, 0
    ! mov ecx, 100000001
    ! MOV edi, Sa???
    ! CLD
    ! repne scasb
    ! dec ebx
    ! jnz r4
    T4! = TIMER
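
    For comparison outside the assembler, the bounded scan above can be sketched in C with the standard memchr (a sketch only, not the code that was timed; memchr is bounded like repne scasb, so it likewise always terminates):

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Scan a buffer for the first zero byte, as the repne scasb loop does.
       Returns the index of the zero, or -1 if none within len bytes. */
    static long find_zero(const unsigned char *buf, size_t len)
    {
        const unsigned char *hit = memchr(buf, 0, len);
        return hit ? (long)(hit - buf) : -1;
    }

    int main(void)
    {
        size_t len = 1000;
        unsigned char *buf = malloc(len);
        if (!buf) return 1;
        memset(buf, 0xFF, len);      /* non-zero values, like the test array */
        buf[len - 1] = 0;            /* zero in the last byte                */
        assert(find_zero(buf, len) == (long)(len - 1));
        free(buf);
        return 0;
    }
    ```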

    ------------------



  • Steve Hutchesson
    replied
    John,

    I agree with you that the L1 cache is a lot of the reason for
    variation between processor versions, but unless you wish to try
    processor specific methods like strip mining, which vary
    considerably from one processor to another, I don't think there
    is a lot that can be done about it.

    The question of register variables is one that does not bother me,
    as I write a reasonable amount of code that does not use inline
    assembler, and generally the automatic allocation of registers to
    variables works OK.

    Having the capacity to turn register optimisation off does the job
    fine where you need to code the entire function yourself in assembler.


    The code you posted looks unusual in that it has an extra unconditional
    jump, but the problem I have found when writing byte scanners
    is that comparing 2 registers is almost exclusively faster than
    comparing a register to a memory operand.

    Code:
        @@:
          inc eax               ;  1
          cmp [eax-1], dh       ;
          je @F                 ;  1 when not taken
          inc eax               ;  1
          cmp [eax-1], dh       ;
          jne @B                ;  3 when jmp taken
        @@:
    This code is an instruction shorter than the code you suggested,
    Code:
    Lp:
      ! cmp [eax], dl
      ! je Lend
      ! inc eax
      ! cmp [eax], dl
      ! je Lend
      ! inc eax
      ! jmp Lp
    Lend:
    and the difference is the unconditional jump at the end of the loop,
    which will generally make the loop slower.

    I would be interested to see what you tested this loop on.

    Regards,

    [email protected]


    ------------------



  • John Petty
    replied
    Hutch
    In general I agree with you, however I think the following construct would be even quicker.

    Code:
    	cmp [eax], dl
    	inc eax
    this would make the code for your byte scanner as follows

    Code:
    	! mov dl, 0
    	! mov eax, startaddress
    Lp:
    	! cmp [eax], dl
    	! je Lend
    	! inc eax
    	! cmp [eax], dl
    	! je Lend
    	! inc eax
    	! jmp Lp
    Lend:
    I tried pointing out some of the speed variables due to pipeline processing about a year ago, but as I suggested that in some cases it could make the PB sacred cow of register variables slower than memory, it was taken poorly. I have tested a range of Intel, AMD and Cyrix processors with widely varying results, the speed, size and type of the L1 cache being the major factor.


    ------------------



  • Steve Hutchesson
    replied
    Tom,

    I think you have a good point here, I own a K6 550 AMD as a backup box that
    has the fast LOOP instruction that required a patch in win95b so that Windows
    would run.

    Pipeline technology started in PCs with the 486, with dual pipelines from the
    early Pentiums upwards, and similar technology is used by AMD in their later
    processors, the Athlon etc.

    The PII and PIII have very similar architecture so much of what works on one
    works on the other OK. It is when you specifically target a narrow feature of
    one processor that you pay a penalty on others. Writing XMM code restricts you to
    the PIII and later Intel processors, and MMX rules out early Pentiums and 486
    processors, so if you want your software to run on processors from the 486 upwards,
    you are basically stuck with using the mixed integer instructions and the
    standard floating point range.

    With the example, LEA had a serious workover in the 486 so most of the usage
    of LEA is valid from 486 up. Unrolling a loop is basic logic, do more in each
    iteration of a loop and you reduce the overhead in many instances.

    The problem is that reverting to pre-486 code design does not help. The
    technology for pre-pipeline processors is well understood and generally
    meant choosing the shortest set of instructions to do the job; the string
    instructions are a good example of this, as they are shorter instructions and
    they run faster on a 386 and lower, but "who cares" when you are writing code
    that requires the instruction range from the 486 upwards.

    Instructions like LOOP/LOOPNZ and JCXZ are leftover 386 technology that are a lot
    slower on a 486 upwards than the RISC style replacements. The exceptions are
    REP MOVSD and REP STOSD, which seem to be well optimised on later Pentium
    processors.

    You can often get smaller code by the pre 486 indexing techniques and often
    it will run faster but there are some notable exceptions, the example I cited
    above,
    Code:
        mov dl, [eax]         ;  1
        inc eax               ;  1
        cmp dl, dh            ;  1
    almost exclusively outperforms the shorter form,
    Code:
        inc eax               ;  1
        cmp [eax-1], dh
    If you want code to run well on most processors, the choice seems to be to
    stick to the most general set of characteristics. Everyone is faced with this
    choice, from compiler writers to programmers doing specific targetting in a
    program, and as the reference material is much the same for all interested
    parties, you are faced with leaving the old DOS stuff behind and modernising
    code unless you want it to run badly on all post-486 processors.

    Regards,

    [email protected]

    ------------------



  • Tom Hanlin
    replied
    Is this discussion applicable to Pentium-class chips in general, or only to specific
    models and, if so, which?

    I've seen a lot of "optimizations" where, say, the new code worked twice as fast on a
    Pentium II, half as fast on a Pentium III, and exactly the same on an AMD K6. This sort
    of thing leaves me with serious questions about the value of trying to optimize at the
    opcode level unless you have a very specific target machine in mind.


    ------------------
    Tom Hanlin
    PowerBASIC Staff



  • Steve Hutchesson
    replied
    Bern,

    From the literature I have seen, LEA EAX, [EAX+EAX] is internally simpler
    than LEA EAX, [EAX*2] but I think that ADD EAX, EAX would be preferred as
    it is generally faster.

    When you get to the stage where you are clocking the main loop after you
    have it all running properly, you can think in terms of spacing back to
    back LEA instructions with some of the increments that may be performed
    in the LEA.
    Code:
        LEA ---
        ADD ---
        INC ---
        LEA ---
    Instruction re-ordering where it is possible can often work for you if you
    have enough other things to do in the loop.

    Another trick, if the code design allows it, is to "unroll" the loop. This
    means that if you have a simple short loop that appears to be suffering from
    loop overhead, doubling what is done in each iteration halves the
    loop overhead.

    This MASM code is a simple byte scanner to find a zero string terminator.
    Code:
        @@:
          mov dl, [eax]         ;  1
          inc eax               ;  1
          cmp dl, dh            ;  1
          jne @B                ;  3 when jmp taken
    Because it is such a small loop, the loop overhead is a considerable factor
    in how fast it runs. Unrolling it by a factor of 2 reduces the loop overhead
    and it clocks faster.
    Code:
        @@:
          mov dl, [eax]         ;  1
          inc eax               ;  1
          cmp dl, dh            ;  1
          je @F                 ;  1 when not taken
          mov dl, [eax]         ;  1
          inc eax               ;  1
          cmp dl, dh            ;  1
          jne @B                ;  3 when jmp taken
        @@:
    A straight cycle count shows that for every character compared in the first
    loop, it takes theoretically 6 cycles where in the unrolled loop you get 2
    characters compared for 10 cycles which is theoretically 5 cycles for each
    comparison.
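
    The same unrolling idea can be sketched in C to check the logic (a hand-rolled scanner for illustration only; the function name is my own):

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Byte scanner unrolled by a factor of 2: two compares per loop iteration,
       so the backward-branch overhead is paid once per two characters. */
    static size_t scan_len(const char *s)
    {
        const char *p = s;
        for (;;) {
            if (*p == '\0') break;      /* first compare            */
            ++p;
            if (*p == '\0') break;      /* second compare           */
            ++p;                        /* one branch back per 2 bytes */
        }
        return (size_t)(p - s);
    }

    int main(void)
    {
        assert(scan_len("") == 0);
        assert(scan_len("a") == 1);
        assert(scan_len("hello") == 5);
        return 0;
    }
    ```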

    You can actually make this loop shorter but it does not run as fast,
    Code:
        @@:
          inc eax               ;  1
          cmp [eax-1], dh       ;
          je @F                 ;  1 when not taken
          inc eax               ;  1
          cmp [eax-1], dh       ;
          jne @B                ;  3 when jmp taken
        @@:
    Comparing the memory operand [eax-1] is slower than using the RISC approach
    of moving the address into a register first then doing the comparison.

    The advantage when you are doing more complex things than a simple byte scanner
    is that you have a longer loop where you have more options with instruction
    re-ordering.

    Regards,

    [email protected]

    ------------------



  • Bern Ertl
    replied
    Steve,

    Thanks. Your explanation is clear.

    Even with the stalls, a couple of back to back !lea instructions
    still clock faster than a !mul (or !imul).

    Any idea if there is any (performance) difference between:

    !lea eax, [eax+eax]

    vs.

    !lea eax, [eax*2]



    ------------------
    Bernard Ertl



  • Steve Hutchesson
    replied
    Bern,

    It's not an easy question to answer in a hurry, but the general outline is that
    the pipeline in a later Intel processor is a 5-stage production line. While
    any single instruction takes at best 5 cycles to go through the pipeline, because
    instructions are sequenced one after another, the completion rate in theory
    approaches 1 instruction per cycle.

    Dual pipelines theoretically allow 2 instructions to pass through in the time
    it takes to process one, but there is a very complicated set of rules to get
    this to happen reliably and it cannot always be done. Intel introduced by
    stealth the idea of a preferred subset of instructions that pair better through
    the pipeline, and this is more or less a "Reduced Instruction Set Computer"
    concept, RISC.

    Backwards compatibility allows all of the old instructions, but some of them
    are very slow in comparison to the small preferred instructions. AGI means
    "Address Generation Interlock": effectively, a register that is still within
    the pipeline cannot be used for address generation by another instruction that
    is entering the pipeline. The instruction that requires the register already in
    the pipeline has to wait until the first one is retired, so it "STALLS" until
    this happens.

    This very seriously slows code down and while it just does not matter in many
    cases, if you have a very small inner loop that repeatedly uses the same
    register, there will be some delay on each iteration. LEA can be a culprit
    because it can reference the same register more than once,
    Code:
        ! lea eax, [eax*2+eax+1]
    The best around is Agner Fog's optimisation manual which is on my home site at
    http://www.pbq.com.au/home/hutch/agner.htm

    I have produced the winhelp version for him as it is very good information
    to have available when chasing speed where you need it.
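
    The dependency idea behind such a stall can be sketched in C (the stall itself is a hardware effect, so this only shows the shape of one serial dependency chain versus two independent chains; function names are illustrative):

    ```c
    #include <assert.h>

    /* Every add in 'serial_sum' needs the previous result before it can
       start, like a register reused before it retires from the pipeline. */
    static int serial_sum(const int *v, int n)
    {
        int s = 0;
        for (int i = 0; i < n; ++i)
            s += v[i];                  /* each add waits on the last s   */
        return s;
    }

    /* Two independent accumulators give the hardware work it can overlap. */
    static int split_sum(const int *v, int n)   /* assumes n is even */
    {
        int a = 0, b = 0;
        for (int i = 0; i < n; i += 2) {
            a += v[i];                  /* independent chain 1 */
            b += v[i + 1];              /* independent chain 2 */
        }
        return a + b;
    }

    int main(void)
    {
        int v[] = {1, 2, 3, 4, 5, 6};
        assert(serial_sum(v, 6) == 21);
        assert(split_sum(v, 6) == 21);
        return 0;
    }
    ```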

    Regards,

    [email protected]


    ------------------



  • Bern Ertl
    replied
    Steve,

    Thanks a bunch for this post. I've been using this trick quite
    a bit!

    Could you elaborate a little on what an "AGI stall" is?

    BTW, it is possible to hard code almost any multiple (other than
    prime #'s > 5) using just one register.

    Code:
    ' ! lea eax, [eax+eax]      ; x 2
    ' ! lea eax, [eax*2+eax]    ; x 6

    ' ! lea eax, [eax*4+eax]    ; x 5
    ' ! lea eax, [eax+eax]      ; x 10
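
    As a cross-check of the arithmetic in the two chains above, here is the same shift-and-add sequence in C (illustrative helper names, not PowerBASIC syntax):

    ```c
    #include <assert.h>

    /* x * 6 as the two-lea chain: double, then *2 plus itself */
    static int mul6(int x)
    {
        x = x + x;            /* lea eax, [eax+eax]    -> x * 2 */
        x = (x << 1) + x;     /* lea eax, [eax*2+eax]  -> x * 6 */
        return x;
    }

    /* x * 10 as the two-lea chain: *4 plus itself, then double */
    static int mul10(int x)
    {
        x = (x << 2) + x;     /* lea eax, [eax*4+eax]  -> x * 5  */
        x = x + x;            /* lea eax, [eax+eax]    -> x * 10 */
        return x;
    }

    int main(void)
    {
        assert(mul6(7) == 42);
        assert(mul10(7) == 70);
        return 0;
    }
    ```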


    ------------------
    Bernard Ertl



  • Colin Schmidt
    replied
    Sorry, just playing... But interesting...

    Code:
    Loop of 200,000,000
    
    	loop overhead
    		3.02
    		3.02
    		3.02
    		3.02
    
    	l = l * 10
    		5.37 - 3.02 = 2.35
    		5.38 - 3.02 = 2.36
    		5.44 - 3.02 = 2.42
    		5.44 - 3.02 = 2.42	
    
    	! mov eax, var
            ! lea ecx, [eax*2]
            ! lea eax, [eax*8+ecx]
    		4.34 - 3.02 = 1.32
    		4.23 - 3.02 = 1.21
    		4.23 - 3.02 = 1.21
    		4.29 - 3.02 = 1.27
    Colin Schmidt

    ------------------
    Colin Schmidt & James Duffy, Praxis Enterprises, Canada
    [email protected]



  • Steve Hutchesson
    started a topic asm multiply tricks


    The old MUL instruction is very slow on modern pentiums so if
    you have to hard code a multiplication into an algorithm, there
    is an alternative way using LEA which is much faster.

    Code:
          ! mov eax, var
      
          ' ! lea eax, [eax+eax]    ; x 2
          ' ! lea eax, [eax*2+eax]  ; x 3
          ' ! lea eax, [eax*4]      ; x 4
          ' ! lea eax, [eax*4+eax]  ; x 5
      
          ' ! lea ecx, [eax*2]
          ' ! lea eax, [eax*4+ecx]  ; x 6
      
          ' ! lea ecx, [eax*2+eax]
          ' ! lea eax, [eax*4+ecx]  ; x 7
      
          ' ! lea eax, [eax*8]      ; x 8
      
          ' ! lea eax, [eax*8+eax]  ; x 9
      
          ' ! lea ecx, [eax*2]
          ' ! lea eax, [eax*8+ecx]  ; x 10
    LEA is often a source of AGI stalls but even so, it is a lot
    faster than MUL.
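
    Each LEA form in the table computes a shift-and-add, and the arithmetic can be checked in C (helper names are mine, added for illustration; comments note the corresponding LEA):

    ```c
    #include <assert.h>

    static int mul3(int x)  { return (x << 1) + x; }              /* [eax*2+eax] */
    static int mul5(int x)  { return (x << 2) + x; }              /* [eax*4+eax] */
    static int mul7(int x)  { return (x << 2) + ((x << 1) + x); } /* ecx = [eax*2+eax]; [eax*4+ecx] */
    static int mul9(int x)  { return (x << 3) + x; }              /* [eax*8+eax] */
    static int mul10(int x) { return (x << 3) + (x << 1); }       /* ecx = [eax*2]; [eax*8+ecx] */

    int main(void)
    {
        assert(mul3(6)  == 18);
        assert(mul5(6)  == 30);
        assert(mul7(6)  == 42);
        assert(mul9(6)  == 54);
        assert(mul10(6) == 60);
        return 0;
    }
    ```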

    Regards,

    [email protected]

    ------------------