>but I think that ADD EAX, EAX would be preferred as
it is generally faster.
I found this thread whilst searching for the fastest way to multiply by 2.
Is add faster than shl too?
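For the multiply-by-2 case, all three encodings compute the same value; the question in the thread is purely about which one the processor executes fastest. A minimal Python check of the arithmetic equivalence (the comments name the asm each function models):

```python
def by_add(x): return x + x       # ADD EAX, EAX
def by_shl(x): return x << 1      # SHL EAX, 1
def by_lea(x): return x + x * 1   # LEA EAX, [EAX+EAX]

for x in (0, 1, 7, 123456):
    assert by_add(x) == by_shl(x) == by_lea(x) == x * 2
```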
asm multiply tricks
John,
Thanks for posting these results; they show some hardware differences from
the PIII 600 I am working on. They are interesting results for a number
of reasons: the timings look very much like the type of spacing that
you would get on a 486, but much faster with the 650 Duron.
The reg, [mem] compare is in line with 486 optimisation; the extra step
of loading the register before doing the comparison appears to be taking
the extra time of another instruction.
Your 1st algo with the extra jump is slightly slower but surprisingly
enough the SCAS instruction has performed a lot better than it would on
a later Intel processor.
Rough guess is that the Duron has a shorter pipeline and faster fetch on
memory operands but is slower on small instructions so its optimised code
type would favour shorter instruction paths and less expansion into small
instructions.
Regards,
[email protected]
------------------
Hutch
finally had some time to do some testing on an AMD Duron 650.
I loaded 99,999,999 bytes of a 100,000,000 byte array with non-zero values and a zero in the last byte. I then tested 4 different methods, the first 2 being the ones you described:
(1) cmp [eax-1], dh
(2) cmp dl, dh
(3) my prior post with an unconditional jump
(4) repne scasb
I looped through each test 100 times.
The results are surprising.
(1) 58.88 secs
(2) 65.72 secs
(3) 61.21 secs
(4) 63.22 secs
The code for the repne scasb follows; whilst not the fastest, it has the added safety factor of always terminating.
Code:
T3! = TIMER
! mov ebx, 100
r4:
! MOV eax, 0
! mov ecx, 100000001
! MOV edi, Sa???
! CLD
! repne scasb
! dec ebx
! jnz r4
T4! = TIMER
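As a rough model of what this benchmark measures, here is a Python sketch of the setup: a buffer of non-zero bytes with a single zero terminator at the end, scanned forward the way REPNE SCASB with AL = 0 does. The buffer size is scaled down from the 100,000,000 bytes used in the test so the sketch runs quickly.

```python
def make_buffer(size):
    # Non-zero fill with a zero byte in the last position,
    # matching the test data described above (smaller size).
    return b"\x01" * (size - 1) + b"\x00"

def scan_for_zero(buf):
    # What REPNE SCASB does: step forward comparing each byte
    # against AL (here 0), returning the terminator's index.
    i = 0
    while buf[i] != 0:
        i += 1
    return i

buf = make_buffer(1_000_000)
assert scan_for_zero(buf) == 999_999
```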
------------------
John,
I agree with you that the L1 cache is a lot of the reason for
variation between processor versions, but unless you wish to try
processor specific methods like strip mining, which vary
considerably from one processor to another, I don't think there
is a lot that can be done about it.
The question of register variables is one that does not bother me,
as I write a reasonable amount of code that does not use inline
assembler and generally the automatic allocation of registers to
variables works OK.
Having the capacity to turn register optimisation off does the job
fine where you need to code the entire function yourself in assembler.
The code you posted looks unusual in that it has an extra unconditional
jump, but the problem I have found where I have to do byte scanners
is that comparing 2 registers is almost exclusively faster than
comparing a register to a memory operand.
Code:
@@: inc eax           ; 1
    cmp [eax-1], dh   ;
    je @F             ; 1 when not taken
    inc eax           ; 1
    cmp [eax-1], dh   ;
    jne @B            ; 3 when jmp taken
@@:
Code:
Lp: ! cmp [eax], dl
    ! je Lend
    ! inc eax
    ! cmp [eax], dl
    ! je Lend
    ! inc eax
    ! jmp Lp
Lend:
which will generally make the loop slower.
I would be interested to see what you tested this loop on.
Regards,
[email protected]
------------------
Hutch
In general I agree with you, however I think the following construct would be even quicker.
Code:
cmp [eax], dl
inc eax
Code:
! mov dl, 0
! mov eax, startaddress
Lp: ! cmp [eax], dl
    ! je Lend
    ! inc eax
    ! cmp [eax], dl
    ! je Lend
    ! inc eax
    ! jmp Lp
Lend:
------------------
Tom,
I think you have a good point here, I own a K6 550 AMD as a backup box that
has the fast LOOP instruction that required a patch in win95b so that Windows
would run.
Pipeline technology started in PCs with the 486, dual pipelines in early
pentiums upwards and similar technology is being used by AMD in their later
processors, Athlon etc ..
The PII and PIII have very similar architecture so much of what works on one
works on the other OK. It is when you specifically target a narrow feature of
one processor that you pay a penalty on others. Writing XMM restricts you to
PIII and later Intel processors, MMX rules out early pentiums and 486
processors, so if you want your software to run on processors from 486 upwards,
you are basically stuck with using the mixed integer instructions and
standard floating point range.
With the example, LEA had a serious workover in the 486 so most of the usage
of LEA is valid from 486 up. Unrolling a loop is basic logic, do more in each
iteration of a loop and you reduce the overhead in many instances.
The problem is that reverting back to pre 486 code design does not help, the
technology for pre pipeline processors is well understood and it generally
meant choosing the shortest set of instructions to do the job, the string
instructions are a good example of this, they are shorter instructions and
they run faster on a 386 and lower but "who cares" when you are writing code
that requires the instruction range from 486 upwards.
Instructions like LOOP/LOOPNZ and JCXZ are leftover 386 technology that are a lot
slower on a 486 upwards than the RISC style replacements. The exceptions are
REP MOVSD and REP STOSD which seem to be well optimised on later pentium
processors.
You can often get smaller code by the pre 486 indexing techniques and often
it will run faster but there are some notable exceptions, the example I cited
above,
Code:
mov dl, [eax]   ; 1
inc eax         ; 1
cmp dl, dh      ; 1
Code:
inc eax         ; 1
cmp [eax-1], dh
The best you can do is stick to the most general set of characteristics.
Everyone is faced with this choice, from compiler writers to programmers
doing specific targeting in a program, and as the reference material is
much the same to all interested parties, you are faced with leaving the
old DOS stuff behind and modernising code unless you want it to run badly
on all post-486 processors.
Regards,
[email protected]
------------------
Is this discussion applicable to Pentium-class chips in general, or only to specific
models and, if so, which?
I've seen a lot of "optimizations" where, say, the new code worked twice as fast on a
Pentium II, half as fast on a Pentium III, and exactly the same on an AMD K6. This sort
of thing leaves me with serious questions about the value of trying to optimize at the
opcode level unless you have a very specific target machine in mind.
------------------
Tom Hanlin
PowerBASIC Staff
Bern,
From the literature I have seen, LEA EAX, [EAX+EAX] is internally simpler
than LEA EAX, [EAX*2] but I think that ADD EAX, EAX would be preferred as
it is generally faster.
When you get to the stage where you are clocking the main loop after you
have it all running properly, you can think in terms of spacing back to
back LEA instructions with some of the increments that may be performed
in the LEA.
Code:
LEA
---
ADD
---
INC
---
LEA
---
This works as long as you have enough other things to do in the loop.
Another trick if the code design allows it is to "unroll" the loop. This
means if you have a simple short loop that appears to be suffering from
loop overhead, if you double what is being done in the loop, you halve the
loop overhead.
This MASM code is a simple byte scanner to find a zero string terminator.
Code:
@@: mov dl, [eax]   ; 1
    inc eax         ; 1
    cmp dl, dh      ; 1
    jne @B          ; 3 when jmp taken
The 3 cycle cost of the taken jump is the main factor in how fast it runs.
Unrolling it by a factor of 2 reduces the loop overhead and it clocks faster.
Code:
@@: mov dl, [eax]   ; 1
    inc eax         ; 1
    cmp dl, dh      ; 1
    je @F           ; 1 when not taken
    mov dl, [eax]   ; 1
    inc eax         ; 1
    cmp dl, dh      ; 1
    jne @B          ; 3 when jmp taken
@@:
For each character compared in the simple loop, it takes theoretically 6
cycles, where in the unrolled loop you get 2 characters compared in 10
cycles, which is theoretically 5 cycles for each comparison.
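The cycle arithmetic generalises to any unroll factor. This Python sketch is a theoretical model only, using the per-instruction timings given in the post: 1 cycle each for mov/inc/cmp, 1 cycle for a not-taken je, 3 cycles for the taken jne.

```python
def cycles_per_char(unroll):
    # mov + inc + cmp cost 3 cycles per character scanned.
    body = unroll * 3
    # A not-taken je (1 cycle) sits between the unrolled copies.
    not_taken = unroll - 1
    # One taken jne (3 cycles) closes the loop each iteration.
    taken = 3
    return (body + not_taken + taken) / unroll

assert cycles_per_char(1) == 6.0   # the simple loop
assert cycles_per_char(2) == 5.0   # unrolled by a factor of 2
```

The model shows diminishing returns: the per-character cost approaches 4 cycles as the unroll factor grows, so most of the benefit comes from the first doubling.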
You can actually make this loop shorter but it does not run as fast,
Code:
@@: inc eax           ; 1
    cmp [eax-1], dh   ;
    je @F             ; 1 when not taken
    inc eax           ; 1
    cmp [eax-1], dh   ;
    jne @B            ; 3 when jmp taken
@@:
The memory operand comparison turns out slower than the approach of moving
the data at the address into a register first then doing the comparison.
The advantage when you are doing more complex things than a simple byte scanner
is that you have a longer loop where you have more options with instruction
re-ordering.
Regards,
[email protected]
------------------
Steve,
Thanks. Your explanation is clear.
Even with the stalls, a couple of back to back !lea instructions
still clock faster than a !mul (or !imul).
Any idea if there is any (performance) difference between:
!lea eax, [eax+eax]
vs.
!lea eax, [eax*2]
------------------
Bernard Ertl
Bern,
It's not an easy question to answer in a hurry, but the general outline is
that the pipeline in a later Intel processor is a 5 stage production line.
While any single instruction takes at best 5 cycles to go through the
pipeline, because instructions are sequenced one after another, the
completion rate is about 1 instruction per cycle in theory.
Dual pipelines theoretically allow 2 instructions to pass through in the time
it takes to process one, but there is a very complicated set of rules to get
this to happen reliably and it cannot always be done. Intel introduced by
stealth the idea of a preferred subset of instructions that pair better through
the pipeline, and this is more or less a "Reduced Instruction Set Computer"
concept, RISC.
Backwards compatibility allows all of the old instructions but some of them
are very slow in comparison to the small preferred instructions. AGI means
"Address Generation Interlock": effectively, a register that is in use
within the pipeline cannot be accessed by another instruction that is
entering the pipeline. The instruction that requires the register already
in the pipeline has to wait until the first one is retired, so it "STALLS"
until this happens.
This very seriously slows code down and while it just does not matter in many
cases, if you have a very small inner loop that repeatedly uses the same
register, there will be some delay on each iteration. LEA can be a culprit
because it can reference a register multiple times,
Code:
! lea eax, [eax+eax+1]
Agner Fog's optimisation manual covers this in detail:
http://www.pbq.com.au/home/hutch/agner.htm
I have produced the winhelp version for him as it is very good information
to have available when chasing speed where you need it.
Regards,
[email protected]
------------------
Steve,
Thanks a bunch for this post. I've been using this trick quite
a bit!
Could you elaborate a little on what an "AGI stall" is?
BTW, it is possible to hard code almost any multiple (other than
prime #'s > 5) using just one register.
Code:
' ! lea eax, [eax + eax]
' ! lea eax, [eax*2+eax]   ; x 6
' ! lea eax, [eax*4 + eax]
' ! lea eax, [eax+eax]     ; x 10
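The two chains above can be checked arithmetically. A Python sketch, modelling a single LEA as base + index*scale (the comments show the asm each call mirrors):

```python
def lea(base, index, scale):
    # What one LEA computes for an [index*scale + base] operand.
    return base + index * scale

def times6(x):
    x = lea(x, x, 1)       # lea eax, [eax+eax]    ; x 2
    return lea(x, x, 2)    # lea eax, [eax*2+eax]  ; x 3 -> x 6

def times10(x):
    x = lea(x, x, 4)       # lea eax, [eax*4+eax]  ; x 5
    return lea(x, x, 1)    # lea eax, [eax+eax]    ; x 2 -> x 10

assert times6(7) == 42
assert times10(7) == 70
```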
------------------
Bernard Ertl
Sorry, just playing... But interesting...
Code:
Loop of 200,000,000

loop overhead
3.02  3.02  3.02  3.02

l = l * 10
5.37 - 3.02 = 2.35
5.38 - 3.02 = 2.36
5.44 - 3.02 = 2.42
5.44 - 3.02 = 2.42

! mov eax, var
! lea ecx, [eax*2]
! lea eax, [eax*8+ecx]
4.34 - 3.02 = 1.32
4.23 - 3.02 = 1.21
4.23 - 3.02 = 1.21
4.29 - 3.02 = 1.27
------------------
Colin Schmidt & James Duffy, Praxis Enterprises, Canada
[email protected]
The old MUL instruction is very slow on modern pentiums, so if
you have to hard code a multiplication into an algorithm, there
is an alternative way using LEA which is much faster.
Code:
! mov eax, var

' ! lea eax, [eax+eax]     ; x 2
' ! lea eax, [eax*2+eax]   ; x 3
' ! lea eax, [eax*4]       ; x 4
' ! lea eax, [eax*4+eax]   ; x 5

' ! lea ecx, [eax*2]
' ! lea eax, [eax*4+ecx]   ; x 6

' ! lea ecx, [eax*2+eax]
' ! lea eax, [eax*4+ecx]   ; x 7

' ! lea eax, [eax*8]       ; x 8
' ! lea eax, [eax*8+eax]   ; x 9

' ! lea ecx, [eax*2]
' ! lea eax, [eax*8+ecx]   ; x 10
Even where two LEA instructions are needed, it is still a lot faster than MUL.
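The arithmetic of each sequence can be verified directly. A Python sketch modelling a single LEA as base + index*scale, with one branch per sequence in the table above (register names in the comments follow the asm):

```python
def lea(base, index, scale):
    # What one LEA computes for an [index*scale + base] operand.
    return base + index * scale

def mul_by(eax, n):
    # Each branch mirrors one of the LEA sequences in the table.
    if n == 2:  return lea(eax, eax, 1)          # [eax+eax]
    if n == 3:  return lea(eax, eax, 2)          # [eax*2+eax]
    if n == 4:  return lea(0, eax, 4)            # [eax*4]
    if n == 5:  return lea(eax, eax, 4)          # [eax*4+eax]
    if n == 6:
        ecx = lea(0, eax, 2)                     # [eax*2]
        return lea(ecx, eax, 4)                  # [eax*4+ecx]
    if n == 7:
        ecx = lea(eax, eax, 2)                   # [eax*2+eax]
        return lea(ecx, eax, 4)                  # [eax*4+ecx]
    if n == 8:  return lea(0, eax, 8)            # [eax*8]
    if n == 9:  return lea(eax, eax, 8)          # [eax*8+eax]
    if n == 10:
        ecx = lea(0, eax, 2)                     # [eax*2]
        return lea(ecx, eax, 8)                  # [eax*8+ecx]

assert all(mul_by(13, n) == 13 * n for n in range(2, 11))
```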
Regards,
[email protected]
------------------