Somewhere I have seen code for doing memory copy using floating point
instructions, but they use the same registers as MMX so there does not seem
to be any advantage.
I can't lay my hands on the test piece at the moment but it is an unrolled
loop of this type:
Code:
    ! mov esi, src
    ! mov edi, dst
  mmSt:
    ! movq mm(0), [esi]
    ! movq mm(1), [esi + 8]
    ! movq mm(2), [esi + 16]
    ! movq mm(3), [esi + 24]
    ! movq mm(4), [esi + 32]
    ! movq mm(5), [esi + 40]
    ! movq mm(6), [esi + 48]
    ! movq mm(7), [esi + 56]
    ! movq [edi], mm(0)
    ! movq [edi + 8], mm(1)
    ! movq [edi + 16], mm(2)
    ! movq [edi + 24], mm(3)
    ! movq [edi + 32], mm(4)
    ! movq [edi + 40], mm(5)
    ! movq [edi + 48], mm(6)
    ! movq [edi + 56], mm(7)
It ran about 3 - 4% slower than REP MOVSD, and that was after I re-ordered
the instructions to maximise its loop speed; it has no pairing problems and no
stalls.
From all of the technical data I have seen and from my own testing, REP MOVSD
is well optimised in the PII - PIII processor range. I have also run into
technical data saying that the speed of the physical memory is the limiting
factor in memory copy, and my testing appears to bear this out: all of the
algorithms I have tested come within about 5% of each other, even though the
MMX version should in theory be a lot faster.
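For anyone who wants to check the "within about 5%" figure on their own machine, here is a rough timing sketch. It is written in C rather than assembler so it builds with any compiler. The names `copy_fn`, `copy_memcpy`, `copy_dwords`, `time_copy` and `compare_copies` are made up for this example, and using `memcpy` as a stand-in for REP MOVSD is an assumption, since a given C library may implement it quite differently. `copy_dwords` also assumes both pointers are 4 byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Common signature for the routines being compared (hypothetical name). */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Library copy, used here as a stand-in for REP MOVSD (an assumption). */
void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy 16 bytes per pass as four 32 bit words, then finish byte by byte.
   Assumes dst and src are both 4 byte aligned. */
void copy_dwords(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;

    for (; n >= 16; n -= 16, d += 4, s += 4) {
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
    }

    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (n > 0) {             /* 0 to 15 leftover bytes, never one extra */
        *db++ = *sb++;
        n--;
    }
}

/* Run fn reps times over the same buffers and return CPU seconds used. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_t stop = clock();

    assert(memcmp(dst, src, n) == 0);   /* the copy must be exact */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Time both routines on buffers of n bytes. Returns 0 on success, -1 if
   the buffers could not be allocated. */
int compare_copies(size_t n, int reps)
{
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }

    memset(src, 0xA5, n);
    memset(dst, 0, n);          /* touch the pages before timing starts */
    double t_lib = time_copy(copy_memcpy, dst, src, n, reps);

    memset(dst, 0, n);
    double t_dw = time_copy(copy_dwords, dst, src, n, reps);

    printf("memcpy: %.3f s   dword loop: %.3f s\n", t_lib, t_dw);

    free(src);
    free(dst);
    return 0;
}
```

Calling something like `compare_copies(32u * 1024u * 1024u, 20)` from `main` uses buffers far larger than the cache, which is the case where memory speed should dominate. An optimising compiler may recognise the word loop and replace it with a library call, so it is worth checking the generated code before trusting the numbers.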
Regards,
Code:
; #########################################################################

srCopy proc src :DWORD, dst :DWORD, ln :DWORD

    LOCAL cntr :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst

    cmp ln, 16
    jb Tail                 ; fewer than 16 bytes, go straight to the byte copy

    mov eax, ln
    shr eax, 4              ; number of 16 byte blocks
    mov cntr, eax

  @@:
    mov eax, [esi]
    mov [edi], eax
    mov ebx, [esi+4]
    mov [edi+4], ebx
    mov ecx, [esi+8]
    mov [edi+8], ecx
    mov edx, [esi+12]
    mov [edi+12], edx
    add esi, 16
    add edi, 16
    dec cntr
    jnz @B

    and ln, 15              ; bytes left over after the 16 byte blocks

  Tail:
    cmp ln, 0
    je Done                 ; nothing left, so do not copy an extra byte

  ShortLoop:
    mov al, [esi]
    inc esi
    mov [edi], al
    inc edi
    dec ln
    jnz ShortLoop           ; jns here would copy ln + 1 bytes

  Done:
    pop edi
    pop esi
    pop ebx

    ret

srCopy endp

; #########################################################################

[This message has been edited by Steve Hutchesson (edited March 26, 2001).]