array for a byte frequency counter which involved using the ascii
value of each byte as the array index and incrementing the correct
array element as each byte was read.
What I was after was a normal PowerBASIC LONG array at the output
end that had the frequency count of all 256 characters loaded in
the array.
I originally wrote a byte reader in assembler and in the middle of
the loop, I used "var&(index) = var&(index) + 1" which did the job
OK but was not really fast enough for the task that I have to do so
after making enquiries, I was stuck with having to code a direct
access algorithm in assembler to do the job.
When I had a reasonable algo running, I passed it by Tom Hanlin
because of Tom's very extensive experience in library and
algorithm design and Tom sent me back a different design that was an
optimisation based on a different criterion, it was a size based
optimisation that was a lot easier to read than the paired pentium
design that I had worked on.
As I have both versions working, I thought it reasonable to post both of
them as there have been some questions raised in the past about working
with arrays in assembler in PowerBASIC.
The main advantage is that PowerBASIC arrays are linear addressed in
memory which makes direct access both possible and fast.
The following is the basic form that Tom suggested,
Code:
redim cnt&(0 to 255) lpArr = VarPtr(cnt&(0)) ' array data address src = StrPtr(a$) ' source string address ! mov ecx, fl ; file length in ecx ! mov esi, src ; address of string in esi ! xor eax,eax ; clear eax ! cld ; set load direction forward cntSt2: ! lodsb ; get byte and move esi forward by one byte ! mov edx, lpArr ; array address ! inc dword ptr [edx+eax*4] ; increment value at array element eax ! loop cntSt2 ; keep going for length of file
using, it reads the 90 odd meg test file in 2377 ms which is reasonably
fast for a size optimised algorithm.
I could not resist having a play with it and replacing "lodsb" and the
slow instruction "loop" as follows.
Code:
! mov ecx, fl ; 1 file length in ecx ! mov esi, src ; 1 address of string in esi ! add ecx, esi ! xor eax,eax ; clear eax cntSt2: ! mov al, [esi] ! inc esi ! mov edx, lpArr ; 1 array address ! inc dword ptr [edx+eax*4] ; increment value at array element eax ! cmp esi, ecx ! jne cntSt2
is about a half a second faster on the 90 odd meg test file at 1800 ms. It
gets the loop count by adding the address of the source to the file length
so that the exit condition can be tested with the "! cmp esi, ecx" line.
The version that I developed before Tom passed me his size optimised
design is both larger and only partially intelligible and it is a speed
optimised design based on instruction pairing for pentium processors.
Code:
redim cnt&(0 to 255) lpArr = VarPtr(cnt&(0)) ' array data address src = StrPtr(a$) ' source string address ! mov ecx, fl ; file length in ecx ! mov esi, src ; address of string in esi ! add ecx, esi ; put finish address in ecx cntSt: ! mov al, [esi] ; get byte ! movzx ebx, al ; zero extend al into ebx ! mov edx, lpArr ; array address ! lea edx, [edx+ebx*4] ; address of edx + ebx * 4 ! mov ebx, [edx] ; put value at address into ebx ' -------------------------------------------------------------- ! inc ebx ' -------------------------------------------------------------- ! mov [edx], ebx ; put modified value back at address ! add esi, 1 ! cmp esi, ecx ! jne cntSt
side of the "! inc ebx" line so that different operations can be performed
on the accessed array element. With this algo, if you need to perform
high level operations in this gate, you will need to preserve both ECX & EDX,
EAX does not need it and the called functions have the responsibility of
preserving ESI and EBX. If you remove the core of the algo from the byte
reader, you need to put the array index into EBX.
Because of the considerations of instruction pairing, instruction order has
been changed where necessary. The instruction that did the damage in terms
of speed was "LEA" as it has the capacity to do the addition for the offset
calculation which save a following "ADD" instruction and it is one of the
fast opcodes that pairs well.
This version is substantially faster than the size optimised ones running
the 90 odd meg test file in 1120 ms.
With special thanks to Tom Hanlin for his contribution of an intelligible
size optimised algorithm, a few small pieces of pure PowerBASIC for your
pleasure.
Regards,
[email protected]