Announcement

Collapse
No announcement yet.

Array manipulation in PB inline asm

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Array manipulation in PB inline asm

    I had a need recently to be able to increment the values in a LONG
    array for a byte frequency counter which involved using the ascii
    value of each byte as the array index and incrementing the correct
    array element as each byte was read.

    What I was after was a normal PowerBASIC LONG array at the output
    end that had the frequency count of all 256 characters loaded in
    the array.

    I originally wrote a byte reader in assembler and in the middle of
    the loop, I used "var&(index) = var&(index) + 1" which did the job
    OK but was not really fast enough for the task that I have to do so
    after making enquiries, I was stuck with having to code a direct
    access algorithm in assembler to do the job.

    When I had a reasonable algo running, I passed it by Tom Hanlin
    because of Tom's very extensive experience in library and
    algorithm design and Tom sent me back a different design that was an
    optimisation based on a different criterion, it was a size based
    optimisation that was a lot easier to read than the paired pentium
    design that I had worked on.

    As I have both versions working, I thought it reasonable to post both of
    them as there have been some questions raised in the past about working
    with arrays in assembler in PowerBASIC.

    The main advantage is that PowerBASIC arrays are linear addressed in
    memory which makes direct access both possible and fast.

    The following is the basic form that Tom suggested,

    Code:
               redim cnt&(0 to 255)
      
               lpArr = VarPtr(cnt&(0)) ' array data address
      
               src = StrPtr(a$)        ' source string address
      
               ! mov ecx, fl           ; file length in ecx
               ! mov esi, src          ; address of string in esi
               ! xor eax,eax           ; clear eax
               ! cld                   ; set load direction forward
           cntSt2:
               ! lodsb                 ; get byte and move esi forward by one byte
               ! mov edx, lpArr        ; array address
               ! inc dword ptr [edx+eax*4]    ; increment value at array element eax
               ! loop cntSt2           ; keep going for length of file
    It is clear to read, small in its implementation and on the box I am
    using, it reads the 90 odd meg test file in 2377 ms which is reasonably
    fast for a size optimised algorithm.

    I could not resist having a play with it and replacing "lodsb" and the
    slow instruction "loop" as follows.

    Code:
               ! mov ecx, fl           ; 1 file length in ecx
               ! mov esi, src          ; 1 address of string in esi
               ! add ecx, esi
               ! xor eax,eax           ; clear eax
           cntSt2:
               ! mov al, [esi]
               ! inc esi
               ! mov edx, lpArr        ; 1 array address
               ! inc dword ptr [edx+eax*4]    ; increment value at array element eax
               ! cmp esi, ecx
               ! jne cntSt2
    This version is not as clear to read and is slightly larger as well but it
    is about a half a second faster on the 90 odd meg test file at 1800 ms. It
    gets the loop count by adding the address of the source to the file length
    so that the exit condition can be tested with the "! cmp esi, ecx" line.

    The version that I developed before Tom passed me his size optimised
    design is both larger and only partially intelligible and it is a speed
    optimised design based on instruction pairing for pentium processors.

    Code:
                redim cnt&(0 to 255)
      
                lpArr = VarPtr(cnt&(0)) ' array data address
      
                src = StrPtr(a$)        ' source string address
      
                ! mov ecx, fl           ; file length in ecx
                ! mov esi, src          ; address of string in esi
                ! add ecx, esi          ; put finish address in ecx
              cntSt:
                ! mov al, [esi]         ; get byte
                ! movzx ebx, al         ; zero extend al into ebx
                ! mov edx, lpArr        ; array address
                ! lea edx, [edx+ebx*4]  ; address of edx + ebx * 4
                ! mov ebx, [edx]        ; put value at address into ebx
              ' --------------------------------------------------------------  
                ! inc ebx
              ' --------------------------------------------------------------
                ! mov [edx], ebx        ; put modified value back at address
                ! add esi, 1
                ! cmp esi, ecx
                ! jne cntSt
    This algorithm is written with an eye to further reuse with the gate either
    side of the "! inc ebx" line so that different operations can be performed
    on the accessed array element. With this algo, if you need to perform
    high level operations in this gate, you will need to preserve both ECX & EDX,
    EAX does not need it and the called functions have the responsibility of
    preserving ESI and EBX. If you remove the core of the algo from the byte
    reader, you need to put the array index into EBX.

    Because of the considerations of instruction pairing, instruction order has
    been changed where necessary. The instruction that did the damage in terms
    of speed was "LEA" as it has the capacity to do the addition for the offset
    calculation which save a following "ADD" instruction and it is one of the
    fast opcodes that pairs well.

    This version is substantially faster than the size optimised ones running
    the 90 odd meg test file in 1120 ms.

    With special thanks to Tom Hanlin for his contribution of an intelligible
    size optimised algorithm, a few small pieces of pure PowerBASIC for your
    pleasure.

    Regards,

    [email protected]
    hutch at movsd dot com
    The MASM Forum

    www.masm32.com
Working...
X