Announcement

Collapse
No announcement yet.

Algorithm Challenge

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #21
    Kevin,

    Are there any specific string lengths you're checking?

    Does the solution has to allow any string length?

    I'm still pondering how to go faster ...

    Comment


    • #22
      Gary,

      I am comparing arrays of long integers...in one case 16 elements long many times over, and in the second case the string is 300 elements long with less frequency.

      I don't see any way to improve on our code. Your code worked and you did it fast...i gave myself a headache writing mine, but it works and is fiarly fast...but in this application (vision processing) speed matters.

      Threading would be useful here to break the task up, but that always gives me a bigger headache...but may be unavoidable.

      Comment


      • #23
        The same basic mistakes are made over and over again in optimising code.

        Do not append things to strings. It's very slow.
        In Gary's program above, over 90% of the time is spent doing this:
        Code:
        Matches$ = Matches$ + " " + STR$(i+sLength)
        which competely masks the time used doing the comparisons.

        When you have many nested loop you really need to look closely at what goes on in the innermost loop as it is usually executed far more than another code and it's therefore more important to optimise that bit.
        Again picking on Gary's code for convenience, the inner loop is
        Code:
                    For j = i To sLength
                        pATemp = pStartA - i + j
                        pBTemp = pStartB + j - 1
                        If @pATemp <> @pBTemp Then
                            iNotMatchFlag = 1
                            Exit For
                        End If
                    Next j
        every time around the loop there are unneeded assignments and calculations. They can all be done away with by rearranging things as follows:
        Code:
                    pATemp = pStartA - i
                    pBTemp = pStartB - 1
                    FOR j = i TO sLength
                        IF @pATemp[j] <> @pBTemp[j] THEN
                            iNotMatchFlag = 1
                            EXIT FOR
                        END IF
                    NEXT j
        Look at the inner loop now, there's very little to it, a compare and 1 assignment.
        Previously there was an additional 2 assignments, 2 additions and 2 subtractions. That's 3 times the work

        You can also gain a bit by making the loop count down instead of up.

        Another bonus is that when the inner loop is simplified it's then easier to convert it to ASM for an additional boost in speed.


        Also, be careful how you time it. The following is from Microsoft on GetTickCount:
        The resolution of the GetTickCount function is limited to the resolution of the system timer, which is typically in the range of 10 milliseconds to 16 milliseconds
        Paul.

        Comment


        • #24
          It's okay Paul - Rule #1 of posting is:

          "By posting in these forums, you give the right to any/all participants to discuss, dissect or otherwise expose the faults of your posted materials."
          Posting in these forums is not for the faint of heart or thin of skin.

          In the end, it's what we learn that counts.

          Comment


          • #25
            No Sweat Gary...I appreciate your efforts...since my scenario uses long ints and does not exit the loops when a match occurs the comments from Paul are not relevant for what I need.

            Comment


            • #26
              Oh, wait ... I claim "Good Samaritan" protection. The law in Texas says that if you stop to help someone and they accept your help, then you cannot be held liable or sued for the results of your actions.



              Does Texas law hold for the wild, wild Internet?

              Comment


              • #27
                I wondered if it was possible to get much speed improvement with asm in just the basic long array comparison, which is the innermost loop. So I did an mmx version to find out. I didn't think it would be much faster, but to my surprise it is well over twice as fast on my machine. I got ~ 2.7x on avg. Maybe it would be worth trying to incorporate it. This is CC5 ver.

                Changed: removed mmx pand mask
                Code:
                #COMPILE EXE
                #DIM ALL
                #REGISTER NONE
                                           
                %ARRAYSIZE = 300
                %COUNTCONST = %ARRAYSIZE \ 2 - 1
                
                FUNCTION PBMAIN () AS LONG
                    LOCAL a1ptr, a2ptr, arrTot AS LONG, t, t2, tTot, t2tot, addMask AS QUAD
                    LOCAL asmTot, clrS AS LONG
                    REGISTER countr AS LONG
                
                    DIM arr1(%ARRAYSIZE) AS LONG
                    DIM arr2(%ARRAYSIZE) AS LONG
                    a1ptr = VARPTR(arr1(1))  'we'll not use the zero element
                    a2ptr = VARPTR(arr2(1))
                
                   DO
                    FOR countr = 1 TO %ARRAYSIZE
                       arr1(countr) = RND(2243, 2246)
                       arr2(countr) = RND(2243, 2246)
                    NEXT
                
                    arrTot = 0
                    TIX t2
                    FOR countr = 1 TO %ARRAYSIZE
                       IF arr1(countr) = arr2(countr) THEN INCR arrTot
                    NEXT
                    TIX END t2
                    t2tot += t2
                
                    TIX t
                    !mov countr, %COUNTCONST     ;1/2 %ARRAYSIZE because we compare quads rather than longs
                          
                    !mov eax, a1ptr              ;arr1
                    !mov ecx, a2ptr              ;arr2
                    !mov edx, countr
                    !pxor mm2, mm2               ;need to clear mmx registers from previous loops
                   top:
                    !movq    mm0, [eax+edx*8]    ;2 arr1 elements
                    !movq    mm1, [ecx+edx*8]    ;2 arr2 elements
                    !pcmpeqd mm1, mm0            ;compare if equal
                    !paddd   mm2, mm1            ;add to running total register
                    !sub edx, 1
                    !jae short top
                
                    !movq    mm0, mm2            ;copy mm2
                    !psrlq   mm0, 32             ;prepare to add hi and lo dwords of mm2 to get total matches
                    !paddd   mm0, mm2            ;do the add
                    !movd    asmTot, mm0         ;save in asmTot
                    !neg     asmTot
                    !emms                        ;empty mmx machine state--clear fpu
                
                    TIX END t
                    tTot += t
                    IF (clrS AND 255) = 0 THEN ? STR$(t2tot / tTot, 2);
                    IF asmTot <> arrTot THEN ? "NOT EQUAL!": WAITKEY$
                    INCR clrS
                   LOOP
                                
                END FUNCTION
                Last edited by John Gleason; 13 Nov 2009, 11:11 AM. Reason: removed mmx pand mask

                Comment


                • #28
                  Hi John,

                  Thanks for the ASM. Not being an ASM coder, I have been unable to integrate your code into mine. You'll note in my inner loop (The first DO loop) that I compare the yth element to the zth element. Your code does not account for the decrementing of the zth element. I tried putting it into a separate register and decrementing it along with y (in your code "y" is equivalent to the countr variable I think), but I am getting an infinite loop...

                  I think ASM would make major improvements in my code because this is a big bottleneck in the image processing...

                  Kevin

                  Comment


                  • #29
                    I just used strings in my example because it was a bit easier to show what I needed using strings, since the core algorithm remains the same in either case
                    No, that's not correct at all.

                    Optimization is application-specific; ergo your 'core algorithm' may not - and probably will not - be the same for different applications. And considering your tool of choice - the PB compiler - contains its own optimizations for LONG integers, your choice of implenentations will surely vary in the strings-v-integers scenario.

                    What works well in one application does not necessarily work well in another.

                    And I do believe we learned only in your last post this is some kind of image-processing application; I have to believe 'somewhere out there' are algorithms specifically designed to assist with image-processing... perhaps some approach to do exactly what you really want to do but is not based on "compare two strings in a sliding fashion."

                    MCM
                    Michael Mattias
                    Tal Systems (retired)
                    Port Washington WI USA
                    [email protected]
                    http://www.talsystems.com

                    Comment


                    • #30
                      John ASM

                      Hi John,

                      Below is my attmept to integrate your code:

                      Code:
                          overlap1 = %rawInpLength: overlap2 = 1
                          z = %rawInpLength
                          y = 1
                          DO
                               DO WHILE z >= overlap1
                                    'DO WHILE y >= 1
                                        '''''Do comparison here
                                        
                                         !mov y, %COUNTCONST     ;1/2 %ARRAYSIZE because we compare quads rather than longs
                                         !mov dword ptr addMask[0], 1 ;&h0000000100000001 in addMask to mask a max &hffffffffffffffff
                                         !mov dword ptr addMask[4], 1 ;            "               "                   "
                      
                                         !mov eax, a1ptr              ;arr1
                                         !mov ecx, a2ptr              ;arr2
                                         !mov edx, y
                                         !mov ebx, z
                                         !movq mm3, addMask           ;pre-load
                                         !pxor mm2, mm2               ;need to clear mmx registers from previous loops
                                        top:
                                         !movq    mm0, [eax+edx*8]    ;2 arr1 elements
                                         !movq    mm1, [ecx+ebx*8]    ;2 arr2 elements
                                         !pcmpeqd mm1, mm0            ;compare if equal
                                         !pand    mm1, mm3            ;mask
                                         !paddd   mm2, mm1            ;add to running total register
                                         !sub ebx, 1
                                         !sub edx, 1
                                         !jae short top
                      
                                         !movq    mm0, mm2            ;copy mm2
                                         !psrlq   mm0, 32             ;prepare to add hi and lo dwords of mm2 to get total matches
                                         !paddd   mm0, mm2            ;do the add
                                         !movd    counter, mm0         ;save in asmTot
                                         !emms                        ;empty mmx machine state--clear fpu
                                        
                      
                                        'IF temp1(z) = temp2(y) THEN INCR counter
                                        'z = z - 1
                                        'y = y - 1
                                    'WEND
                               WEND
                               DECR overlap1
                               INCR overlap2
                               z = %rawInpLength
                               y = overlap2
                      
                          LOOP WHILE overlap2 <= %rawInpLength-1

                      Comment


                      • #31

                        Optimization is application-specific; ergo your 'core algorithm' may not - and probably will not - be the same for different applications. And considering your tool of choice - the PB compiler - contains its own optimizations for LONG integers, your choice of implenentations will surely vary in the strings-v-integers scenario. What works well in one application does not necessarily work well in another.
                        Uh...sure. Thanks for the programming 101 lesson. But in this case it was the core algorithm (the looping mechanisms) I cared about...not the particular efficiency of comparing strings or integers. Unless someone would generously provide ASM code to optimize that particular comparison, as John has.

                        And I do believe we learned only in your last post this is some kind of image-processing application; I have to believe 'somewhere out there' are algorithms specifically designed to assist with image-processing... perhaps some approach to do exactly what you really want to do but is not based on "compare two strings in a sliding fashion."
                        There are thousands of image processing algorithms Michael. The fact that this little bit of code relates to an overall image processing program i am writing does not mean that it relates to image processing "at-large". This code has nothing specific to do with image processing.

                        Comment


                        • #32
                          I'll look at your version Kevin, and see if I can make it go. Yes there is that problem of odd # comparisons, say, comparing a 101 overlap. The mmx code requires an even # of comparisons, so it may need 2 sections to alternate between.

                          On the plus side, I removed the mask "pand" command because it makes no difference if you add -1 to itself in a loop, or +1, as long as you just change the minus sign to a + when you're done! That sped it up like 20%. I'll go back and change the original post to the new code.
                          Code:
                          'new code removing pand mask:
                              !mov countr, %COUNTCONST     ;1/2 %ARRAYSIZE because we compare quads rather than longs
                                    
                              !mov eax, a1ptr              ;arr1
                              !mov ecx, a2ptr              ;arr2
                              !mov edx, countr
                              !pxor mm2, mm2               ;need to clear mmx registers from previous loops
                             top:
                              !movq    mm0, [eax+edx*8]    ;2 arr1 elements
                              !movq    mm1, [ecx+edx*8]    ;2 arr2 elements
                              !pcmpeqd mm1, mm0            ;compare if equal
                              !paddd   mm2, mm1            ;add to running total register
                              !sub edx, 1
                              !jae short top
                          
                              !movq    mm0, mm2            ;copy mm2
                              !psrlq   mm0, 32             ;prepare to add hi and lo dwords of mm2 to get total matches
                              !paddd   mm0, mm2            ;do the add
                              !movd    asmTot, mm0         ;save in asmTot
                              !neg     asmTot
                              !emms                        ;empty mmx machine state--clear fpu

                          Comment


                          • #33
                            Kevin,
                            I agree with Michael. It's far better to state the actual problem rather than make up a similar sounding one. A lot of this thread, and the time of those that contributed, went into trying to figure out exactly what was needed and to optimise the strings out of it when there weren't really any strings involved. A clearer statement of the problem would have helped find a solution more quickly.

                            John,
                            to my surprise it is well over twice as fast on my machine.
                            Why should you be surprised at ASM being faster? It usually is.

                            I haven't tried your code but:
                            1) add #ALIGN 16 just ahead of the top: label. Repeated branch targets should be aligned on a 16 byte boundary for speed
                            2) your loop is very short, it would probably benefit from being unrolled a bit
                            3) MMX instructions like to be done in blocks. If possible, read more than 1 consecutive value into the MMX registers

                            Paul.

                            Comment


                            • #34
                              John,

                              To make things easier, i have put the full code below. The first section is my original algorithm. The second is my attempt to integrate your code in the inner loop. This is running in PB Win 9.

                              Code:
                              #COMPILE EXE
                              
                              
                              %rawInpLength = 30000
                              
                              %ARRAYSIZE = 30000
                              %COUNTCONST = %ARRAYSIZE \ 2 - 1
                              
                              FUNCTION PBMAIN () AS LONG
                              
                                  REGISTER x AS LONG, y AS LONG, z AS LONG
                                  DIM overlap1 AS LONG, overlap2 AS LONG, counter AS LONG
                              
                                  counter = 0
                              
                                  OPEN "speedtest.log" FOR OUTPUT AS #10
                              
                                  DIM temp1(%rawInpLength) AS STRING
                                  DIM temp2(%rawInpLength) AS STRING
                              
                                  FOR x = 1 TO %rawInpLength
                                       temp1(x) = CHR$(RND(65,90))
                                       temp2(x) = CHR$(RND(65,90))
                                  NEXT x
                              
                                  PRINT#10,"test kevin start:";TIMER
                              
                                  overlap1 = %rawInpLength: overlap2 = 1
                                  z = %rawInpLength
                                  y = 1
                                  DO
                                       DO WHILE z >= overlap1
                                            DO WHILE y >= 1
                                                '''''Do comparison here
                              
                                                IF temp1(z) = temp2(y) THEN INCR counter
                                                z = z - 1
                                                y = y - 1
                                            WEND
                                       WEND
                                       DECR overlap1
                                       INCR overlap2
                                       z = %rawInpLength
                                       y = overlap2
                              
                                  LOOP WHILE overlap2 <= %rawInpLength-1
                              
                                    'this section then continues the sliding forward from where the last code finished
                                  FOR z = 0 TO %rawInpLength-1
                              
                                       FOR y = 1 TO %rawInpLength - z
                                           IF temp1(y) = temp2(y+z) THEN INCR counter
                                       NEXT y
                                  NEXT z
                              
                                  PRINT#10,"Kevin original finish:";TIMER, counter
                              
                              
                                  LOCAL a1ptr, a2ptr, arrTot AS LONG, t, t2, tTot, t2tot, addMask AS QUAD
                                  LOCAL asmTot, clrS AS LONG
                              
                                  a1ptr = VARPTR(temp1(1))  'we'll not use the zero element
                                  a2ptr = VARPTR(temp2(1))
                                  
                                  PRINT#10,"test kevin john start:";TIMER
                                  
                                  counter = 0
                              
                                  overlap1 = %rawInpLength: overlap2 = 1
                                  z = %rawInpLength
                                  y = 1
                                  DO
                                       DO WHILE z >= overlap1
                                            'DO WHILE y >= 1
                                                '''''Do comparison here
                                                
                                                 !mov y, %COUNTCONST     ;1/2 %ARRAYSIZE because we compare quads rather than longs
                                                 !mov dword ptr addMask[0], 1 ;&h0000000100000001 in addMask to mask a max &hffffffffffffffff
                                                 !mov dword ptr addMask[4], 1 ;            "               "                   "
                              
                                                 !mov eax, a1ptr              ;arr1
                                                 !mov ecx, a2ptr              ;arr2
                                                 !mov edx, y
                                                 !mov ebx, z
                                                 !movq mm3, addMask           ;pre-load
                                                 !pxor mm2, mm2               ;need to clear mmx registers from previous loops
                                                top:
                                                 !movq    mm0, [eax+edx*8]    ;2 arr1 elements
                                                 !movq    mm1, [ecx+ebx*8]    ;2 arr2 elements
                                                 !pcmpeqd mm1, mm0            ;compare if equal
                                                 !pand    mm1, mm3            ;mask
                                                 !paddd   mm2, mm1            ;add to running total register
                                                 !sub ebx, 1
                                                 !sub edx, 1
                                                 !jae short top
                              
                                                 !movq    mm0, mm2            ;copy mm2
                                                 !psrlq   mm0, 32             ;prepare to add hi and lo dwords of mm2 to get total matches
                                                 !paddd   mm0, mm2            ;do the add
                                                 !movd    counter, mm0         ;save in asmTot
                                                 !emms                        ;empty mmx machine state--clear fpu
                                                
                              
                                                'IF temp1(z) = temp2(y) THEN INCR counter
                                                'z = z - 1
                                                'y = y - 1
                                            'WEND
                                       WEND
                                       DECR overlap1
                                       INCR overlap2
                                       z = %rawInpLength
                                       y = overlap2
                              
                                  LOOP WHILE overlap2 <= %rawInpLength-1
                              
                                    'this section then continues the sliding forward from where the last code finished
                                  FOR z = 0 TO %rawInpLength-1
                              
                                       FOR y = 1 TO %rawInpLength - z
                                           IF temp1(y) = temp2(y+z) THEN INCR counter
                                       NEXT y
                                  NEXT z
                              
                                  PRINT#10,"test kevin john finish:";TIMER
                              
                              
                              
                              END FUNCTION

                              Comment


                              • #35
                                Given the utility of ASM for speed on many common tasks, it seems that a library of ASM routines to perform common tasks would be of great use to the general PB community, especially those not well versed in ASM. Thing like doing string\numeric comparison within an array or within a pair of strings given an index would be useful. I know many of these things are littered around the boards here, but an organized library would be better.

                                In order to "walk the talk", i will be contributing an image processing library soon that has many image processing algorithms implemented. If you are familiar with Open CV, this has a very small subset of the features of openCV, but could be expanded endlessly. This has great utility for robotics and vision applications. this is not in ASM, but the idea of a library of many common application specific features in one DLL is the same.

                                For one cotnribution to such a library, i took one of Pauls ASM routines and with slight modification was able to modify it for copying rgbtriple arrays...i use it *alot*. So I have deep gratitude to Paul among others here.

                                Code:
                                FUNCTION FastCopy2DRGBTripleArray(Arr1() AS rgbtriple, Arr2() AS rgbtriple) EXPORT AS LONG
                                
                                    DIM src AS LONG, dst AS LONG, ln AS LONG
                                
                                      src = VARPTR(Arr1(1,1))
                                      dst = VARPTR(Arr2(1,1))
                                      ln = UBOUND(Arr1,1) * UBOUND(Arr1,2) * 3
                                
                                      ! cld
                                
                                      ! mov esi, src
                                      ! mov edi, dst
                                      ! mov ecx, ln
                                
                                      ! shr ecx, 2
                                      ! rep movsd
                                
                                      ! mov ecx, ln
                                      ! and ecx, 3
                                      ! rep movsb
                                
                                END FUNCTION

                                Comment


                                • #36
                                  >>to my surprise it is well over twice as fast on my machine.
                                  > Why should you be surprised at ASM being faster? It usually is.

                                  So is better BASIC.
                                  Michael Mattias
                                  Tal Systems (retired)
                                  Port Washington WI USA
                                  [email protected]
                                  http://www.talsystems.com

                                  Comment


                                  • #37
                                    Kevin, I realized I wasn't testing "apples with apples" here, so I converted your code to LONG's and tried to simplify it too. Does this look correct as it relates to your algorithm using LONG's?
                                    Code:
                                    #COMPILE EXE
                                    #REGISTER NONE
                                    
                                    %rawInpLength = 10000
                                    
                                    %ARRAYSIZE = %rawInpLength
                                    %COUNTCONST = %ARRAYSIZE \ 2 - 1
                                    
                                    FUNCTION PBMAIN () AS LONG
                                    
                                        LOCAL x AS LONG, y AS LONG, z AS LONG
                                        DIM overlap1 AS LONG, overlap2 AS LONG, counter AS LONG
                                        LOCAL zTemp AS LONG, tq, counter2 AS QUAD
                                    
                                        counter = 0
                                    
                                        OPEN "speedtest.log" FOR OUTPUT AS #10
                                    
                                        DIM temp1(%rawInpLength) AS LONG'STRING
                                        DIM temp2(%rawInpLength) AS LONG'STRING
                                    
                                        FOR x = 1 TO %rawInpLength
                                             temp1(x) = RND(65,90)
                                             temp2(x) = RND(65,90)
                                    '         temp1(x) = CHR$(RND(65,90))
                                    '         temp2(x) = CHR$(RND(65,90))
                                        NEXT x
                                    
                                    '    PRINT#10,"test kevin start:";TIMER
                                    
                                        overlap1 = %rawInpLength
                                        overlap2 = 1
                                    '-------------------------------------------------------------------------------------------------
                                    '-------------------------------------------------------------------------------------------------
                                        FOR overlap2 = 1 TO %rawInpLength-1
                                           z = %rawInpLength
                                           FOR y = overlap2 TO 1 STEP -1
                                              '''''Do comparison here
                                              IF temp1(z) = temp2(y) THEN INCR counter
                                              DECR z
                                           NEXT
                                        NEXT
                                    '-------------------------------------------------------------------------------------------------
                                    '-------------------------------------------------------------------------------------------------
                                    '
                                    '      'this section then continues the sliding forward from where the last code finished
                                        FOR z = 0 TO %rawInpLength-1
                                    
                                             FOR y = 1 TO %rawInpLength - z
                                                 IF temp1(y) = temp2(y+z) THEN INCR counter
                                             NEXT y
                                        NEXT z
                                    
                                        ? "Kevin original finish:" & STR$(TIMER) & STR$(counter)
                                    END FUNCTION
                                    Last edited by John Gleason; 14 Nov 2009, 08:15 AM. Reason: removed unneeded DECR

                                    Comment


                                    • #38
                                      , it seems that a library of ASM routines to perform common tasks would be of great use to the general PB community
                                      Another one?

                                      PB licensees already have such a library: the compiler itself. The compiler uses a library of ASM routines to perform common tasks; these common tasks are requested by the programmer using the BASIC programming language.

                                      Just for the heck of it, please list some 'common tasks' for which the compiler does not provide an optimized - over a period of at least twenty years - routine you can use simply by typing the right words in your BASIC source code.

                                      Three bucks says it's a really short list.

                                      Not that assembly language cannot be a valuable tool, but that's all it is, a tool. A specialized tool.

                                      If you want to write your programs in assembly language, you could always license an assembler. However, that's really underestimating the skill and experience of the PB compiler writers. Good BASIC code and good design will outperform poor assembly-language code seven days per week.


                                      While I often whine about the quality of the documentation, or decry the design of a particular statement or function, or bemoan the lack of some 'obvious' feature, I have never ever gotten anything but fast and efficient compiled executables from the PB compilers.

                                      MCM
                                      Michael Mattias
                                      Tal Systems (retired)
                                      Port Washington WI USA
                                      [email protected]
                                      http://www.talsystems.com

                                      Comment


                                      • #39
                                        Originally posted by Michael Mattias View Post
                                        ... I often whine ... decry ... bemoan ... MCM
                                        I'm shocked.

                                        ===================================
                                        "Never mistake motion for action."
                                        Ernest Hemingway (1899-1961)
                                        ===================================
                                        It's a pretty day. I hope you enjoy it.

                                        Gösta

                                        JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                        LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                        Comment


                                        • #40
                                          >>... I often whine ... decry ... bemoan ...
                                          >I'm shocked

                                          It's a dirty job, but someone has to do it.
                                          Michael Mattias
                                          Tal Systems (retired)
                                          Port Washington WI USA
                                          [email protected]
                                          http://www.talsystems.com

                                          Comment

                                          Working...
                                          X