You are not logged in. You can browse in the PowerBASIC Community, but you must click Login (top right) before you can post. If this is your first visit, check out the FAQ or Sign Up.
I am comparing arrays of long integers...in one case 16 elements long many times over, and in the second case the string is 300 elements long with less frequency.
I don't see any way to improve on our code. Your code worked and you did it fast...i gave myself a headache writing mine, but it works and is fiarly fast...but in this application (vision processing) speed matters.
Threading would be useful here to break the task up, but that always gives me a bigger headache...but may be unavoidable.
The same basic mistakes are made over and over again in optimising code.
Do not append things to strings. It's very slow.
In Gary's program above, over 90% of the time is spent doing this:
Code:
Matches$ = Matches$ + " " + STR$(i+sLength)
which competely masks the time used doing the comparisons.
When you have many nested loop you really need to look closely at what goes on in the innermost loop as it is usually executed far more than another code and it's therefore more important to optimise that bit.
Again picking on Gary's code for convenience, the inner loop is
Code:
For j = i To sLength
pATemp = pStartA - i + j
pBTemp = pStartB + j - 1
If @pATemp <> @pBTemp Then
iNotMatchFlag = 1
Exit For
End If
Next j
every time around the loop there are unneeded assignments and calculations. They can all be done away with by rearranging things as follows:
Code:
pATemp = pStartA - i
pBTemp = pStartB - 1
FOR j = i TO sLength
IF @pATemp[j] <> @pBTemp[j] THEN
iNotMatchFlag = 1
EXIT FOR
END IF
NEXT j
Look at the inner loop now, there's very little to it, a compare and 1 assignment.
Previously there was an additional 2 assignments, 2 additions and 2 subtractions. That's 3 times the work
You can also gain a bit by making the loop count down instead of up.
Another bonus is that when the inner loop is simplified it's then easier to convert it to ASM for an additional boost in speed.
Also, be careful how you time it. The following is from Microsoft on GetTickCount:
The resolution of the GetTickCount function is limited to the resolution of the system timer, which is typically in the range of 10 milliseconds to 16 milliseconds
No Sweat Gary...I appreciate your efforts...since my scenario uses long ints and does not exit the loops when a match occurs the comments from Paul are not relevant for what I need.
Oh, wait ... I claim "Good Samaritan" protection. The law in Texas says that if you stop to help someone and they accept your help, then you cannot be held liable or sued for the results of your actions.
I wondered if it was possible to get much speed improvement with asm in just the basic long array comparison, which is the innermost loop. So I did an mmx version to find out. I didn't think it would be much faster, but to my surprise it is well over twice as fast on my machine. I got ~ 2.7x on avg. Maybe it would be worth trying to incorporate it. This is CC5 ver.
Changed: removed mmx pand mask
Code:
#COMPILE EXE
#DIM ALL
#REGISTER NONE
%ARRAYSIZE = 300
%COUNTCONST = %ARRAYSIZE \ 2 - 1
FUNCTION PBMAIN () AS LONG
LOCAL a1ptr, a2ptr, arrTot AS LONG, t, t2, tTot, t2tot, addMask AS QUAD
LOCAL asmTot, clrS AS LONG
REGISTER countr AS LONG
DIM arr1(%ARRAYSIZE) AS LONG
DIM arr2(%ARRAYSIZE) AS LONG
a1ptr = VARPTR(arr1(1)) 'we'll not use the zero element
a2ptr = VARPTR(arr2(1))
DO
FOR countr = 1 TO %ARRAYSIZE
arr1(countr) = RND(2243, 2246)
arr2(countr) = RND(2243, 2246)
NEXT
arrTot = 0
TIX t2
FOR countr = 1 TO %ARRAYSIZE
IF arr1(countr) = arr2(countr) THEN INCR arrTot
NEXT
TIX END t2
t2tot += t2
TIX t
!mov countr, %COUNTCONST ;1/2 %ARRAYSIZE because we compare quads rather than longs
!mov eax, a1ptr ;arr1
!mov ecx, a2ptr ;arr2
!mov edx, countr
!pxor mm2, mm2 ;need to clear mmx registers from previous loops
top:
!movq mm0, [eax+edx*8] ;2 arr1 elements
!movq mm1, [ecx+edx*8] ;2 arr2 elements
!pcmpeqd mm1, mm0 ;compare if equal
!paddd mm2, mm1 ;add to running total register
!sub edx, 1
!jae short top
!movq mm0, mm2 ;copy mm2
!psrlq mm0, 32 ;prepare to add hi and lo dwords of mm2 to get total matches
!paddd mm0, mm2 ;do the add
!movd asmTot, mm0 ;save in asmTot
!neg asmTot
!emms ;empty mmx machine state--clear fpu
TIX END t
tTot += t
IF (clrS AND 255) = 0 THEN ? STR$(t2tot / tTot, 2);
IF asmTot <> arrTot THEN ? "NOT EQUAL!": WAITKEY$
INCR clrS
LOOP
END FUNCTION
Last edited by John Gleason; 13 Nov 2009, 11:11 AM.
Reason: removed mmx pand mask
Thanks for the ASM. Not being an ASM coder, I have been unable to integrate your code into mine. You'll note in my inner loop (The first DO loop) that I compare the yth element to the zth element. Your code does not account for the decrementing of the zth element. I tried putting it into a separate register and decrementing it along with y (in your code "y" is equivalent to the countr variable I think), but I am getting an infinite loop...
I think ASM would make major improvements in my code because this is a big bottleneck in the image processing...
I just used strings in my example because it was a bit easier to show what I needed using strings, since the core algorithm remains the same in either case
No, that's not correct at all.
Optimization is application-specific; ergo your 'core algorithm' may not - and probably will not - be the same for different applications. And considering your tool of choice - the PB compiler - contains its own optimizations for LONG integers, your choice of implenentations will surely vary in the strings-v-integers scenario.
What works well in one application does not necessarily work well in another.
And I do believe we learned only in your last post this is some kind of image-processing application; I have to believe 'somewhere out there' are algorithms specifically designed to assist with image-processing... perhaps some approach to do exactly what you really want to do but is not based on "compare two strings in a sliding fashion."
overlap1 = %rawInpLength: overlap2 = 1
z = %rawInpLength
y = 1
DO
DO WHILE z >= overlap1
'DO WHILE y >= 1
'''''Do comparison here
!mov y, %COUNTCONST ;1/2 %ARRAYSIZE because we compare quads rather than longs
!mov dword ptr addMask[0], 1 ;&h0000000100000001 in addMask to mask a max &hffffffffffffffff
!mov dword ptr addMask[4], 1 ; " " "
!mov eax, a1ptr ;arr1
!mov ecx, a2ptr ;arr2
!mov edx, y
!mov ebx, z
!movq mm3, addMask ;pre-load
!pxor mm2, mm2 ;need to clear mmx registers from previous loops
top:
!movq mm0, [eax+edx*8] ;2 arr1 elements
!movq mm1, [ecx+ebx*8] ;2 arr2 elements
!pcmpeqd mm1, mm0 ;compare if equal
!pand mm1, mm3 ;mask
!paddd mm2, mm1 ;add to running total register
!sub ebx, 1
!sub edx, 1
!jae short top
!movq mm0, mm2 ;copy mm2
!psrlq mm0, 32 ;prepare to add hi and lo dwords of mm2 to get total matches
!paddd mm0, mm2 ;do the add
!movd counter, mm0 ;save in asmTot
!emms ;empty mmx machine state--clear fpu
'IF temp1(z) = temp2(y) THEN INCR counter
'z = z - 1
'y = y - 1
'WEND
WEND
DECR overlap1
INCR overlap2
z = %rawInpLength
y = overlap2
LOOP WHILE overlap2 <= %rawInpLength-1
Optimization is application-specific; ergo your 'core algorithm' may not - and probably will not - be the same for different applications. And considering your tool of choice - the PB compiler - contains its own optimizations for LONG integers, your choice of implenentations will surely vary in the strings-v-integers scenario. What works well in one application does not necessarily work well in another.
Uh...sure. Thanks for the programming 101 lesson. But in this case it was the core algorithm (the looping mechanisms) I cared about...not the particular efficiency of comparing strings or integers. Unless someone would generously provide ASM code to optimize that particular comparison, as John has.
And I do believe we learned only in your last post this is some kind of image-processing application; I have to believe 'somewhere out there' are algorithms specifically designed to assist with image-processing... perhaps some approach to do exactly what you really want to do but is not based on "compare two strings in a sliding fashion."
There are thousands of image processing algorithms Michael. The fact that this little bit of code relates to an overall image processing program i am writing does not mean that it relates to image processing "at-large". This code has nothing specific to do with image processing.
I'll look at your version Kevin, and see if I can make it go. Yes there is that problem of odd # comparisons, say, comparing a 101 overlap. The mmx code requires an even # of comparisons, so it may need 2 sections to alternate between.
On the plus side, I removed the mask "pand" command because it makes no difference if you add -1 to itself in a loop, or +1, as long as you just change the minus sign to a + when you're done! That sped it up like 20%. I'll go back and change the original post to the new code.
Code:
'new code removing pand mask:
!mov countr, %COUNTCONST ;1/2 %ARRAYSIZE because we compare quads rather than longs
!mov eax, a1ptr ;arr1
!mov ecx, a2ptr ;arr2
!mov edx, countr
!pxor mm2, mm2 ;need to clear mmx registers from previous loops
top:
!movq mm0, [eax+edx*8] ;2 arr1 elements
!movq mm1, [ecx+edx*8] ;2 arr2 elements
!pcmpeqd mm1, mm0 ;compare if equal
!paddd mm2, mm1 ;add to running total register
!sub edx, 1
!jae short top
!movq mm0, mm2 ;copy mm2
!psrlq mm0, 32 ;prepare to add hi and lo dwords of mm2 to get total matches
!paddd mm0, mm2 ;do the add
!movd asmTot, mm0 ;save in asmTot
!neg asmTot
!emms ;empty mmx machine state--clear fpu
Kevin,
I agree with Michael. It's far better to state the actual problem rather than make up a similar sounding one. A lot of this thread, and the time of those that contributed, went into trying to figure out exactly what was needed and to optimise the strings out of it when there weren't really any strings involved. A clearer statement of the problem would have helped find a solution more quickly.
John,
to my surprise it is well over twice as fast on my machine.
Why should you be surprised at ASM being faster? It usually is.
I haven't tried your code but:
1) add #ALIGN 16 just ahead of the top: label. Repeated branch targets should be aligned on a 16 byte boundary for speed
2) your loop is very short, it would probably benefit from being unrolled a bit
3) MMX instructions like to be done in blocks. If possible, read more than 1 consecutive value into the MMX registers
To make things easier, i have put the full code below. The first section is my original algorithm. The second is my attempt to integrate your code in the inner loop. This is running in PB Win 9.
Code:
#COMPILE EXE
%rawInpLength = 30000
%ARRAYSIZE = 30000
%COUNTCONST = %ARRAYSIZE \ 2 - 1
FUNCTION PBMAIN () AS LONG
REGISTER x AS LONG, y AS LONG, z AS LONG
DIM overlap1 AS LONG, overlap2 AS LONG, counter AS LONG
counter = 0
OPEN "speedtest.log" FOR OUTPUT AS #10
DIM temp1(%rawInpLength) AS STRING
DIM temp2(%rawInpLength) AS STRING
FOR x = 1 TO %rawInpLength
temp1(x) = CHR$(RND(65,90))
temp2(x) = CHR$(RND(65,90))
NEXT x
PRINT#10,"test kevin start:";TIMER
overlap1 = %rawInpLength: overlap2 = 1
z = %rawInpLength
y = 1
DO
DO WHILE z >= overlap1
DO WHILE y >= 1
'''''Do comparison here
IF temp1(z) = temp2(y) THEN INCR counter
z = z - 1
y = y - 1
WEND
WEND
DECR overlap1
INCR overlap2
z = %rawInpLength
y = overlap2
LOOP WHILE overlap2 <= %rawInpLength-1
'this section then continues the sliding forward from where the last code finished
FOR z = 0 TO %rawInpLength-1
FOR y = 1 TO %rawInpLength - z
IF temp1(y) = temp2(y+z) THEN INCR counter
NEXT y
NEXT z
PRINT#10,"Kevin original finish:";TIMER, counter
LOCAL a1ptr, a2ptr, arrTot AS LONG, t, t2, tTot, t2tot, addMask AS QUAD
LOCAL asmTot, clrS AS LONG
a1ptr = VARPTR(temp1(1)) 'we'll not use the zero element
a2ptr = VARPTR(temp2(1))
PRINT#10,"test kevin john start:";TIMER
counter = 0
overlap1 = %rawInpLength: overlap2 = 1
z = %rawInpLength
y = 1
DO
DO WHILE z >= overlap1
'DO WHILE y >= 1
'''''Do comparison here
!mov y, %COUNTCONST ;1/2 %ARRAYSIZE because we compare quads rather than longs
!mov dword ptr addMask[0], 1 ;&h0000000100000001 in addMask to mask a max &hffffffffffffffff
!mov dword ptr addMask[4], 1 ; " " "
!mov eax, a1ptr ;arr1
!mov ecx, a2ptr ;arr2
!mov edx, y
!mov ebx, z
!movq mm3, addMask ;pre-load
!pxor mm2, mm2 ;need to clear mmx registers from previous loops
top:
!movq mm0, [eax+edx*8] ;2 arr1 elements
!movq mm1, [ecx+ebx*8] ;2 arr2 elements
!pcmpeqd mm1, mm0 ;compare if equal
!pand mm1, mm3 ;mask
!paddd mm2, mm1 ;add to running total register
!sub ebx, 1
!sub edx, 1
!jae short top
!movq mm0, mm2 ;copy mm2
!psrlq mm0, 32 ;prepare to add hi and lo dwords of mm2 to get total matches
!paddd mm0, mm2 ;do the add
!movd counter, mm0 ;save in asmTot
!emms ;empty mmx machine state--clear fpu
'IF temp1(z) = temp2(y) THEN INCR counter
'z = z - 1
'y = y - 1
'WEND
WEND
DECR overlap1
INCR overlap2
z = %rawInpLength
y = overlap2
LOOP WHILE overlap2 <= %rawInpLength-1
'this section then continues the sliding forward from where the last code finished
FOR z = 0 TO %rawInpLength-1
FOR y = 1 TO %rawInpLength - z
IF temp1(y) = temp2(y+z) THEN INCR counter
NEXT y
NEXT z
PRINT#10,"test kevin john finish:";TIMER
END FUNCTION
Given the utility of ASM for speed on many common tasks, it seems that a library of ASM routines to perform common tasks would be of great use to the general PB community, especially those not well versed in ASM. Thing like doing string\numeric comparison within an array or within a pair of strings given an index would be useful. I know many of these things are littered around the boards here, but an organized library would be better.
In order to "walk the talk", i will be contributing an image processing library soon that has many image processing algorithms implemented. If you are familiar with Open CV, this has a very small subset of the features of openCV, but could be expanded endlessly. This has great utility for robotics and vision applications. this is not in ASM, but the idea of a library of many common application specific features in one DLL is the same.
For one cotnribution to such a library, i took one of Pauls ASM routines and with slight modification was able to modify it for copying rgbtriple arrays...i use it *alot*. So I have deep gratitude to Paul among others here.
Code:
FUNCTION FastCopy2DRGBTripleArray(Arr1() AS rgbtriple, Arr2() AS rgbtriple) EXPORT AS LONG
DIM src AS LONG, dst AS LONG, ln AS LONG
src = VARPTR(Arr1(1,1))
dst = VARPTR(Arr2(1,1))
ln = UBOUND(Arr1,1) * UBOUND(Arr1,2) * 3
! cld
! mov esi, src
! mov edi, dst
! mov ecx, ln
! shr ecx, 2
! rep movsd
! mov ecx, ln
! and ecx, 3
! rep movsb
END FUNCTION
Kevin, I realized I wasn't testing "apples with apples" here, so I converted your code to LONG's and tried to simplify it too. Does this look correct as it relates to your algorithm using LONG's?
Code:
#COMPILE EXE
#REGISTER NONE
%rawInpLength = 10000
%ARRAYSIZE = %rawInpLength
%COUNTCONST = %ARRAYSIZE \ 2 - 1
FUNCTION PBMAIN () AS LONG
LOCAL x AS LONG, y AS LONG, z AS LONG
DIM overlap1 AS LONG, overlap2 AS LONG, counter AS LONG
LOCAL zTemp AS LONG, tq, counter2 AS QUAD
counter = 0
OPEN "speedtest.log" FOR OUTPUT AS #10
DIM temp1(%rawInpLength) AS LONG'STRING
DIM temp2(%rawInpLength) AS LONG'STRING
FOR x = 1 TO %rawInpLength
temp1(x) = RND(65,90)
temp2(x) = RND(65,90)
' temp1(x) = CHR$(RND(65,90))
' temp2(x) = CHR$(RND(65,90))
NEXT x
' PRINT#10,"test kevin start:";TIMER
overlap1 = %rawInpLength
overlap2 = 1
'-------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------
FOR overlap2 = 1 TO %rawInpLength-1
z = %rawInpLength
FOR y = overlap2 TO 1 STEP -1
'''''Do comparison here
IF temp1(z) = temp2(y) THEN INCR counter
DECR z
NEXT
NEXT
'-------------------------------------------------------------------------------------------------
'-------------------------------------------------------------------------------------------------
'
' 'this section then continues the sliding forward from where the last code finished
FOR z = 0 TO %rawInpLength-1
FOR y = 1 TO %rawInpLength - z
IF temp1(y) = temp2(y+z) THEN INCR counter
NEXT y
NEXT z
? "Kevin original finish:" & STR$(TIMER) & STR$(counter)
END FUNCTION
Last edited by John Gleason; 14 Nov 2009, 08:15 AM.
Reason: removed unneeded DECR
, it seems that a library of ASM routines to perform common tasks would be of great use to the general PB community
Another one?
PB licensees already have such a library: the compiler itself. The compiler uses a library of ASM routines to perform common tasks; these common tasks are requested by the programmer using the BASIC programming language.
Just for the heck of it, please list some 'common tasks' for which the compiler does not provide an optimized - over a period of at least twenty years - routine you can use simply by typing the right words in your BASIC source code.
Three bucks says it's a really short list.
Not that assembly language cannot be a valuable tool, but that's all it is, a tool. A specialized tool.
If you want to write your programs in assembly language, you could always license an assembler. However, that's really underestimating the skill and experience of the PB compiler writers. Good BASIC code and good design will outperform poor assembly-language code seven days per week.
While I often whine about the quality of the documentation, or decry the design of a particular statement or function, or bemoan the lack of some 'obvious' feature, I have never ever gotten anything but fast and efficient compiled executables from the PB compilers.
We process personal data about users of our site, through the use of cookies and other technologies, to deliver our services, and to analyze site activity. For additional details, refer to our Privacy Policy.
By clicking "I AGREE" below, you agree to our Privacy Policy and our personal data processing and cookie practices as described therein. You also acknowledge that this forum may be hosted outside your country and you consent to the collection, storage, and processing of your data in the country where this forum is hosted.
Comment