Continued from http://www.powerbasic.com/support/pb...ad.php?t=41843
Thread was getting way too long. I created a template we can all use to test various methods/optimizations using consistent and real data. To test a different method, just create a new sub using the real life data and variables set upi in PBMain and that's all that will have to be posted here so others can see/test.
'
Sample comparison. Note While Early Exit takes less time an makes fewer compare, it doesn't find any matches so that theory can be tossedm or at least re-examined.
============================
"I love criticism
as it's unqualified praise."
Noel Coward
============================
Thread was getting way too long. I created a template we can all use to test various methods/optimizations using consistent and real data. To test a different method, just create a new sub using the real life data and variables set upi in PBMain and that's all that will have to be posted here so others can see/test.
'
Code:
' ' ' ******************************************************* ' ' Download and unzip Data.zip at [URL]http://www.swedesdock.com/powerbasic/Data.zip[/URL] (6 meg) ' It currently contains 4 different DNA sequences. ' ' PBMain Anyone can use To test methods/optimizations/etceteraetceteraetcetera ' using Globally consistent criteria/ ' ' All that has To be done Is Create a Sub/Function and use the parameters within the Sub ' that are set up In "Sub Reset_Start_Values" and there will be consistent reporting results. ' ' ' ******************************************************* ' ' 'Notes to myself ' (Post 38) - "typically 20% for DNA searches between somewhat divergent genome sequences). " 'thoughts ' 1) maybe check only 10 with 2 mismatches. maybe faster. if get a hit then check longer sequence ' 2) maybe use value arrays? (123) '>>> 3) if first char mismatch, move to next - cannot start with a mismatch? -- No good '*************************** 'fragment 28 long with 9 unknowms 'we know in this example there must be at least 1 3 letter matching sequence in a matching fragment (28 \ 9) 'ex 1111111111111111111111111111 'ex 1101101101101101101101101101 '0 spaced as far apart as possible 'ex 1 2 3 4 5 6 7 8 9 unknowns '************* 'fragment 30 long with 6 (20%) unknowns 'we know in this example there must be at least 1 4 letter matching sequence in a matching fragment (30 \ 6) 'ex 111111111111111111111111111111 ' 111101111011110111101111011110 'ex 1 2 3 4 5 6 ' ******************************************************* 'reminder - Registering speeds up nearly 30% ' #Register None ' Register b As Long 'used second most ' Register c As Long 'used most ' ******************************************************* '************************** #Compile Exe '#Debug Display On' increases time as much as 50% longer when on DefLng a-z ' Global f_File() As String, _ f_Title() As String, _ f_Source() As String, _ f_Name() As String, _ f_Sequence() As String, _ f_Length() As Long ' ''consistency through routines Global Type_Run, query, Search As String Global Search_Length, query_Length, Match_Length, Mismatches_Allowed As Long Global Total_Compares, Total_Matches, Mismatch_Total, scan_total, Match_Found As Long Global q1, q2, Early_Exit_Flag As Long Global start_time, tmr, tmr1, tmr2 As Single '************************** Macro pf = Print #1, '************************** ' ' ******************************************************* 'http://www.powerbasic.com/support/pbforums/showthread.php?t=41843&page=4 Sub y_Post_80_Code #Register None Register c As Long Register scan_total As Long ' Local searchPtr, queryPtr, a_Ptr, b_Ptr As Byte Ptr queryPtr = StrPtr(query$) searchPtr = StrPtr(Search$) Total_Matches = 0 '' ------------------------------------------------------------------------------------------------- Rem Scan individual characters within the Window For a = 1 To Len(query$) - match_Length + 1 a_Ptr = queryPtr + a For b = 1 To Len(Search$) - match_Length + 1 Rem Initalise counters Mismatch_Total = 0: scan_total = 0: Match_Found = 0: Rem To keep running totals b_Ptr = searchPtr + b Rem Scan subfragments From Search$ And query$ For c = 0 To match_Length - 1 Incr Total_Compares Incr scan_total If ((@b_Ptr[c] - 48) And (@a_Ptr[c] - 48)) = 0 Then Incr Mismatch_Total ' If Early_Exit_Flag = 1 And c = 0 And Mismatch_Total = 0 Then Exit For Rem early Exit If too many mismatches Or match total gone beyond maximum needed If Mismatch_Total > Mismatches_Allowed Then Match_Found = 0: Exit For If scan_total - Mismatch_Total >= Matches_Needed Then Match_Found = 1 Next c Rem Found a match If Match_Found Then Incr Total_Matches Next b Next a End Sub ' ******************************************************* ' ******************************************************* ' Data at [URL]http://www.swedesdock.com/powerbasic/Data.zip[/URL] (6 meg) Sub z_Get_Fragments tmr! = Timer ln = 4 Dim f_File(1 To ln) As String, _ f_Title(1 To ln) As String, _ f_Source(1 To ln) As String, _ f_Name(1 To ln) As String, _ f_Sequence(1 To ln) As String, _ f_Length(1 To ln) As Long f_File$(1) = "C:\Only_My_Programs\DNA\Data\ID_E_Coli.txt" f_File$(2) = "C:\Only_My_Programs\DNA\Data\Id_46442158.txt" f_File$(3) = "C:\Only_My_Programs\DNA\Data\ID_126212636.txt" f_File$(4) = "C:\Only_My_Programs\DNA\Data\ID_126213112.txt" For ctr = 1 To 4 z_Parse_Dna_Data_Files(f_File$(ctr), f_Title$(ctr), f_Source$(ctr), f_Name$(ctr), f_Sequence$(ctr), f_Length(ctr)) Next ctr ln = 10 s$ = Left$(EC_Sequence$, ln) & " - " & EC_Title$ & " - " & EC_Source$ & " - " & EC_Name$ & " - " & Using$(" #, ", EC_Length) & $CrLf & $CrLf For ctr = LBound(f_File$()) To UBound(f_File$()) s$ = s$ & Left$(f_Sequence$(ctr), ln) & " - " & f_Title$(ctr) & " - " & f_Source$(ctr) & " - " & f_Name$(ctr) & " - " & Using$(" #, ", f_Length(ctr)) & $CrLf & $CrLf Next ctr ' ? s$,,Using$("Done #.## seconds ", Timer - tmr!) End Sub ' ' ******************************************************* ' Function z_Parse_DNA_Data_Files( _ pFileName As String, _ pTitle As String, _ pSource As String, _ pName As String, _ pSequence As String, _ pSequence_Length As Long) As Long ' Remove_Line_Numbering_Flag As Long, _ '*** File structure '\Title\ E Coli strand \End Title\ '\Source\ [URL]http://www.virus-evolution.org/Downloads/[/URL] \End Source\ '\Name\ E Coli \End Name\ 'GCTTTTCATTCTGACTGCAACGG ....... ' ' or files that have line numbers '\Title\ Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.2, whole genome shotgun sequence \End Title\ '\Source\ [URL]http://www.ncbi.nlm.nih.gov/nuccore/126212636[/URL] \End Source\ '\Name\ GenBank: AAVQ01000002.1 \End Name\ ' 1 aataaaaagg gggccccaat tatcaatttt cctccaagaa tacaatgaac accttgtcat ' 61 taataatata tttcctttga ttatgggcaa tttttgttca gaggggggtt tccatttcca '... '*** End File structure ErrClear fle$ = pFileName Open fle$ For Binary As #1 ln = Lof(#1) fil$ = Space$(ln) 'make big string Get #1, 1, Fil$ 'fill it with file contents Close #1 ''' get data s$ = "\Title\" : i = Len(s$) + 1 : 'start of Title s1$ = "\End Title\": i1 = InStr(fil$, s1$) - Len(s$) - 1 : 'end of Title pTitle$ = Trim$(Mid$(fil$, i, i1)) ' get title ' ? left$(pTitle$, 100),,"1" ' s$ = "\Source\": i = InStr(fil$, s$) + Len(s$):' find source of data s1$ = "\End Source\": i1 = InStr(fil$, s1$) - i - 1 pSource$ = Trim$(Mid$(fil$, i, i1)) ' s$ = "\Name\": i = InStr(fil$, s$) + Len(s$): 'find start of sequence name s1$ = "\End Name\": i1 = InStr(fil$, s1$) - i - 1 pName$ = Trim$(Mid$(fil$, i, i1)) ' get title ' s$ = "\End Name\": i = InStr(fil$, s$) + Len(s$) 'beginning of sequence pSequence$ = Mid$(fil$, i + 1) ''' got data '*** pSequence$ = Remove$(pSequence$, Any "1234567890") pSequence$ = Remove$(pSequence$, $CrLf) pSequence$ = Remove$(pSequence$, $Cr) 'JIC different format pSequence$ = Remove$(pSequence$, $Lf) 'JIC pSequence$ = Remove$(pSequence$, " ") '*** pSequence_Length = Len(pSequence) pSequence$ = UCase$(pSequence$) 'all same case Replace Any "ACGT" With "1248" In pSequence$ 'use numbers If Err Then ? fle$ & $CrLf & $CrLf & Error$(Err),, Using$(" Error # in ", Err) & FuncName$ End If '*** Function = Err End Function ' ' ******************************************************* ' Function y_PDpost118 () As Long #Register None Local searchPtr, queryPtr, a_Ptr, b_Ptr As Byte Ptr matches_needed = match_length - Mismatches_Allowed queryPtr = StrPtr(query$) For a = 0 To query_length-1: @queryPtr[a] = @queryPtr[a] -48:Next query$ = query$ + String$(32,Chr$(16)) 'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid queryPtr = StrPtr(query$) 'update string pointer because it changed on the above line searchPtr = StrPtr(search$) For a = 0 To search_length-1: @searchPtr[a] = @searchPtr[a] - 48: Next search$ = search$ + String$(32,Chr$(16)) 'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid searchPtr = StrPtr(search$) 'update string pointer because it changed on the above line total_matches = 0 'generate the masks used for non-multiples of 8 sequence lengths Mask0&& = &hffffffffffffffff Mask1&& = &hffffffffffffffff Mask2&& = &hffffffffffffffff Mask3&& = &hffffffffffffffff Select Case match_length Case 1 To 8 Shift Left mask0&&, 8 * match_length Case 9 To 16 Mask0&& = &h0 Shift Left mask1&&,8 * (match_length -8) Case 17 To 24 Mask0&& = &h0 Mask1&& = &h0 Shift Left mask2&&,8 * (match_length -16) Case 25 To 32 Mask0&& = &h0 Mask1&& = &h0 Mask2&& = &h0 Shift Left mask3&&,8 * (match_length -24) End Select alimit = Len(query$) - match_length -32 'take off 32 as there's 32 bytes of padding at the end For b = 1 To Len(search$) - match_length -32 'take off 32 as there's 32 bytes of padding at the end b_Ptr = searchPtr + b !mov esi,b_ptr 'get pointer into register ready for the loop !movq mm0,[esi] 'get 32 bytes to compare for this attempt !movq mm1,[esi+8] !movq mm2,[esi+16] !movq mm3,[esi+24] ' !por mm0,mask0&& 'mask off the bytes that aren't to be included !por mm1,mask1&& !por mm2,mask2&& !por mm3,mask3&& !mov edi,queryPtr 'get pointer to other string !mov edx,Mismatches_Allowed ' FOR a = LEN(query$) - match_length to 1 step -1 !mov ecx, alimit #Align 16 aloop: Incr Total_Compares '<<<<<<< GHL added this !movq mm4,[edi+ecx] 'get 32 bytes to check !movq mm5,[edi+ecx+8] !movq mm6,[edi+ecx+16] !movq mm7,[edi+ecx+24] !pand mm4,mm0 'do the 32 byte ANDs !pxor mm0,mm0 'clear mm0 after use so we have a zero register !pand mm5,mm1 !pand mm6,mm2 !pand mm7,mm3 !pcmpeqb mm4,mm0 'do the 32 bytes worth of compare with zero !pcmpeqb mm5,mm0 !pcmpeqb mm6,mm0 !pcmpeqb mm7,mm0 !paddb mm4,mm5 'add up how many mis-matched !paddb mm4,mm6 !paddb mm4,mm7 !PSADBW mm4,mm0 !movq mm0,[esi] 'reload the low 8 bytes !PExtRW eax,mm4,0 'get the answer into a CPU register, it's negative so -4 means 4 mismatches !Add al,dl 'Mismatches_Allowed 'if this overflows, condition is met and carry is set !adc total_matches,0 'add carry into total to count matches ' NEXT a !dec ecx 'a loop !jnz aloop !emms 'restore FPU state after using MMX registers before using FPU again Next b End Function ' ' ******************************************************* ' Sub z_Update_Results 'keep a record of results Open "Testing_Results02.txt" For Append As #1 s_Title$ = "Time Match Compares Comparing Using Searching in For" ' Print #1, s_Title$ 'only need to print once pf " " s$ = "#.## ## ###,###,### (#) and (#) \ \ #) \ \ #) \ \ " Print #1, Using$(s$, _ Timer - Start_Time!, _ Total_Matches, _ Total_Compares, _ q1, q2, Type_Run$, q1, f_Name$(q1), q2, f_Name$(q2)) Close #1 ' End Sub ' ' ******************************************************* ' Macro Reset_Start_Values search_Length = 6000 query_Length = 5000 Match_Length = 22 Mismatches_Allowed = 5 Total_Matches = 0 Total_Compares = 0 'using beginning of sequences to test Search$ = Left$(f_Sequence$(q2), search_Length) ' query$ = Left$(f_Sequence$(q1), query_Length) ' Start_Time! = Timer End Macro ' ' ******************************************************* ' Function PBMain () As Long Randomize 1 'get same start each time Call z_Get_Fragments 'gets samples to work with ' ******************************************************* q1 = 1 ' 1 = whole e coli sequence q2 = 1 'will start as sequence 2 below ' ******************************************************* tmr! = Timer While q2 < UBound(f_Sequence$()) 'Rem this if only want to run one test Incr q2 If q2 > UBound(f_Sequence$()) Then 'JIC ? Using$("# entered # is Maxxed out", q2, UBound(f_Sequence$())),, "Bad file number" Exit Function End If ' Reset_Start_Values 'standardize searches through all tests Type_Run$ = "Post 80 - Peter S. ," : Call y_Post_80_Code: z_Update_Results ' Reset_Start_Values 'standardize searches through all tests ' Early_Exit_Flag = 1: Type_Run$ = "Post 80 - early exit,": Call y_Post_80_Code: z_Update_Results: Early_Exit_Flag = 0 ' Reset_Start_Values 'standardize searches through all tests Type_Run$ = "PDpost118 - with Total_Compares": Call y_PDpost118: z_Update_Results ' Wend 'Rem this if only want to run one test ' ******************************************************* ? Using$("Done #.## seconds ", Timer - Tmr!),, FuncName$ ' End Function '
Time Match Compares Comparing Using Searching in For
3.38 2 236,308,770 (1) and (2) Post 80 - Peter S. 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
2.69 0 178,553,581 (1) and (2) Post 80 - early exi 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
3.38 2 236,308,770 (1) and (2) Post 80 - Peter S. 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
2.69 0 178,553,581 (1) and (2) Post 80 - early exi 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
"I love criticism
as it's unqualified praise."
Noel Coward
============================
Comment