This thread is continued here: http://www.powerbasic.com/support/pb...ad.php?t=42697
==============================
"Those who are easily shocked
should be shocked more often."
Mae West (1893-)
==============================
Announcement
Collapse
No announcement yet.
DNA Part Deux
Collapse
X
-
Back to Peter's problem
John G.,
Code looks good. I'll make the minor change when I get a chance.
Returning now to Peter's original problem in the previous thread (he states the specific problem here http://www.powerbasic.com/support/pb...9&postcount=58 ).
Here's my formulation for testing his problem. I realize it differs from the tests that others have done.
a. Take a virus sequence (not Peter's virus, but mine, which is probably much longer than his--48.5K vs ?)
b. Extract strings of various lengths from random positions. For starters I extract 100 strings of 100 bases each but this can be varied.
c. For each position in each string, change the position with a certain probability (termed the "Mutation rate" in the code below) to any of the other three possibilities, chosen with equal probability [for the cognoscenti, this is an oversimplification; A<->G and C<->T interconversions, called "transitions", are much more likely than (A or G) <-> (C or T), known as "transversions". But that's irrelevant to this problem, since any change is a mismatch ]. In other words, if RND<MutationRate, change the position. This generates a set of strings that differ from the starting one by an average close to the mutation rate. I used a mutation rate of 0.2 to get going.
d. Use the same procedure we used for the previous restriction mapping problem (with John G's major speed boost) to test for matches to the search string, allowing for a user-defined error rate, and tabulating # of matches and mismatches (i.e. not bailing when there's a mismatch--makes it take longer of course)
Code (in PB/CC 4.04--note the stylish use of #DIM ALL ) for generating query strings (equivalent to Peter's 10-20,000 sequences):
Code:#COMPILE EXE #DIM ALL FUNCTION PBMAIN () AS LONG 'This is ExtractSegments1.bas, JWL 1/4/10, purpose to extract random 100-bp segments from lambda sequence, introduce random mutations, no indels, store them into a file for pattern matching later. LOCAL LambdaSeq,a,b,c AS STRING LOCAL LengthLambda,j,k,l,r,s,t,Segment,NumberMutations AS LONG LOCAL d,e,f,MutationRate AS SINGLE OPEN "lambda.txt" FOR INPUT AS #1 LINE INPUT #1,LambdaSeq$ LengthLambda&=LEN(LambdaSeq$) PRINT "length of lambda sequence is ";LengthLambda OPEN "LambdaSegments1.txt" FOR OUTPUT AS #2 DIM RandomSegments(1 TO 100) AS STRING 'randomize 'For now generate the same strings each time '******** generate query strings at random *********** FOR j&=1 TO 100 'number is hard-wired, could be input by user Segment&=INT(RND*(LengthLambda&-100)) '-100 because don't want to choose a segment starting in last 100 bp; again 100 is hard-wired, length of query string a$=MID$(LambdaSeq$,Segment&,100) PRINT a$ 'now introduce mutations into query strings MutationRate!=0.2:NumberMutations&=0 'Mutation rate is hard-wired, could be varied FOR k&=1 TO 100 'number is length b$=REMOVE$("ACGT",MID$(a$,k&,1)) 'print b$,mid$(a$,k&,1):waitkey$ IF RND<MutationRate THEN INCR NumberMutations& r&=1+INT(RND*3+0.0001) 'print r&; MID$(a$,k&,1)=MID$(b$,r&,1) 'replace bases at random; transitions (A<->G or C<->T) equally likely as transversions ((A or G) <-> (C or T)) END IF NEXT k& PRINT a$,NumberMutations& PRINT #2,a$;",";Segment&;",";NumberMutations& NEXT j& WAITKEY$ END FUNCTION
Code:#COMPILE EXE #DIM ALL FUNCTION PBMAIN () AS LONG 'this is JWLFindMismatches1.bas, 1/4/10, purpose to find places in lambda sequence that match even in the face of mismatches ' ************* GET THE QUERY SEQUENCES *************** LOCAL i,j,k,l,LengthLambda,NumberOfQuerySeq,MaxLengthQuerySeq,Match,Mismatch AS LONG LOCAL a,b,c,SearchSequence,LambdaSeq AS STRING LOCAL tim AS SINGLE OPEN "LambdaSegments1.txt" FOR INPUT AS #1 FILESCAN #1, RECORDS TO i& DIM QuerySequences(1 TO i&) AS STRING: DIM LocationMatch(1 TO i&) AS LONG:DIM NumberMismatches(1 TO i&) AS LONG j&=0 WHILE NOT EOF(1) INCR j& LINE INPUT #1, a$ QuerySequences$(j&)=PARSE$(a$,1):LocationMatch&(j&)=VAL(PARSE$(a$,2)):NumberMismatches&(j&)=VAL(PARSE$(a$,3)) IF LEN(QuerySequences$(j&))>MaxLengthQuerySeq& THEN MaxLengthQuerySeq&=LEN(QuerySequences$(j&)) WEND NumberOfQuerySeq&=j& PRINT "Number of Query Sequences, max length are ";NumberOfQuerySeq&;MaxLengthQuerySeq& DIM QueryPositions(1 TO NumberOfQuerySeq&,1 TO MaxLengthQuerySeq&) AS STRING DIM QueryNum(1 TO NumberOfQuerySeq&,1 TO MaxLengthQuerySeq&) AS BYTE ' Put them into a 2D array FOR j&=1 TO NumberOfQuerySeq& 'print QuerySequences$(j&) FOR k&=1 TO LEN(QuerySequences$(j&)) QueryPositions$(j&,k&)=MID$(QuerySequences$(j&),k&,1) QueryNum(j&,k&)=ASC(QuerySequences$(j&),k&)':print k&,QueryNum(j&,k&) NEXT k& NEXT j& SearchSequence$="Lambda.txt" ' *********** GET the SEARCH SEQUENCE (LAMBDA FOR NOW) AND PUT INTO AN ARRAY ************* OPEN SearchSequence$ FOR BINARY AS #2 GET$ #2, LOF(#2) - 2, LambdaSeq$ 'print LambdaSeq$ LengthLambda&=LEN(LambdaSeq$) DIM Lambda(1 TO LengthLambda&) AS STRING DIM LambdaNum(1 TO LengthLambda&) AS BYTE 'tim!=TIMER FOR j&=1 TO LengthLambda& 'Lambda$(j&)=MID$(LambdaSeq$,j&,1) 'not needed for this purpose LambdaNum(j&)=ASC(LambdaSeq$,j&) NEXT j& 'PRINT "Takes ";TIMER-tim!;" sec to put sequence into an array" 'Now try to find close matches PRINT PRINT "Seq # matches # mismatches Known # ID'd Location Known location Error?" PRINT " of mismatches of match of match" PRINT tim!=TIMER FOR i&=1 TO NumberOfQuerySeq& FOR j&=0 TO LengthLambda&-MaxLengthQuerySeq& Match&=0:MisMatch&=0 FOR k&=1 TO LEN(QuerySequences$(i&)) IF QueryNum(i&,k&)=LambdaNum(j&+k&) THEN INCR Match& ELSE INCR Mismatch& 'comparing numbers not strings (John Gleason refinement) NEXT k& IF Match&/(Match&+Mismatch&)>0.6 THEN PRINT i&,Match&,Mismatch&,NumberMismatches&(i&) ,j&+1,LocationMatch(i&),:IF j&+1<>LocationMatch&(i&) THEN PRINT "Not the same" ELSE PRINT NEXT j& NEXT i& PRINT "Search took ";TIMER - tim!;" sec" PRINT "Done" WAITKEY$ END FUNCTION
The file "lambda.txt" was posted at http://www.powerbasic.com/support/pb...0&postcount=83
This takes about 7" on my (evidently pokey) computer. Not as fast as others' speed tests earlier in this thread and in the previous one, but it's looking in a long sequence for matches to longer strings (100 bases not <=30 as in most previous tests). I'd guess that if Peter ran 10000 query sequences against a virus 10000 bases long he'd find the matches in <5 minutes. The time would depend on how long his unknowns are, I never heard that figure.
Criteria for identifying a match could also be varied (hard-wired for now). It could also bail if for example the number of mismatches exceeded some threshold as was done in some earlier approaches to the problem; that would speed it up further.
One critical aspect of the pattern-matching problem, not addressed in this code, is what to do if there are small insertions or deletions in the query sequences. From Peter's brief description this doesn't seem likely in his case, but it happens all the time in other biological systems. Allowing for this possibility would make the problem MUCH more challenging. Programs like ClustalW (http://www.ebi.ac.uk/Tools/clustalw2/index.html) do this. I don't know how they work, but ClustalW allows the user to specify penalties for gaps. BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) also puts in gaps to maximize alignments when needed.
JohnLast edited by John Little; 5 Jan 2010, 02:20 PM.
Leave a comment:
-
-
I added some "whitespace" between sections for improved readability, and some comments. Turns out it matches post #103 output, so accuracy looks good! I didn't notice post #108 until I'd done the mods, so this is still based on #103. I doubt there is a lot of difference, but certainly feel free to check if you think it'll be significant.
btw, I tried the arrays as LONG but actually saw a slowdown in time. Converting from string to numeric comparisons was the biggie.
Code:#COMPILE EXE '#DIM ALL 'I'd highly recommend using DIM ALL FUNCTION PBMAIN () AS LONG ' This is JWLPatternMatch1.bas, 1/2/10, purpose to use arrays to find matches. First make it work for unambiguous query strings (like GAATTC). OPEN "c:\TypeII_REs--3.txt" FOR INPUT AS #1 FILESCAN #1, RECORDS TO x&:PRINT "Total lines are ";x& DIM FileLines$(1 TO x&) LINE INPUT #1, FileLines$() FOR j&=1 TO x& 'Count the number of enzymes IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k& NEXT j& NumberOfEnzymes&=k& PRINT "Number of enzymes is ";NumberOfEnzymes& DIM Enzymes$(1 TO k&,1 TO 3):k&=0 '1 is enzyme name, 2 is sequence, 3 PROB TEMP is number of sites in SearchString FOR j&=1 TO x& IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k&:Enzymes$(k&,1)=EXTRACT$(FileLines$(j&)," "):Enzymes$(k&,2)=TRIM$(MID$(FileLines$(j&),31)) NEXT j& FOR b&=1 TO k& SearchString$=Enzymes$(b&,2) 'PRINT Enzymes$(b&,1),SearchString$ ',c$, NEXT b& 'Now deal with a test sequence OPEN "c:\lambda.txt" FOR BINARY AS #2 GET$ #2, LOF(#2) - 2, LambdaSeq$ '-2 because I have a $CRLF at end of data which needs to be removed 'LINE INPUT #2,LambdaSeq$ '<< very slow with big lines LengthLambda&=LEN(LambdaSeq$) PRINT "Length of sequence is ";LengthLambda& DIM Lambda$(1 TO LengthLambda&) DIM LambdaNum(1 TO LengthLambda&) AS BYTE tim!=TIMER FOR j&=1 TO LengthLambda& Lambda$(j&)=MID$(LambdaSeq$,j&,1) lambdaNum(j&)=ASC(LambdaSeq$,j&) NEXT j& tim!=TIMER NonambigEnzymes&=0 FOR j&=1 TO NumberOfEnzymes& Hits&=0 Query$=Enzymes$(j&,2) lenQuery& = LEN(Query$) 'variable added here for improved speed REDIM PositionArray$(1 TO lenQuery& ) REDIM positionByteArr(1 TO lenQuery& ) AS BYTE 'numeric array much faster than string array REDIM IsAmbiguous&(1 TO lenQuery&) Ambig&=0 IF INSTR(Query$,ANY "BDHKMNRSVWY") THEN Ambig&=1 SELECT CASE Ambig& CASE 0 'Ones without ambiguous positions INCR NonambigEnzymes& FOR k&=1 TO lenQuery& positionByteArr(k&) = ASC(Query$,k&) ' PositionArray$(k&)=MID$(Query$,k&,1) NEXT k& FOR k&=1 TO LengthLambda&-lenQuery& Match&=1 FOR l&=1 TO lenQuery& IF LambdaNum(k&+l&-1)<>positionByteArr(l&) THEN Match&=0:EXIT FOR ' IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR NEXT l& IF Match&=1 THEN INCR Hits& NEXT k& Enzymes$(j&,3)=STR$(Hits&) CASE 1 'Those with ambiguous positions GOSUB AmbiguousArray FOR k&=1 TO LengthLambda&-lenQuery& Match&=1 FOR l&=1 TO lenQuery& SELECT CASE IsAmbiguous&(l&) CASE 0 IF LambdaNum(k&+l&-1)<>positionByteArr(l&) THEN Match&=0:EXIT FOR ' IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR CASE 1 IF INSTR(Lambda$(k&+l&-1),ANY PositionArray$(l&))=0 THEN Match&=0:EXIT FOR END SELECT NEXT l& IF Match&=1 THEN INCR Hits& 'print Enzymes$(j&,1),Enzymes$(j&,2),mid$(LambdaSeq$,k&,lenQuery&):waitkey$ 'Just to test that it's finding the proper sites NEXT k& Enzymes$(j&,3)=STR$(Hits&) END SELECT NEXT j& PRINT "Search took ";TIMER-tim!;" sec"; ttot&& PRINT PRINT "Hit any key to show # of hits":WAITKEY$ OPEN "enzymeChkTemp1.txt" FOR OUTPUT AS #4 FOR j&=1 TO NumberOfEnzymes& PRINT #4, Enzymes$(j&,1),Enzymes$(j&,2),"Hits: ";Enzymes$(j&,3) NEXT j& ? "printed" WAITKEY$ EXIT FUNCTION AmbiguousArray: FOR k&=1 TO lenQuery& Test$=MID$(Query$,k&,1) b&=INSTR(Test$, ANY "ACGT") IF b& THEN PositionArray$(k&)=Test$:IsAmbiguous&(k&)=0:positionByteArr(k&)=ASC(Test$) IF b&=0 THEN SELECT CASE Test$ CASE "B":AmbiguousID$="CGT" CASE "D":AmbiguousID$="AGT" CASE "H":AmbiguousID$="ACT" CASE "K":AmbiguousID$="GT" CASE "M":AmbiguousID$="AC" CASE "N":AmbiguousID$="ACGT" CASE "R":AmbiguousID$="AG" CASE "S":AmbiguousID$="CG" CASE "V":AmbiguousID$="ACG" CASE "W":AmbiguousID$="AT" CASE "Y":AmbiguousID$="CT" END SELECT PositionArray$(k&)=AmbiguousID$ IsAmbiguous&(k&)=1 END IF NEXT k& RETURN END FUNCTION
Leave a comment:
-
-
Speed Demon , coowool. Now what I need to aspire to is speed AND accuracy demon.I'll look at the code more tomorrow to check that elusive accuracy.
Leave a comment:
-
-
>>is to take care of any stray characters at the end?
Yes, that's right. You don't want the -2 if there is no CR & LF there.
The LONG's are probably somewhat faster than the BYTE's in those loops, but I'm not sure if they can be made LONG's. Even if so, the speed difference won't be nearly as big as that between STRING and BYTE. It's most important to get into numeric rather than string comparisons. Number comparisons are way faster than string comparisons.
Leave a comment:
-
-
Originally posted by John Little View PostAlso I assume the statement
Code:GET$ #2, LOF(#2) - 2, LambdaSeq$
thanks!
John L
The -2 in "LOF(#2) - 2" to stop short of the $CrLf (carraige return linefeed ) at the end of the line. They aren't stray chars but line delimiters (markers) at the end of each line in a text file.
Glad to see John G here. He's one of the speed demons of PB.
=====================================================
"I myself have never been able to find out precisely
what feminism is;
I only know that people call me a feminist
whenever I express sentiments that differentiate me
from a doormat or a prostitute."
Rebecca West (1913)
=====================================================
Leave a comment:
-
-
John G.,
Yowzah! With the big search sequence of 4639675 bases it went from 145" to 38" (after I added the 2-3% improvement I mentioned earlier and fixing the upper bound for the loop, which was off by one, making it "FOR k&=0 TO LengthLambda&-lenQuery&"). Good job!
I'm puzzled why this should make such a huge difference, especially since the manual says
You should not use Byte variables in FOR/NEXT loops, as they are highly inefficient.
Code:GET$ #2, LOF(#2) - 2, LambdaSeq$
thanks!
John L
Leave a comment:
-
-
Also John (L), your code compares 1-byte strings a lot, so I converted them to number bytes (BYTE), which speeds it way up. I also made a variable for LEN(Query$) because it was called a bunch.
btw, check its correctness too.I did check, but you never know.
Code:#COMPILE EXE '#DIM ALL FUNCTION PBMAIN () AS LONG ' This is JWLPatternMatch1.bas, 1/2/10, purpose to use arrays to find matches. First make it work for unambiguous query strings (like GAATTC). OPEN "c:\TypeII_REs--3.txt" FOR INPUT AS #1 FILESCAN #1, RECORDS TO x&:PRINT "Total lines are ";x& DIM FileLines$(1 TO x&) LINE INPUT #1, FileLines$() FOR j&=1 TO x& 'Count the number of enzymes IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k& NEXT j& NumberOfEnzymes&=k& PRINT "Number of enzymes is ";NumberOfEnzymes& DIM Enzymes$(1 TO k&,1 TO 3):k&=0 '1 is enzyme name, 2 is sequence, 3 PROB TEMP is number of sites in SearchString FOR j&=1 TO x& IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k&:Enzymes$(k&,1)=EXTRACT$(FileLines$(j&)," "):Enzymes$(k&,2)=TRIM$(MID$(FileLines$(j&),31)) NEXT j& FOR b&=1 TO k& SearchString$=Enzymes$(b&,2) 'PRINT Enzymes$(b&,1),SearchString$ ',c$, NEXT b& 'Now deal with a test sequence OPEN "c:\lambda.txt" FOR BINARY AS #2 GET$ #2, LOF(#2) - 2, LambdaSeq$ 'LINE INPUT #2,LambdaSeq$ LengthLambda&=LEN(LambdaSeq$) PRINT "Length of sequence is ";LengthLambda& DIM Lambda$(1 TO LengthLambda&) DIM LambdaNum(1 TO LengthLambda&) AS BYTE tim!=TIMER FOR j&=1 TO LengthLambda& Lambda$(j&)=MID$(LambdaSeq$,j&,1) lambdaNum(j&)=ASC(LambdaSeq$,j&) NEXT j& tim!=TIMER NonambigEnzymes&=0 FOR j&=1 TO NumberOfEnzymes& Hits&=0 Query$=Enzymes$(j&,2) lenQuery& = LEN(Query$) REDIM PositionArray$(1 TO lenQuery& ) REDIM positionByteArr(1 TO lenQuery& ) AS BYTE REDIM IsAmbiguous&(1 TO lenQuery&) Ambig&=0 IF INSTR(Query$,ANY "BDHKMNRSVWY") THEN Ambig&=1 SELECT CASE Ambig& CASE 0 'Ones without ambiguous positions INCR NonambigEnzymes& FOR k&=1 TO lenQuery& positionByteArr(k&) = ASC(Query$,k&) ' PositionArray$(k&)=MID$(Query$,k&,1) NEXT k& FOR k&=1 TO LengthLambda&-lenQuery& Match&=1 ' TIX t&& FOR l&=1 TO lenQuery& IF LambdaNum(k&+l&-1)<>positionByteArr(l&) THEN Match&=0:EXIT FOR ' IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR NEXT l& ' ? PositionArray$(l&) ' TIX END t&& ' ttot&& += t&& IF Match&=1 THEN INCR Hits& NEXT k& Enzymes$(j&,3)=STR$(Hits&) CASE 1 'Those with ambiguous positions GOSUB AmbiguousArray FOR k&=1 TO LengthLambda&-lenQuery& Match&=1 FOR l&=1 TO lenQuery& SELECT CASE IsAmbiguous&(l&) CASE 0 IF LambdaNum(k&+l&-1)<>positionByteArr(l&) THEN Match&=0:EXIT FOR ' IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR CASE 1 IF INSTR(Lambda$(k&+l&-1),ANY PositionArray$(l&))=0 THEN Match&=0:EXIT FOR END SELECT NEXT l& IF Match&=1 THEN INCR Hits& 'print Enzymes$(j&,1),Enzymes$(j&,2),mid$(LambdaSeq$,k&,lenQuery&):waitkey$ 'Just to test that it's finding the proper sites NEXT k& Enzymes$(j&,3)=STR$(Hits&) END SELECT NEXT j& PRINT "Search took ";TIMER-tim!;" sec"; ttot&& PRINT PRINT "Hit any key to show # of hits":WAITKEY$ FOR j&=1 TO NumberOfEnzymes& PRINT Enzymes$(j&,1),Enzymes$(j&,2),"Hits: ";Enzymes$(j&,3) NEXT j& WAITKEY$ EXIT FUNCTION AmbiguousArray: FOR k&=1 TO lenQuery& Test$=MID$(Query$,k&,1) b&=INSTR(Test$, ANY "ACGT") IF b& THEN PositionArray$(k&)=Test$:IsAmbiguous&(k&)=0:positionByteArr(k&)=ASC(Test$) IF b&=0 THEN SELECT CASE Test$ CASE "B":AmbiguousID$="CGT" CASE "D":AmbiguousID$="AGT" CASE "H":AmbiguousID$="ACT" CASE "K":AmbiguousID$="GT" CASE "M":AmbiguousID$="AC" CASE "N":AmbiguousID$="ACGT" CASE "R":AmbiguousID$="AG" CASE "S":AmbiguousID$="CG" CASE "V":AmbiguousID$="ACG" CASE "W":AmbiguousID$="AT" CASE "Y":AmbiguousID$="CT" END SELECT PositionArray$(k&)=AmbiguousID$ IsAmbiguous&(k&)=1 END IF NEXT k& RETURN END FUNCTION
Leave a comment:
-
-
try this which is quick
Thanks
John
Leave a comment:
-
-
Why should it take so long to open the file? The sequence file is one long string, without any CR/LF's. Is it a mistake to have such a long string?
Code:OPEN "c:\lambda2.txt" FOR BINARY AS #2 GET$ #2, LOF(#2), LambdaSeq$ 'be sure no $CRLF (or anything else) is at the end of the line
Leave a comment:
-
-
Scaling up restriction mapping
I ran my pattern-matching program (post 103, http://www.powerbasic.com/support/pb...&postcount=103) using a much larger sequence, a bacterial genome almost as large as the one Gösta was using. Got
Number of enzymes in list is 292
Took 113.40909375 to open the file "K12_sequence.txt"
Length of sequence is 4639785
Took 1.13378125 sec to put sequence into an array
Search took 145.93296875 sec
The entire search took 146" to run, with the speed improvement Gösta suggested and the one I made as well. I presume that this time could be improved with refinements I'm not competent to make, such as pointers or ASM. Also Gösta seems to get times that are much shorter than mine for the other tasks; maybe it would be shorter for this too.
JohnLast edited by John Little; 3 Jan 2010, 03:06 PM.
Leave a comment:
-
-
Originally posted by John Little View PostJohn M., I looked at your code. I tried to run it but got a (newbie?) problem, namely it wouldn't compile (in PB/CC 4.04), giving an error 519 message ("Missing declaration: MAINLINE) at the third line in
Code:FUNCTION PBMAIN () AS LONG LOCAL lRet, i AS LONG lRet = MainLine(0) END FUNCTION
Not to sound pedantic, but "Y" is "C or T", "R" is "A or G"...sorry about that.
John
The easiest change in my code would be to add any number of possibilities to any given wildcard. Just add those characters to the string in Repl(), and pad any shorter strings in other slots using the "-" character.
Also, regarding: "Y" is "C or T", "R" is "A or G"...
Not to worry, the proper replacement chars are easily changed in the Repl() string for that wildcard.
It's a little more effort to add another wildcard, but I had shown the way by adding a "J" wildcard, then commenting out that code.
As it stands, my code will create the proper strings from a spec that contains any number of wildcard chars, in any positions.
If you're interested in having me update the code to be fully self-aware, I can work on it this week. Let me know...
-JohnM
Leave a comment:
-
-
in post #58, I provided my routine for building all the specific strings from any number of wildcards... did you try it out?
-JohnM.
Code:FUNCTION PBMAIN () AS LONG LOCAL lRet, i AS LONG lRet = MainLine(0) END FUNCTION
Not to sound pedantic, but "Y" is "C or T", "R" is "A or G"...sorry about that.
John
Leave a comment:
-
-
in post #58, I provided my routine for building all the specific strings from any number of wildcards... did you try it out?
-JohnM.
Originally Posted by Gösta H. Lovgren-2
Just curious, JWL. Did you test it? If so did it make any significant difference?Code:FOR j&=1 TO NumberOfEnzymes& Hits&=0:Query$=Enzymes$(j&,2) LengthQuery&=LEN(Query$):LengthSearched&=LengthLambda&-LengthQuery&+1 'used repeatedly in interior loops REDIM PositionArray$(1 TO LengthQuery&):REDIM IsAmbiguous&(1 TO LengthQuery&) Ambig&=0:IF INSTR(Query$,ANY "BDHKMNRSVWY") THEN Ambig&=1 SELECT CASE Ambig& CASE 0 'Ones without ambiguous positions INCR NonambigEnzymes& FOR k&=1 TO LengthQuery& PositionArray$(k&)=MID$(Query$,k&,1) NEXT k& FOR k&=0 TO LengthSearched&-1 ' ####### changed lower and upper bounds from 1 and LengthSearched Match&=1 FOR l&=1 TO LengthQuery& IF Lambda$(k&+l&)<>PositionArray$(l&) THEN Match&=0:EXIT FOR 'Before it was Lambda$(k&+l&-1) NEXT l& IF Match&=1 THEN INCR Hits& NEXT k& Enzymes$(j&,3)=STR$(Hits&) CASE 1 'Those with ambiguous positions GOSUB AmbiguousArray FOR k&=0 TO LengthSearched&-1 'changed bounds as above Match&=1 FOR l&=1 TO LengthQuery& SELECT CASE IsAmbiguous&(l&) CASE 0 IF Lambda$(k&+l&)<>PositionArray$(l&) THEN Match&=0:EXIT FOR 'Before it was Lambda$(k&+l&-1) CASE 1 IF TALLY(Lambda$(k&+l&),ANY PositionArray$(l&))=0 THEN Match&=0:EXIT FOR 'Old was TALLY(Lambda$(k&+l&-1) END SELECT NEXT l& IF Match&=1 THEN INCR Hits& ':print Enzymes$(j&,1),Enzymes$(j&,2),mid$(LambdaSeq$,k&,len(Query$)),k&:waitkey$ 'Just to test that it's finding the proper sites NEXT k& Enzymes$(j&,3)=STR$(Hits&) END SELECT NEXT j&
John
Leave a comment:
-
-
Originally posted by Gösta H. Lovgren-2 View PostOne speed tip Paul pointed out earlier was...
That way the TO limits don't have to be calculated each iteration, just once (or however many). Dunno if it would be significant in this case or not. You'd have to test it.
John Gleason (IIRC) has a great string packing algo that's extremely fast, we used putting a word connection puzzle to gether a couple years ago. (ex. TRAPS to CAUGHT). Dunno if it would be appropriate here though. Probably not.
BTW here's the link to the Word Connection thread (http://www.powerbasic.com/support/pbforums/showthread.php?t=36427&highlight=word+puzzle). That one was fun too.
========================================================
Learn to limit yourself,
to content yourself with some definite thing,
some definite work;
dare to be what you are,
learn to resign with a good grace all that you are not,
to believe in your own individuality.
Henri-Frédéric Amiel
========================================================
Leave a comment:
-
-
'John L. In post 83 (http://www.powerbasic.com/support/pb...t=42020&page=5) you posted CC code that you said showed wide variances in timing between runs.
'Here Is the identical Code In PBWin that Show the Exact same times (between .70 & .71) over several runs.
'
'Length Of sequence Is 48502
' 33 1151 2319 8490 10111 13102 16909 22852 22871 23808 23828 24228 24578 25485 27252 29015 29993 31085 33811 34185 42477 44727 45741 47564
'Found 24 GAANNNNTTC sites
'Searching 4.85 Mb 100x For RI sites Using Tally took .70259375 sec
'Searching 4.85 Mb 100x For RI sites Using InStr took .71746875 sec
'Code:#Compile Exe '#DIM ALL Function PBMain () As Long Open "lambda.txt" For Input As #1 Line Input #1,LambdaSeq$ ' Print "Length of sequence is ";Len(LambdaSeq$) msg$ = "Length of sequence is " & Str$(Len(LambdaSeq$)) & $CrLf i&=1 While i& > 0 'First (as a bonus) test search strategy for GAANNNNTTC using slight variation on Gosta's code i& = InStr(i&, LambdaSeq$, "GAA") If i& Then ' If Mid$(LambdaSeq$,i&+7,3)="TTC" Then Print i&;: Incr NumberGAAN4TTC& 'print mid$(SearchString$,i&,10) If Mid$(LambdaSeq$,i&+7,3)="TTC" Then msg$ = msg$ & Str$(i&): Incr NumberGAAN4TTC& 'print mid$(SearchString$,i&,10) Incr i& 'move over so no inf loop End If Wend 'Print:Print "Found ";NumberGAAN4TTC&; " GAANNNNTTC sites" ' Print msg$ = msg$ & $CrLf msg$ = $CrLf & msg$ & "Found "& Str$(NumberGAAN4TTC&) & " GAANNNNTTC sites" & $CrLf TimesRepeated&=100 For k&=1 To TimesRepeated& 'Make end-to-end concatenation of 100 copies of lambda sequence giving 4.8502 Mb length search string SearchString$=SearchString$+LambdaSeq$ Next k& 'PRINT "RI sites are at "; tim!=Timer For q&=1 To 100 RIsites& = Tally(SearchString$, "GAATTC") 'search for matches Next q& 'Print "Searching 4.85 Mb 100x for RI sites using TALLY took ";Timer-tim!;" sec" msg$ = msg$ & "Searching 4.85 Mb 100x for RI sites using TALLY took " & Str$(Timer-tim!) & " sec" & $CrLf tim!=Timer For q&=1 To 100 i&=1 While i& > 0 i& = InStr(i&, SearchString$, "GAATTC") If i& Then Incr i& 'move over so no inf loop End If Wend Next q& ' Print ' Print "Searching 4.85 Mb 100x for RI sites using INSTR took ";Timer - tim!;" sec" msg$ = $CrLf & msg$ & "Searching 4.85 Mb 100x for RI sites using INSTR took " & Str$(Timer - tim!) & " sec" ' WAITKEY$ ? msg$ ClipBoard Set Text msg$ End Function '
================================
An open foe may prove a curse;
but a pretended friend is worse.
Ben Franklin
================================Last edited by Gösta H. Lovgren-2; 3 Jan 2010, 08:41 AM.
Leave a comment:
-
-
from post #98:
This version (in PB/CC 4.04) allows the user to choose how many N's are allowed in the target string
problem is to find all the combinations of letters at the various ambiguous positions. It’s not very different from finding all the first numbers for a system like base 2 3 or 4
-JohnM.
Leave a comment:
-
-
One speed tip Paul pointed out earlier wasCode:FOR k&=1 TO LEN(Query$) PositionArray$(k&)=MID$(Query$,k&,1) NEXT k& FOR k&=1 TO LengthLambda&-LEN(Query$) Match&=1 FOR l&=1 TO LEN(Query$) IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR NEXT l&
Code:[B] [COLOR=red] len_Q = Len(Query$)[/COLOR][/B] [B][COLOR=red] len_ldq = Len(LengthLambda&-Len(Query$))[/COLOR][/B] For k&=1 To [COLOR=red]len_Q[/COLOR] PositionArray$(k&)=Mid$(Query$,k&,1) Next k& For k&=1 To [COLOR=red]len_ldq[/COLOR] Match&=1 For l&=1 To Len(Query$) If Lambda$(k&+l&-1)<>PositionArray$(l&) Then Match&=0:Exit For Next l&
Using Pointers will give a HUGE speed increase, maybe 5 times faster. I haven't used pointers much myself so I can't "point" the way. They're not difficult, just something I haven't had much need for. Maybe I'll play with them down the road with your unknown solver algo.
John Gleason (IIRC) has a great string packing algo that's extremely fast, we used putting a word connection puzzle to gether a couple years ago. (ex. TRAPS to CAUGHT). Dunno if it would be appropriate here though. Probably not.
=====================================================
"As far as the laws of mathematics refer to reality,
they are not certain;
and as far as they are certain,
they do not refer to reality."
Albert Einstein
=====================================================Last edited by Gösta H. Lovgren-2; 2 Jan 2010, 07:28 PM.
Leave a comment:
-
-
General version of search strategy using arrays
To review, the task we have been trying to solve is to search a DNA sequence (the "search string") for a set of patterns given in a list of query sequences. (In molecular biology terms, the goal is to find restriction sites in a long DNA sequence.) This task is complicated by the presence of ambiguous specifiers in the query sequences (such as "R" standing for "A or G"). Although we found solutions for this problem, they were not general and it seemed that special strategies were needed for special cases. There's no guarantee that new special cases won't arise, so I wanted to find a general solution. Here it is. It solves this problem, and perhaps this approach might be extended to other cases such as Peter's original problem (my impression is that something similar was the starting point in solving that, though the discussion quickly moved to things I mostly couldn't follow such as pointers and assembly code) .
I suggested above using arrays of the query and search strings, and having the actual search be position-by-position, instead of using the entire query string, in the formCode:FOR i&=1 to LEN(SearchString$)-LEN(QueryString$) FOR j&=1 to LEN(QueryString$) IF QueryPosition(j&) <> SearchStringPosition(i&+j&-1) THEN NoMatch&=1:EXIT FOR NEXT j& NEXT i&
Code:#COMPILE EXE '#DIM ALL FUNCTION PBMAIN () AS LONG ' This is JWLPatternMatch1.bas, 1/2/10, purpose to use arrays to find matches. First make it work for unambiguous query strings (like GAATTC). OPEN "TypeII_REs--3.txt" FOR INPUT AS #1 FILESCAN #1, RECORDS TO x&:PRINT "Total lines are ";x& DIM FileLines$(1 TO x&) LINE INPUT #1, FileLines$() FOR j&=1 TO x& 'Count the number of enzymes IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k& NEXT j& NumberOfEnzymes&=k& PRINT "Number of enzymes is ";NumberOfEnzymes& DIM Enzymes$(1 TO k&,1 TO 3):k&=0 '1 is enzyme name, 2 is sequence, 3 PROB TEMP is number of sites in SearchString FOR j&=1 TO x& IF LEFT$(FileLines$(j&),1)<>"'" THEN INCR k&:Enzymes$(k&,1)=EXTRACT$(FileLines$(j&)," "):Enzymes$(k&,2)=TRIM$(MID$(FileLines$(j&),31)) NEXT j& FOR b&=1 TO k& SearchString$=Enzymes$(b&,2) 'PRINT Enzymes$(b&,1),SearchString$ ',c$, NEXT b& 'Now deal with a test sequence OPEN "lambda.txt" FOR INPUT AS #2 LINE INPUT #2,LambdaSeq$ LengthLambda&=LEN(LambdaSeq$) PRINT "Length of sequence is ";LengthLambda& DIM Lambda$(1 TO LengthLambda&) tim!=TIMER FOR j&=1 TO LengthLambda& Lambda$(j&)=MID$(LambdaSeq$,j&,1) NEXT j& tim!=TIMER NonambigEnzymes&=0 FOR j&=1 TO NumberOfEnzymes& Hits&=0 Query$=Enzymes$(j&,2) REDIM PositionArray$(1 TO LEN(Query$)) REDIM IsAmbiguous&(1 TO LEN(Query$)) Ambig&=0 IF INSTR(Query$,ANY "BDHKMNRSVWY") THEN Ambig&=1 SELECT CASE Ambig& CASE 0 'Ones without ambiguous positions INCR NonambigEnzymes& FOR k&=1 TO LEN(Query$) PositionArray$(k&)=MID$(Query$,k&,1) NEXT k& FOR k&=1 TO LengthLambda&-LEN(Query$) Match&=1 FOR l&=1 TO LEN(Query$) IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR NEXT l& IF Match&=1 THEN INCR Hits& NEXT k& Enzymes$(j&,3)=STR$(Hits&) CASE 1 'Those with ambiguous positions GOSUB AmbiguousArray FOR k&=1 TO LengthLambda&-LEN(Query$) Match&=1 FOR l&=1 TO LEN(Query$) SELECT CASE IsAmbiguous&(l&) CASE 0 IF Lambda$(k&+l&-1)<>PositionArray$(l&) THEN Match&=0:EXIT FOR CASE 1 IF TALLY(Lambda$(k&+l&-1),ANY PositionArray$(l&))=0 THEN Match&=0:EXIT FOR END SELECT NEXT l& IF Match&=1 THEN INCR Hits& 'print Enzymes$(j&,1),Enzymes$(j&,2),mid$(LambdaSeq$,k&,len(Query$)):waitkey$ 'Just to test that it's finding the proper sites NEXT k& Enzymes$(j&,3)=STR$(Hits&) END SELECT NEXT j& PRINT "Search took ";TIMER-tim!;" sec" PRINT PRINT "Hit any key to show # of hits":WAITKEY$ FOR j&=1 TO NumberOfEnzymes& PRINT Enzymes$(j&,1),Enzymes$(j&,2),"Hits: ";Enzymes$(j&,3) NEXT j& WAITKEY$ EXIT FUNCTION AmbiguousArray: FOR k&=1 TO LEN(Query$) Test$=MID$(Query$,k&,1) b&=INSTR(Test$, ANY "ACGT") IF b& THEN PositionArray$(k&)=Test$:IsAmbiguous&(k&)=0 IF b&=0 THEN SELECT CASE Test$ CASE "B":AmbiguousID$="CGT" CASE "D":AmbiguousID$="AGT" CASE "H":AmbiguousID$="ACT" CASE "K":AmbiguousID$="GT" CASE "M":AmbiguousID$="AC" CASE "N":AmbiguousID$="ACGT" CASE "R":AmbiguousID$="AG" CASE "S":AmbiguousID$="CG" CASE "V":AmbiguousID$="ACG" CASE "W":AmbiguousID$="AT" CASE "Y":AmbiguousID$="CT" END SELECT PositionArray$(k&)=AmbiguousID$ IsAmbiguous&(k&)=1 END IF NEXT k& RETURN END FUNCTION
What I like/prefer about it is that it is completely general--it will detect any search string (i.e. it doesn't have the limitations I complained about before). E.g. I put NNNNNNNNNNNNNNNNNNNNNN into the query string file and it counted all the occurrences (i.e. all the positions minus length of N string)
I'd be surprised if the speed merchants couldn't figure out ways of making the central loops faster.
See my previous post for speculations about whether this general approach has already been used in the previous thread or the early part of this one for Peter's problem. If so, maybe some of the speed fixes for that case could be applied here.
Cheers
JohnAttached FilesLast edited by John Little; 2 Jan 2010, 08:03 PM.
Leave a comment:
-
-
Just ran it again useing for every possibility listed. The above list only checked for (2).
(1) = Target Matches in Sequence
(2) = Target Matches in Complemented Sequence
(3) = Complemented Target Matches in Sequence
(4) = Complemented Target Matches in Complemented Sequence
(5) = Target Matches in Complemented and Reversed Sequence
(6) = Complimented and Reversed Target Matches in Complemented and Reversed Sequence
(7) = Non ACGT letters in Target
Code:[FONT=Courier New]Count Name Target (1) (2) (3) (4) (5) (6) (7) 1} AarI CACCTGC 3 4 4 3 9 3 2} .AarI GCAGGTG 12 8 8 12 12 12 3} AatII GACGTC 10 28 28 10 10 10 4} AbsI CCTCGAGG 0 0 0 0 0 0 5} AccI GTAGAC 3 15 15 3 1 3 6} .AccI GTATAC 6 22 22 6 4 6 7} .AccI GTCGAC 8 37 37 8 6 8 8} .AccI GTCTAC 9 57 57 9 9 9 9} AciI CCGC 223 296 296 223 279 223 10} .AciI GCGG 502 509 509 502 502 502 ... 29,114} TGTACT TatI 8 29,115} TTAA MseI 195 29,116} TTAATTAA PacI 0 29,117} TTATAA PsiI 12 29,118} TTCAA AgsI 53 29,119} TTCAT TspDTI 76 29,120} TTCGAA BstBI 7 29,121} TTGAA AgsI 53 29,122} TTTAAA DraI 13 16,857 Duplicates found ************** ************** ************** ************** ************** Sequence took 16.39 seconds to search for every possibility ************** ************** ************** [/FONT]************** **************
=====================================================
"I myself have never been able to find out precisely
what feminism is;
I only know that people call me a feminist
whenever I express sentiments that differentiate me
from a doormat or a prostitute."
Rebecca West (1913)
=====================================================
Leave a comment:
-
Leave a comment: