Announcement

Collapse
No announcement yet.

DNA Part Deux

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    #21
    To get a bit more specific about the DNA pattern-matching issue:

    John M. asked

    2. Orientation
    I don't know how important this is, but I wonder if there is a right-to-left orientation that needs to be accounted for.

    Do we need to run the comparison against the target in both directions before we can determine a match? Is there a "directional" process that needs to be run?
    This is an important issue. There are several aspects. The first is best understood by knowing about the structure of DNA (see http://en.wikipedia.org/wiki/DNA for more detail). DNA contains two parts--a "backbone", which always has the same chemical structure, and can be thought of as a scaffold, and the "bases", which are attached to the backbone. The backbone consists of repeating units, always with the same structure, alternating a sugar and a phosphate group; the bases are attached to the sugars and hang off the side of the backbone. Look at the Wikipedia article, I can't make the fonts align to give a simple example...

    The orientation of a sequence can't simply be reversed, because the structure of DNA is not the same left-to-right as right-to-left. We say that the DNA strand has "chemical polarity", and this is indicated by the terms " 5' " and " 3' ". (These terms refer to the chemical structure of the sugar). DNA sequences are by convention written with the 5' end on the left and the 3' end on the right.

    Second, DNA actually contains two copies of the information, carried in two different strands that physically coil around one another (hence, the "double helix"). The polarity of the two strands is opposite, best appreciated by a simple example below; we call it "anti-parallel".

    Third, the two strands carry essentially the same information; knowing the sequence of one gives you the sequence of the other, using two simple rules: A pairs with T, G pairs with C. So a short segment of a longer double-stranded DNA might have the sequence

    5' GGGTACCGATTA 3'
    3' CCCATGGCTAAT 5'

    Hence, knowing the sequence of the top strand will give you that of the bottom one. Writing the sequence of the bottom one with the usual convention gives 5' TAATCGGTACCC 3'. We call this sequence the "complement" of the other strand, and call T the complement of A, G the complement of C.

    [For the biologist, and for the living cell, having two copies of the sequence is critical. It allows the information to be copied; this can be envisioned by pulling the two strands apart (they are held together by weak interactions called hydrogen bonds), and using the pairing rules to make the complement of each strand. Then one has two double strands with the same sequence. This is termed "DNA replication". In addition, having two copies of the info is important, because if one copy gets damaged (e.g. by carcinogens) the other strand retains the info to repair the first one. This is called "DNA repair", not surprisingly.]

    How does this bear on the pattern-matching issue John M raised? For two common tasks, the answer is somewhat different for each.

    The first task would be to look for the presence of a short string in a long DNA. This is the goal of a BLAST search as I referenced above. The long DNA might for instance be the entire DNA sequence of an organism. You might even have a short segment of DNA sequence and want to know where it came from, in which case (as in a BLAST search) you compare it to an entire database of sequences. In this case, you need to look for a match to both strands. The simplest way to do this is to take the short string, compare it with the target sequence, and then compare the complement of the short string to the target. The BLAST search does this automatically.

    If on the other hand you want to do a sequence alignment--that is, to align two long sequences to each other, such as comparing two closely-related organisms, call them A and B--it should suffice as a first step to determine which strand of the A sequence corresponds to the "top" strand of the B sequence. This can be done by matching a short segment of A to the B sequence; if no match, try the complement of the A segment. Once you know how the two sequences correspond, they should usually be aligned over very long lengths.

    Of course, biology isn't that simple. It's often the case that there are small or large insertions or deletions of sequence in B relative to A (these are termed "indels"). Then the gap issue discussed earlier in the thread comes up. Sequence alignment programs have built in the capability of putting in indels. Occasionally it occurs that one segment of the sequence is inverted in organism A relative to B, called (surprise!) an "inversion". In this case, the sequences align within the inversion, or outside it, then the alignment abruptly breaks down.

    John

    Comment


      #22
      John L.

      Fascinating stuff. Really. From a programming pov though what's really needed here (I think) is real life sequences to look through AND the requirements for a match.

      For example a file of actual dna sequences (however long or "unruly") and the rules for a match or near match. A test, if you will. With Peter's examples we were using randomly generated data. A much better test would be with real data. (Some algos I tested worked fine with generated data but didn't seem to correlate when I tried them on real data I had dl'ed)

      Say an actual sequence and a file of 10/100/1000 other sequences to find how many matches/siblings/relatives are in the file. And how long it takes to do it. (ala Hemmings/Jefferson)

      As has been pretty clearly demonstrated (I think) there are some pretty good programmers extant here and it just may be that someone will come up with a faster better way than is being used now ...

      ==================================
      "Mr. Wagner has beautiful moments
      but bad quarters of an hour."
      Gioacchino Rossini (1792-1868)
      ==================================
      It's a pretty day. I hope you enjoy it.

      Gösta

      JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
      LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

      Comment


        #23
        Hello Gösta,

        I hope we're not talking at cross-purposes here. There is basically an unlimited supply of DNA sequences. What you want to do with them is up to you and to the folks who started this thread. Bioinformatics types and end users like myself do many different things with sequence analysis, alignment, comparisons, etc. I've alluded to a few of these things in my posts, and I've obviously only scratched the surface. For sure a lot of talented people have worked on defining the goals and approaches (and I presume the coding itself) for decades, so it's important not to reinvent the wheel. E.g. I don't know if the code the thread started with is the best out there. David Mount's book can give a good flavor of the way people think about the overall topic. If good programmers can make the algorithms more efficient, that's great!

        Finding DNA sequences: The public database called GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) contains >10^8 DNA sequences. All the sequences are in a uniform format so that the programs that use them can access them. Perhaps the best way to learn about it is to nose around in the database a bit. The link gives a place to start. In the "search" window at the top, change "Entrez" to "nucleotide" then type in the name of any protein you like. I'd suggest "myosin"; it's a big protein (one of the most important proteins in muscle, FWIW) so the gene is big too.You'll get a list of hundreds or thousands of myosin genes from a wide range or organisms. If you then click on any one of them you'll get the sequence of the gene (at the end of a lot of annotation). At the right (for the ones I've just clicked on anyway) is a link "Run BLAST". Click on that and you get alignments to a bunch of other similar (or identical) sequences.

        For example you can get to a particular human myosin sequence at http://www.ncbi.nlm.nih.gov/nuccore/...uence_RVDocSum

        Then click the "Run BLAST" button. This gives a lot of similar and not-so-similar genes, in pairwise alignments with one another. I presume (but didn't check) that these are myosin genes from humans (other sequenced individuals, for instance) and from closely and less-closely-related organisms (e.g. Pan troglodytes is chimp).

        Not sure if this is what you're looking for but it's one place to get lots of sequences anyway.

        From a programming point of view, I don't know if the programs used in BLAST etc are more optimized than the one that this thread started with. I think (but I'm not sure) that the BLAST search I just suggested is comparing the particular myosin sequence to all the other sequences in the database. You can probably tell from the search parameters. If so, it seems pretty fast to me. Also you're sharing a public resource with many others so a lot of the wait is likely due to being in a queue. I also don't know how to get access to source code for these programs. I just use them! Perhaps you can poke around on the site and find source code. I did notice that the link I posted for ClustalW had a link for source code for that. ClustalW is a multiple sequence alignment program; I think it only does proteins but I'm not sure.

        Good luck
        John

        Comment


          #24
          Also, if you're willing to assume that the alignments that BLAST gives are correct, it gives you a check on the accuracy of any code that you want to test.

          John

          Comment


            #25
            Source code for BLAST

            Source code for BLAST can be accessed at http://www.ncbi.nlm.nih.gov/BLAST/developer.shtml

            Here's a link that gives a sample of how sequence alignments can be used, mostly in ways that one couldn't readily codify since they involve a lot of specialized knowledge about the biology of the system. This is the only paper I've published that heavily involves bioinformatics. http://www3.interscience.wiley.com/j...TRY=1&SRETRY=0

            Click on "PDF", it should give the main journal article (I think it's public access); "Supporting Information" is certainly public access and contains more of the actual sequence alignments.

            John

            Comment


              #26
              > A much better test would be with real data

              How true.

              I'd like to have a nickel for every time a user's first "production" data does not look at all like what he sent me as "representative test data."
              Michael Mattias
              Tal Systems (retired)
              Port Washington WI USA
              [email protected]
              http://www.talsystems.com

              Comment


                #27
                John,

                I looked at the link. It's the same one Peter pointed to and I had dl'ed a few sequences from there before but have no idea what to look for. I tried the BLAST as you suggested and it certainly throws out a slug of data pretty darn fast but I have no idea what it's doing or what the results mean. It's beyond mere complexity to a layman.

                What I'm looking for is what I said above: a series of real sequences to compare using rules to find/reject (close) matches.

                For example I dl'ed about 150 seqs (each 255 ltrs long) from another site (I think it was a rare pediatric virus) and played around with them. One scenario I ran it counted every letter, sorted, compared and printed the results for each seq to a file in under a tenth of a second. Now I know it has no value to a dna application but does show how fast stuff can be done.

                Give us (or point to) some data and tell us what the rules are for matching. Just maybe something new will pop up.

                =====================================
                "An inconvenience is only
                an adventure wrongly considered;
                An adventure is an inconvenience
                rightly considered."
                Gilbert Keith Chesterton (1874-1936)
                =====================================
                It's a pretty day. I hope you enjoy it.

                Gösta

                JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                Comment


                  #28
                  Gösta,

                  What I'm looking for is what I said above: a series of real sequences to compare using rules to find/reject (close) matches.
                  The rules? I'm not sure what you're after here. Is the goal to make the best alignment, or to find only perfect matches? If one wants a perfect match, then (unless I'm missing something) each position has to match. Or, the Query sequence (see below) has to lie within a longer test sequence and match that perfectly. In the real world, one usually wouldn't want to reject matches that are imperfect. Instead, you would want to make an alignment between your sequence and another one. That's what BLAST does. It shows you where the matches and mismatches are, and this info can be used in a lot of different ways. I wouldn't presume to dictate how they are used; perhaps Peter can say what his goals are.

                  How one makes the alignment can get pretty complex, except for VERY similar sequences, and I'm not the right person to explain the complexities. David Mount's book is a good source. A lot of the problems come in when there are gaps in one sequence relative to another (see also at the end of the post).

                  Getting test sequences: I can at least direct you to a useful set of test sequences, which was my intent before. I'll try again.

                  The output of the BLAST search is a bunch of DNA sequences that resemble to one extent or another the sequence that is your input (called the Query). [Biologically, these DNA sequences carry the information to make a protein called myosin. Different hits in the list are myosin genes from different sources.] I gave a specific example so that you would be able to get a set of related sequences. Then you don't have to understand the biology, you just get the sequences and play with them. Here's how to get the sequences:

                  The link to the myosin gene I posted before is


                  On this page, the sequence is at the end, starting in the line after "ORIGIN". You'll have to strip out all the numbers at the beginning of the lines. Just copy and paste the sequence to a text file and write a utility to do this.

                  The output of BLAST is in several parts. The first is a schematic diagram of the alignment.

                  Next is a list of the sequences that match your Query, in a section called Descriptions. If you click on the link for each of these, you get a file like the one linked above, and you can extract the sequences the same way. This will give you as many sequences as you have the stamina to go after.

                  One column in this list is the e-value; a low number means it's almost certainly a match, and all these are. The next column is the percent identity, and for your purposes it would be useful to pick some that are close (>98%) and some not so close (say 90% and 95%).

                  Following this list is a long set of pairwise alignments between the Query sequence and each of the target sequences. I presume that these alignments resemble what you and others are trying to make. As I said before, you can check the alignments your code makes against these. You can get some idea of what the rules are by looking at the most closely-related alignments; it gets hairier when sequences are not so similar, then intelligent decisions need to be made about how to put in gaps etc. That's what bioinformatics types do.

                  There is of course nothing magical about the myosin sequences. They are just test cases, and the sequences are longer than most DNA sequences of interest. I could suggest other sequences if this one is too long.

                  I hope this helps
                  John
                  Last edited by John Little; 18 Dec 2009, 01:52 PM. Reason: Clarification

                  Comment


                    #29
                    Rules for searches involving ambiguities

                    Another common pattern-matching goal (reasons given below) involves a different set of rules; these involve a set of characters that have been given already in several posts, such as in Gösta's report in post #18. Examples are W=A or T (mnemonic is "Weak"). In this case, usually there is a short search string such as GTYRAC (Y=C or T, R=G or A) and the goal is to find sequences matching this in a long DNA sequence (call it the "target"). Not to belabor the obvious, but the goal would be to find in the target sequence all the instances of the following sequences: GTTAAC, GTCAAC, GTTGAC and GTCGAC. So the output would be a long sequence line containing the target sequence, and above it would be the search sequence, such as
                    GTCAAC GTTGAC
                    GCCCGGTAAGCCTTAATTGTCAACACGGGTTGACAATGGGG

                    {Newbie Q: How do I get the characters to line up? I mean the upper line to be scooted over so it's above the match in the lower line...}

                    In the most general case, one would also need to use the complement of the search string to get all the matches. E.g. if the search string is GAGCA its complement would be TGCTC (explained in my first post). In the present example this is unnecessary because the search string is palindromic--its complement is the same as the search string itself. There are biological reasons for that.

                    So the rules, for this example, would be Y in search string matches C or T in target, R in search string matches G or A in target.

                    <Aficionado section>
                    Here's the most common use of this pattern-matching procedure:
                    Recombinant DNA work relies heavily on the use of "restriction enzymes" (see for instance http://en.wikipedia.org/wiki/Restriction_enzyme) to cut DNA molecules at specific sites. These enzymes search for these sites and cut the DNA there. We need to know where these enzymes cut before doing the actual experiment. This is a lot easier if we know the sequence, of course. Then we do a search of the type listed above for the specific sequence that our enzyme recognizes; in this case it's the enzyme HincII. Or we make a "restriction map" showing the location of all the cut sites for all known enzymes or a useful subset of them.
                    A lot of restriction enzymes cut palindromic sequences like GAATTC or GTYRAC; the reason is that the enzyme has two parts, or subunits, and one part recognizes half the recognition sequence. But not all enzymes work this way; in those cases, the complement of the recognition sequence must also be used.
                    I can explain the meanings of the mnemonics in Gösta's table if anyone is interested.
                    <End aficionado section>

                    More detail than most want!
                    John

                    Comment


                      #30
                      Originally posted by John Little View Post
                      GTCAAC GTTGAC
                      GCCCGGTAAGCCTTAATTGTCAACACGGGTTGACAATGGGG

                      {Newbie Q: How do I get the characters to line up? I mean the upper line to be scooted over so it's above the match in the lower line...}
                      Enclose the section to be lined up in [ code ] [ / code ] brackets (no spaces used) and change the font of the enclosed to Courier. At least that's the only way I've found that works.

                      Code:
                      [FONT=Courier New]                  GTCAAC    GTTGAC[/FONT]
                      [FONT=Courier New]GCCCGGTAAGCCTTAATTGTCAACACGGGTTGACAATGGGG[/FONT]
                      As for the rest of your post(s), that's gonna take me awhile to digest {burp} ... Excuse me.

                      I'll try the "GTYRAC (Y=C or T, R=G or A)" thing tonight or tomorrow in some of the long sequences I have (sounds like it's gonna be a great snowday here in Jersey tomorrow (2' in some predictions. We have't had a total of 2' altogether in the last 5 years or more I bet.)

                      ================================================
                      Conservative, n:
                      A statesman who is enamored of existing evils,
                      as distinguished from the
                      Liberal who wishes to replace them with others.
                      Ambrose Bierce (1842-1914)
                      ================================================
                      Last edited by Gösta H. Lovgren-2; 18 Dec 2009, 08:30 PM.
                      It's a pretty day. I hope you enjoy it.

                      Gösta

                      JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                      LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                      Comment


                        #31
                        With a blizzard in my eyes and snow up to my ears (merely a light dusting in WI, I'm sure) I fooled with the GTYRAC problem, John L.

                        Here are the results:

                        4 Sequences with a total of 9,282,664 letters searched
                        Found 8,051 Matches
                        to GTYRAC (Y=C or T, R=G or A)
                        in 0.34 seconds

                        and here's the code (sans overhead - data, etc.)
                        Code:
                          
                        '
                        ' *******************************************************
                        '
                        Sub GHL_GTYRAC
                        '  Title As String * %Max_Path
                        '  Source As String * %Max_Path
                        '  Nam As String * %Max_Path
                        '  Sequence As String * 1000 * 1000 * 10 '10 million should be plenty
                        '  Seq_Length As Long 
                        'Sequences()
                          tmr! = Timer
                          z_Get_Long_Sequence_Data_Array 
                         
                         Seqs_to_Look_At = 0
                         Srch$ = "GTYRAC"
                         ReDim possibles$(1 To 100) 'way more than will be needed
                                                    '"GT Y R AC (Y=C or T, R=G or A)" 
                         Incr ctr: possibles$(ctr) = "GT C G AC"
                         Incr ctr: possibles$(ctr) = "GT C A AC"
                         Incr ctr: possibles$(ctr) = "GT T G AC"
                         Incr ctr: possibles$(ctr) = "GT T A AC"
                         ReDim Preserve possibles$(1 To ctr) ' just what we need
                         Num_Possibles = ctr
                         'Removes spaces used for clarity
                         For n = 1 To Num_Possibles   
                            s$ =  possibles$(n)
                            possibles$(n) = Remove$(s$, " ")
                         Next n
                        ' 
                        ' Main loop through sequences
                         While Seqs_to_Look_At < UBound(Sequences())
                           Incr Seqs_to_Look_At
                           Seq$ = Sequences(Seqs_to_Look_At).Sequence
                           Ltrs_total = Ltrs_total + Sequences(Seqs_to_Look_At).Seq_Length
                          'now look for matches inside sequence
                           For ctr = 1 To Num_Possibles   
                              i = 1
                              While i > 0
                                 i = InStr(i, Seq$, possibles$(ctr))
                                 If i Then 
                                   Incr matches
                                   Incr i 'move over so no inf loop
                                 End If  
                              Wend   
                            Next ctr
                         Wend                           
                         '
                         s$ = Using$("# Sequences searched with a total of #, letters",  UBound(Sequences()), Ltrs_total) & $CrLf & _
                            Using$(" Found #, Matches", Matches) & $CrLf & _
                            "to GT Y R AC (Y=C or T, R=G or A) " & $CrLf & _
                            Using$("  in #.## seconds",   Timer - tmr!)
                           ClipBoard Set Text s$, rsult
                         
                         
                         ?s$,,Using$("#.## ", rsult) & FuncName$
                        End Sub
                        '
                        ' *******************************************************
                        '
                        No attempt was made at optimization. Whe the speed racers look at it, they'll laugh (or only chuckle if they are in a charitable mood) but, nonetheless, I think that's pretty good time (assuming my logic is correct, which, of course, is always suspect, and justifiably so)..
                        Last edited by Gösta H. Lovgren-2; 19 Dec 2009, 11:00 PM.
                        It's a pretty day. I hope you enjoy it.

                        Gösta

                        JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                        LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                        Comment


                          #32
                          Hi Gösta,

                          The number of hits seems about right; if you assume that the sequence is random and that %GC (G+C) is 50% (typical for many but certainly not all DNA's), you'd expect 1/(4^4*2^2) or about 1 site/1000 positions, about what you got.

                          I have no idea how fast is fast. You might look at some on-line programs that do this. I Googled "Restriction mapper" and the first hit was http://www.restrictionmapper.org . It has a link to "code". The FAQ says it's a Perl script.

                          Fourth hit was http://www.ncbi.nlm.nih.gov/Class/NA.../DNA/dna7.html. It's on the same NCBI web site as GenBank, and gives links to other programs with these capabilities.

                          These might give you some way to figure out how fast these programs are, and what their capabilities are. At least some of them (and the one I've used for years) let you use a list of test sequences, up to a couple of hundred different test sequences anyway. The first link allows this, for example.

                          Perhaps a good benchmark for speed (not for this task but the general pattern-matching task) is the performance of BLAST. Is it clear from my previous posts what BLAST does? It checks for matches between a Query sequence (like the myosin one you used) and all the sequences in a very large database, then makes alignments with the closest ones. There are probably intermediate steps too. The database I used is called "nr/nt"; according to http://www.ncbi.nlm.nih.gov/BLAST/bl...tide_databases , this contains all the sequences in GenBank and several other public databases, so it's probably at least 10^8 sequences. It says "(excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS). "; these are specialized databases that aren't too relevant to the task. So if it's 10^8 sequences (or about 10^11 bases, according to the GenBank link I posted), and it has to query each one for a match then do the alignment for the winners, that's, as you say, mighty fast. I doubt it does the alignment for any but the winners; most likely the alignment is way more computationally intensive than the initial matching test, at least for not-so-close matches with gaps. So most of the time is probably taken up with the matching.

                          Good luck from sunny Arizona
                          John
                          Last edited by John Little; 20 Dec 2009, 08:45 AM.

                          Comment


                            #33
                            My thoughts are running along the lines of improving algorithm speed, maybe develop different approach to looking for patterns, maybe a faster winnowing process, maybe ....

                            At any rate, here's a more complete report from the 4 sequences:

                            Code:
                            [FONT=Courier New]                            09:40:25      12-21-2009
                                          Searching for GTYRAC contained in the following sequences:
                             E Coli strand                                                                                      
                                           [/FONT][URL="http://www.virus-evolution.org/Downloads/"][FONT=Courier New]http://www.virus-evolution.org/Downloads/[/FONT][/URL][FONT=Courier New]                                                          
                                           Ecoli Sequence                                                                                     
                                           5,528,444 Letters long had 4,731 Possible matches
                            GTCGAC    591 matches between positions       632 and 5,528,322 
                            GTCAAC  1,151 matches between positions     6,423 and 5,527,239 
                            GTTGAC  1,118 matches between positions     1,037 and 5,527,683 
                            GTTAAC  1,871 matches between positions    13,019 and 5,527,405 
                            ***************************************************************************[/FONT]
                            [FONT=Courier New] Candida albicans SC5314 chromosome 7 Ctg19-10262, whole genome shotgun sequence                    
                                           [/FONT][URL="http://www.ncbi.nlm.nih.gov/nuccore/46442158"][FONT=Courier New]http://www.ncbi.nlm.nih.gov/nuccore/46442158[/FONT][/URL][FONT=Courier New]                                                       
                                           GenBank: AACQ01000024.1                                                                            
                                             244,073 Letters long had 188 Possible matches
                            GTCGAC     19 matches between positions     2,830 and   241,776 
                            GTCAAC     51 matches between positions     1,876 and   240,932 
                            GTTGAC     69 matches between positions     2,716 and   243,902 
                            GTTAAC     49 matches between positions    14,132 and   237,849 
                            ***************************************************************************[/FONT]
                            [FONT=Courier New] Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.2, whole genome shotgun sequence                  
                                           [/FONT][URL="http://www.ncbi.nlm.nih.gov/nuccore/126212636"][FONT=Courier New]http://www.ncbi.nlm.nih.gov/nuccore/126212636[/FONT][/URL][FONT=Courier New]                                                      
                                           GenBank: AAVQ01000002.1                                                                            
                                           1,327,701 Letters long had 1,217 Possible matches
                            GTCGAC    169 matches between positions    11,423 and 1,324,850 
                            GTCAAC    386 matches between positions        53 and 1,315,234 
                            GTTGAC    435 matches between positions     4,697 and 1,326,191 
                            GTTAAC    227 matches between positions     2,915 and 1,322,208 
                            ***************************************************************************[/FONT]
                            [FONT=Courier New] Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.1, whole genome shotgun sequence                  
                                           [/FONT][URL="http://www.ncbi.nlm.nih.gov/nuccore/126213112"][FONT=Courier New]http://www.ncbi.nlm.nih.gov/nuccore/126213112[/FONT][/URL][FONT=Courier New]                                                      
                                           Gene ID:4850976                                                                                    
                                           2,182,446 Letters long had 1,915 Possible matches
                            GTCGAC    263 matches between positions    26,896 and 2,175,665 
                            GTCAAC    657 matches between positions     7,379 and 2,174,515 
                            GTTGAC    651 matches between positions     4,268 and 2,181,300 
                            GTTAAC    344 matches between positions    31,062 and 2,176,168 
                            ***************************************************************************[/FONT]
                            
                            [FONT=Courier New]4 Sequences with a total of 9,282,664 letters searched
                             Found 8,051 Matches
                            to GTYRAC (Y=C or T, R=G or A) 
                              in 0.25 seconds
                            [/FONT]
                            A lot of work for 1/4 of a second. - Note, each individual match position is in an array as well. No reason, just something that could be done.

                            ==========================================
                            I do not believe in political movements.
                            I believe in personal movement,
                            that movement of the soul when a man
                            who looks at himself is so ashamed
                            that he tries to make some sort of change
                            within himself, not on the outside.
                            Joseph Brodsky
                            ==========================================
                            It's a pretty day. I hope you enjoy it.

                            Gösta

                            JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                            LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                            Comment


                              #34
                              Gosta,

                              Extending your code to all the possible search strings should be straightforward; just include them in your possible$() array. Maybe another SUB to make the correspondences between the non-ACGT terms and the entries in the array.

                              It seems like a good idea, as you said, to think about alternative search strategies, if you want to be thorough. One reason is that quite a few of the possible search strings contain multiple N's and it would be wasteful to include all the possibilities.

                              It may well exceed your interest in the topic, but a thorough program of the type you're developing would (at least for the specific task I mentioned above) be able to recognize a large collection of sites. The ones that people would be interested in are in a file at http://rebase.neb.com/rebase/link_proto. Actually, researchers are mostly interested in a subset of the strings in this list, but weeding out the non-interesting ones doesn't change the programming issues.
                              Each line of this file corresponds to a particular search string. The first two are
                              AarI CACCTGC (4/8)
                              AatII GACGT^C
                              The first term is the name of the enzyme cutting there (not relevant to the pattern-matching issue); the second term gives the specificity (in the ACGT's), and information about where the enzyme cuts, not interesting for the pattern-matching, either with a caret ^ or with the numbers. For your purposes these can be removed with a text editor or a PB utility.

                              Quite a few search strings contain a lot of N's, such as XmnI GACNNNNGTC. For these there's no point in making a list of all the possible sequences that fit this (in this case 256=4^4); better to find matches to GAC then look 4 bp down the line for GTC.

                              I wonder if this could be generalized to a search strategy of "if first position is G then continue else exit; if second postion is A then continue else exit" etc.

                              It's interesting how far the hits vary from random in your output. E.g. for E. coli (the organism I work with) there is a huge disparity between hits for GTCGAC vs GTTAAC. I can't think of a biological reason for that right off.

                              Peter,

                              I reread the old thread and found your statement of the problem in post #58:

                              The problem I have to solve is to generate a fast search algorithm to find the matching position of a series of 10,000 - 20,000 newly generated nucleotide sequences within a reference sequence (a virus genome in my particular case). The sequencing method generates short sequences starting at random positions in the genome, so the challenge is to find out where each of them belongs so that an alignment can be constructed and a composite "consensus" sequence generated. Because of the genetic variability of viruses, the matches are not exact. Unlike humans, variants of the same virus can differ by 20% - 30% in sequence from each other, therefore the need for pattern matching where such mismatches are allowed.
                              Here are a few questions (apologies to non-DNA types) that would help me think about the best strategy:

                              1. Are these new sequences the result of some high-throughput approach like Illumina or 454? If so, which one? How long are the sequences, and what's the length of the target?
                              2. Are the new sequences all from the same isolate--that is, are they all the same, except for the occasional mismatch or mutation?
                              3. Do you expect all the changes to be single-base changes, or could there be indels?
                              4. Are you searching against a single target virus sequence, or multiple ones?

                              One fanciful idea would be to concatenate say 1000 of your sequences together (to reduce the number of iterations of the following search), then BLAST it against the virus sequence, using http://blast.ncbi.nlm.nih.gov/Blast....TYPE=BlastHome
                              using the link at bottom to bl2seq ("Align two (or more) sequences using BLAST (bl2seq)"). This would give you a bunch of alignments; there would still be a lot of work to make the contigs, but I presume there is code out there to do that.

                              Cheers
                              John
                              Last edited by John Little; 21 Dec 2009, 09:10 PM.

                              Comment


                                #35
                                Okay, in the above example a short "fragment" (GTYRAC) with two unknowns (Y & R) each of which could represent one of two other letters (Y=C or T, R=G or A). It was solved using a manually created array:
                                Code:
                                '   'spaces inserted for clarity
                                '                            '"GT Y R AC (Y=C or T, R=G or A)" 
                                ' Incr ctr: possibles$(ctr) = "GT C G AC"
                                ' Incr ctr: possibles$(ctr) = "GT C A AC"
                                ' Incr ctr: possibles$(ctr) = "GT T G AC"
                                ' Incr ctr: possibles$(ctr) = "GT T A AC"
                                ' ReDim Preserve possibles$(1 To ctr) ' Surprisingly only 4 possibilities
                                ' Num_Possibles = ctr
                                ' 'Removes spaces used for clarity
                                ' For n = 1 To Num_Possibles   
                                '    s$ =  possibles$(n)
                                '    possibles$(n) = Remove$(s$, " ")
                                ' Next n
                                ''
                                Cool, easy peasy. Super fast (see above). Okay he thinks, what if it was a longer fragment with even more unknowns? Can't be manually creating arrays now, can we? Need an algo to do it. Shouldn't be a problem.

                                Hmmmm.... turns out it is. At least for me. Dunno if the snowstorm (18" in backyard, not a record for here, but close enough) froze my brain or I'm just too dumb to figure it out. I can't even create the array for the 2 unknowns, never mind 5 or 10 or whatever. Here's as far as I got and am stuck at:
                                Code:
                                  
                                '   send frg$ (ex. GTYRAC) 'return array of possibilities 
                                Sub GHL_Generate_Search_Fragment_Posibilities(frg As String, Possb() As String)
                                  frg$ = UCase$(frg$) 'JIC
                                '  just to test start
                                '    For ctr = 1 To 5
                                '      s$ = s$ & frg$ 
                                '    Next ctr         
                                '    frg$ = s$
                                '
                                   '(Y=C or T, R=G or A)
                                   ReDim y$(1), r$(1) 'figure I'll need these, just don't know how yet.
                                    y$(0) = "C"
                                    y$(1) = "T"
                                    r$(0) = "G"
                                    r$(1) = "A"
                                    Num_Ys = ParseCount(frg$, "Y") - 1    
                                    Num_Rs = ParseCount(frg$, "R") - 1   
                                '  
                                 
                                    ub = 1000  'should be way more than needed, will redim Preserve before exit
                                    ReDim Possb(1 To ub)
                                '  ?Using$("Ys = #   Rs = #", Num_Ys, Num_Rs),, frg$:Exit Sub 'testing
                                ' 
                                  Flag = 1
                                  ''"GT Y R AC
                                  s$ = frg$
                                  While Flag > 0
                                    Reset Flag    
                                Y_Over:    
                                    i = InStr(s$, "Y")
                                       If i Then 
                                         Mid$(s$, i, 1) = y$(0)
                                         Incr ctr
                                 
                                         Incr flag
                                       End If  
                                 
                                  Wend    
                                End Sub
                                '
                                I can picture what it should look like in my mind but it just won't come out. Can anybody help the old dummy over the hump?

                                =================================================
                                "We don't like their sound,
                                and guitar music is on the way out."
                                Decca Recording Co. rejecting the Beatles, 1962
                                =================================================
                                Last edited by Gösta H. Lovgren-2; 21 Dec 2009, 09:08 PM.
                                It's a pretty day. I hope you enjoy it.

                                Gösta

                                JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                Comment


                                  #36
                                  John,

                                  Apparently we posted at the same time as I just now saw your post. I will try to look into the links you mentioned tomorrow. Late here and brain fried right now {sigh}.

                                  Really appreciate the education, as I'm sure others here do. Just saw an interesting NatGeo program ("Explorer") on canine genetics. Apparently the most pliable DNA known to exist. ("80% of all dog breeds came into existence in the last 130 years.") Neat stuff.

                                  =================================
                                  Each year,
                                  one vicious habit rooted out,
                                  in time
                                  ought to make the worst man good.
                                  Ben Franklin
                                  =================================
                                  It's a pretty day. I hope you enjoy it.

                                  Gösta

                                  JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                  LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                  Comment


                                    #37
                                    Originally posted by Gösta H. Lovgren-2 View Post
                                    I can picture what it should look like in my mind but it just won't come out. Can anybody help ...
                                    Gösta: Nice stuff you're doing!

                                    OK, so this is not pretty, and it's certainly not elegant. It's just a brute force method that I'm using to organize my thoughts.

                                    Be warned: I've been up all night, so this may contain glaring errors!
                                    Code:
                                    incr ndx   'point to next open slot in 1-based array 
                                    
                                    s$ = ucase$(frg$) 
                                    
                                    cntY = tally$(s$, "Y")
                                    cntR = tally$(s$, "R")
                                    NumPossb = 2 ^ (cntY + cntR)  'ain't binary exponents lovely?
                                    
                                    pFirst = ndx
                                    pLast  = pFirst + NumPossb -1
                                    
                                    for i = 1 to LEN(s$)
                                       c$ = mid$(s$, i, 1) 
                                    
                                       select case c$
                                       case "C", "T", "A", "G"
                                          for j = pFirst to pLast
                                             Possb(j)   = Possb(j) & c$
                                          next j
                                       case "Y"
                                          for j = pFirst to pFirst + (NumPossb\2)
                                             Possb(j)   = Possb(j)   & "C"
                                          next j
                                    
                                          for j = pFirst + (NumPossb\2) to pLast 
                                             Possb(j)   = Possb(j)   & "T"
                                          next j
                                    
                                    'the above lines are OK for the first-occurring Y (or R), and the next lines would be OK for the last, 
                                    'but I haven't calc'd the in-betweens. 
                                    '      for j = pFirst to pLast step 2
                                    '         Possb(j)   = Possb(j)   & "C"
                                    '         Possb(j+1) = Possb(j+1) & "T"
                                    '      next j
                                    'Need to account for which of the 2^n we're handling and set the alternation accordingly... 
                                    
                                       case "R"
                                          for j = pFirst to pFirst + (NumPossb\2)
                                             Possb(j)   = Possb(j)   & "G"
                                          next j
                                    
                                          for j = pFirst + (NumPossb\2) to pLast 
                                             Possb(j)   = Possb(j)   & "A"
                                          next j
                                    
                                    
                                       end select
                                    next i
                                    As the old quote goes: "First make it work, then make it right, THEN make it fast." In the above, I'm just trying to make sure it's doing what it's supposed to. Improvments may follow...after I get some sleep...

                                    -jhm

                                    @John L: I'm getting a lot from your posts; thanks!
                                    Last edited by John Montenigro; 22 Dec 2009, 08:36 AM. Reason: note to JohnL; change to the Y and R cases

                                    Comment


                                      #38
                                      Replying to myself... I DO need some sleep! But before I do, I added "Interval", but I have not checked it thoroughly.

                                      OK, Gosta, the ball's in your court...
                                      -jhm

                                      Code:
                                      incr ndx   'point to next open slot in 1-based array 
                                      
                                      s$ = ucase$(frg$)
                                      
                                      cntY = tally$(s$, "Y")
                                      cntR = tally$(s$, "R")
                                      NumWilds = cntY + cntR    'let's say NumWilds=3
                                      NumPossb = 2 ^ NumWilds   '...so then NumPossb = 8
                                      
                                      Interval = NumPossb '...to start with...
                                      	
                                      pFirst = ndx
                                      pLast  = pFirst + NumPossb -1
                                      
                                      
                                      for i = 1 to LEN(s$)
                                         
                                         c$ = mid$(s$, i, 1) 
                                         select case c$
                                         case "C", "T", "A", "G"
                                            for j = pFirst to pLast
                                               Possb(j)   = Possb(j) & c$
                                            next j
                                         case "Y"
                                            Interval = Interval\2  '...make the adjustment
                                            for j = pFirst to pFirst + (NumPossb\Interval)
                                               Possb(j)            = Possb(j)            & "C"
                                               Possb(j + Interval) = Possb(j + Interval) & "T"
                                            next j
                                      
                                         case "R"
                                            Interval = Interval\2  '...make the adjustment 
                                            for j = pFirst to pFirst + (NumPossb\Interval)
                                               Possb(j)            = Possb(j)            & "G"
                                               Possb(j + Interval) = Possb(j + Interval) & "A"
                                            next j
                                      
                                         end case
                                      next i
                                      Last edited by John Montenigro; 22 Dec 2009, 10:20 AM.

                                      Comment


                                        #39
                                        John,

                                        Just gave it a run. Fixed a couple typos. It still needs work {grin}. but it gives me a place to start. you got farther than me. Hopefully it will break my brainjam. Just call it from PBMain and you'll see what I mean.

                                        Code:
                                        ' *******************************************************
                                        ' From PMain:
                                        '  Local Possb() As String
                                        '   Call John_M_Generate_Search_Fragment_Possibilities("GTYRAC", Possb())
                                        '  ? "finished" 
                                        '**************************************************
                                        Sub John_M_Generate_Search_Fragment_Possibilities(frg As String, Possb() As String)
                                            ub = 1000  'should be way more than needed, will redim Preserve before exit
                                            ReDim Possb(1 To ub)
                                           Incr ndx   'point to next open slot in 1-based array 
                                           
                                           s$ = UCase$(frg$)
                                           
                                           cntY = Tally(s$, "Y")
                                           cntR = Tally(s$, "R")
                                           NumWilds = cntY + cntR    'let's say NumWilds=3
                                           NumPossb = 2 ^ NumWilds   '...so then NumPossb = 8
                                           
                                           Interval = NumPossb '...to start with...
                                            
                                           pFirst = ndx
                                           pLast  = pFirst + NumPossb -1
                                           
                                           
                                           For i = 1 To Len(s$)
                                              'Cur$ = Possb(j) '<<< ???
                                           
                                              c$ = Mid$(s$, i, 1) 
                                              Select Case c$
                                              Case "C", "T", "A", "G"
                                                 For j = pFirst To pLast
                                                    Possb(j)   = Possb(j) & c$
                                                 Next j
                                              Case "Y"
                                                 Interval = Interval\2  '...make the adjustment
                                                 For j = pFirst To pFirst + (NumPossb\Interval)
                                                    Possb(j)            = Possb(j)            & "C"
                                                    Possb(j + Interval) = Possb(j + Interval) & "T"
                                                 Next j
                                           
                                              Case "R"
                                                 Interval = Interval\2  '...make the adjustment 
                                                 For j = pFirst To pFirst + (NumPossb\Interval)
                                                    Possb(j)            = Possb(j)            & "G"
                                                    Possb(j + Interval) = Possb(j + Interval) & "A"
                                                 Next j
                                           
                                              End Select 'Case
                                           Next i
                                          ReDim Preserve Possb(1 To pLast) 'pare it down
                                          Array Sort Possb()
                                          
                                          'insert here for comparison to first effort
                                          ReDim pb(1 To pLast) As String
                                         Incr ctr: pb$(ctr) = "GT C A AC"
                                         Incr ctr: pb$(ctr) = "GT C G AC"
                                         Incr ctr: pb$(ctr) = "GT T A AC"
                                         Incr ctr: pb$(ctr) = "GT T G AC"
                                         Num_Possibles = ctr
                                        ' 'Removes spaces used for clarity
                                         For n = 1 To Num_Possibles   
                                            s$ =  pb$(n)
                                            pb$(n) = Remove$(s$, " ")
                                         Next n
                                          
                                          Reset s$ 
                                          s$ = Space$(6) & frg$ & Str$(Len(frg$)) & " yields:" & $CrLf
                                           s$ = s$ & "      John               Should be" & $CrLf 
                                            u$ =      "###}  \            \ ##  \             \ ##"
                                            For ctr = 1 To plast
                                               s$ = s$ & Using$(u$,Ctr, Possb(ctr),Len(Possb(ctr)), pb$(ctr), Len(pb$(ctr))) & $CrLf 
                                            Next ctr                                                              
                                            ClipBoard Set Text s$
                                            ? s$,,FuncName$
                                        End Sub
                                        '
                                        ' *******************************************************
                                        
                                        '
                                        ' *******************************************************
                                        Here's the result:
                                              GTYRAC 6 yields:
                                              John               Should be
                                          1}  GTCAGAC         7  GTCAAC           6
                                          2}  GTCGAC          6  GTCGAC           6
                                          3}  GTTAGAC         7  GTTAAC           6
                                          4}  GTTCAGAC        8  GTTGAC           6


                                        =================================
                                        I stepped from plank to plank
                                        A slow and cautious way
                                        The Stars about my Head I felt
                                        About my Feet the Sea.

                                        I knew not but the next
                                        Would be my final inch
                                        This gave me that precarious Gait
                                        Some call Experience.
                                        Emily Dickensen
                                        =================================
                                        Last edited by Gösta H. Lovgren-2; 22 Dec 2009, 11:26 AM. Reason: added array comparison results
                                        It's a pretty day. I hope you enjoy it.

                                        Gösta

                                        JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                        LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                        Comment


                                          #40
                                          Okay, playing some more (no solve yet though {sigh}) It struck me (distractionally) that JM (John M) used TALLY where I used PARSECOUNT.

                                          Hmmmm... Which is faster/better? Well ... it turns out both are the same. So I tested against Old Fogey Methodology using INSTR and WHILE/WEND and surprisingly iterating Instr is twice as fast while doing more work.

                                          Code:
                                           ' Main loop through sequences
                                           Seq_to_Look_At = 0
                                           While Seq_to_Look_At < UBound(Sequences())
                                             Incr Seq_to_Look_At
                                             Seq$ = Sequences(Seq_to_Look_At).Sequence
                                             Ltrs_total = Ltrs_total + Sequences(Seq_to_Look_At).Seq_Length
                                            'now look for matches inside sequence
                                             For ctr = 1 To Num_Possibles   
                                           
                                                    [B]'Tally takes .45 secs[/B]
                                          '       Seq_Match_Total(Seq_to_Look_At) = Seq_Match_Total(Seq_to_Look_At) + Tally(seq$, possibles$(ctr))
                                          '       Seq_Frag_Matches(Seq_to_Look_At, ctr) = Tally(seq$, possibles$(ctr))
                                           
                                                     [B]'ParseCount takes .46 secs[/B]
                                          '       Seq_Match_Total(Seq_to_Look_At) = Seq_Match_Total(Seq_to_Look_At) + ParseCount(seq$, possibles$(ctr)) - 1
                                          '       Seq_Frag_Matches(Seq_to_Look_At, ctr) = ParseCount(seq$, possibles$(ctr)) - 1
                                           
                                                    [B]'All this takes .25 secs AND does more - Saves matching positions in main sequence[/B]
                                                i = 1   
                                                Reset matches
                                                While i > 0                                                          
                                                   i = InStr(i, Seq$, possibles$(ctr))
                                                   If i Then 
                                                     Incr Seq_Match_Total(Seq_to_Look_At)
                                                     Incr Seq_Frag_Matches(Seq_to_Look_At, ctr)
                                                     Incr matches_Total
                                                     Incr Matches                                         
                                                     Report(Seq_to_Look_At, ctr, Matches) = i 'address of match
                                                     Incr i 'move over so no inf loop 
                                                   End If
                                                Wend 
                                           
                                              Next ctr
                                           Wend
                                          Each was run multiple times with consistent results (within 1or2/100's each time through). Note "overhead" (empty For Loop) is about 5/100's. Note also it would be probably faster yet using a For loop instead of a While.

                                          Funny where the vagaries iof the mind take you. And now back to the regularly scheduled programming (solving original problem ....)

                                          ====================================
                                          Thou, too, sail on, O Ship of State!
                                          Sail on, O Union, strong and great!
                                          Humanity with all its fears,
                                          With all the hopes of future years,
                                          Is hanging breathless on thy fate!
                                          Henry Wadsworth Longfellow.
                                          ====================================
                                          Last edited by Gösta H. Lovgren-2; 22 Dec 2009, 09:23 PM.
                                          It's a pretty day. I hope you enjoy it.

                                          Gösta

                                          JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                          LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                          Comment

                                          Working...
                                          X
                                          😀
                                          🥰
                                          🤢
                                          😎
                                          😡
                                          👍
                                          👎