To get a bit more specific about the DNA pattern-matching issue:
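John M. asked whether sequence orientation needs to be accounted for, i.e. whether the comparison has to be run against the target in both directions.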
This is an important issue, and there are several aspects to it. The first is best understood by knowing about the structure of DNA (see http://en.wikipedia.org/wiki/DNA for more detail). DNA contains two parts: a "backbone", which always has the same chemical structure and can be thought of as a scaffold, and the "bases", which are attached to the backbone. The backbone consists of repeating units, always with the same structure, alternating a sugar and a phosphate group; the bases are attached to the sugars and hang off the side of the backbone. Have a look at the Wikipedia article; I can't make the fonts align well enough to draw a simple example here...
The orientation of a sequence can't simply be reversed, because the structure of a DNA strand is not the same read left-to-right as right-to-left. We say that the strand has "chemical polarity", and its two ends are called the 5' end and the 3' end. (These terms refer to the numbering of the carbon atoms in the sugar.) DNA sequences are by convention written with the 5' end on the left and the 3' end on the right.
Second, DNA actually contains two copies of the information, carried in two different strands that physically coil around one another (hence, the "double helix"). The two strands have opposite polarity, which is best appreciated from the simple example below; we call them "anti-parallel".
Third, the two strands carry essentially the same information; knowing the sequence of one gives you the sequence of the other, using two simple rules: A pairs with T, G pairs with C. So a short segment of a longer double-stranded DNA might have the sequence
5' GGGTACCGATTA 3'
3' CCCATGGCTAAT 5'
Hence, knowing the sequence of the top strand gives you that of the bottom one. Writing the bottom strand with the usual convention (5' end on the left) gives 5' TAATCGGTACCC 3'. We call this sequence the "complement" of the other strand (strictly speaking the "reverse complement", since each base is swapped for its partner and the order is then reversed), and call T the complement of A, and G the complement of C.
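For anyone who wants to see the pairing rules as code, here is a minimal Python sketch that derives the bottom strand of the example above from the top strand. The names `COMPLEMENT` and `reverse_complement` are my own choices for illustration, and the sketch assumes the sequence contains only the letters A, C, G and T.

```python
# Pairing rules from the post: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Swap every base for its partner, then reverse the string so that
    # the 5' end of the new strand is on the left again.
    return seq.upper().translate(COMPLEMENT)[::-1]


top = "GGGTACCGATTA"
bottom = reverse_complement(top)
print(bottom)  # TAATCGGTACCC
```

Applying the function twice returns the original strand, which is a handy sanity check.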
[For the biologist, and for the living cell, having two copies of the sequence is critical. It allows the information to be copied; this can be envisioned by pulling the two strands apart (they are held together by weak interactions called hydrogen bonds), and using the pairing rules to make the complement of each strand. Then one has two double strands with the same sequence. This is termed "DNA replication". In addition, having two copies of the info is important, because if one copy gets damaged (e.g. by carcinogens) the other strand retains the info to repair the first one. This is called "DNA repair", not surprisingly.]
How does this bear on the pattern-matching issue John M raised? There are two common tasks, and the answer is somewhat different for each.
The first task would be to look for the presence of a short string in a long DNA sequence. This is the goal of a BLAST search as I referenced above. The long DNA might for instance be the entire DNA sequence of an organism. You might even have a short segment of DNA sequence and want to know where it came from, in which case (as in a BLAST search) you compare it to an entire database of sequences. In this case, you need to look for a match to both strands. The simplest way to do this is to take the short string, compare it with the target sequence, and then compare the complement of the short string (written 5' to 3', i.e. the reverse complement) to the target. The BLAST search does this automatically.
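The two-pass search just described could be sketched roughly as follows. This is exact substring matching only, which is much cruder than what BLAST does (BLAST finds approximate local matches), and the function names and the toy target sequence are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Opposite strand written 5' to 3' (assumes only A, C, G, T)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(query, target):
    """Start positions of every exact, possibly overlapping, occurrence."""
    hits = []
    start = target.find(query)
    while start != -1:
        hits.append(start)
        start = target.find(query, start + 1)
    return hits


def find_on_both_strands(query, target):
    """Return (position, strand) pairs; '-' means the query matched
    only after being reverse-complemented."""
    hits = [(pos, "+") for pos in find_all(query, target)]
    rc = reverse_complement(query)
    if rc != query:  # a palindromic site would otherwise be reported twice
        hits += [(pos, "-") for pos in find_all(rc, target)]
    return sorted(hits)


target = "AAGGGTACCGATTAAA"  # toy target containing the example sequence
print(find_on_both_strands("GGGTACC", target))  # [(2, '+')]
print(find_on_both_strands("TAATCGG", target))  # [(7, '-')]
```

Positions are 0-based offsets into the target as written, so a '-' hit tells you where the query's partner strand sits on the top strand.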
If on the other hand you want to do a sequence alignment--that is, to align two long sequences to each other, such as comparing two closely-related organisms, call them A and B--it should suffice as a first step to determine which strand of the A sequence corresponds to the "top" strand of the B sequence. This can be done by matching a short segment of A to the B sequence; if there is no match, try the complement of the A segment (written 5' to 3', i.e. reversed as well as complemented). Once you know how the two sequences correspond, they should usually align over very long lengths.
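As a rough illustration of that first step, the sketch below takes a short probe from A, looks for it in B, and flips A if only the reverse complement of the probe is found. Everything here is an assumption made for the example: the function name `orient_to_reference`, the default probe length, the choice to take the probe from the middle of A, and the use of exact matching (real sequences would need a more tolerant comparison).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Opposite strand written 5' to 3' (assumes only A, C, G, T)."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_to_reference(seq_a, seq_b, probe_len=12):
    """Return seq_a in the orientation that matches seq_b,
    or None if the probe is found on neither strand."""
    # Take the probe from the middle of A (an arbitrary choice).
    start = max(0, (len(seq_a) - probe_len) // 2)
    probe = seq_a[start:start + probe_len]
    if probe in seq_b:
        return seq_a  # A already reads along B's top strand
    if reverse_complement(probe) in seq_b:
        return reverse_complement(seq_a)  # A was the bottom strand
    return None


# Made-up sequences: A is a stretch of B taken from the opposite strand.
seq_b = "ATGCGTACGTTAGCCGATAGGCTTAACG"
seq_a = reverse_complement(seq_b[3:25])
print(orient_to_reference(seq_a, seq_b, probe_len=8) == seq_b[3:25])  # True
```

A result of None does not prove the sequences are unrelated; the probe may simply have landed on a region that differs between A and B, so in practice one would try several probes.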
Of course, biology isn't that simple. It's often the case that there are small or large insertions or deletions of sequence in B relative to A (these are termed "indels"). Then the gap issue discussed earlier in the thread comes up. Sequence alignment programs have the built-in capability to put in indels. Occasionally one segment of the sequence is inverted in organism A relative to B; this is called (surprise!) an "inversion". In this case, the sequences align either within the inversion or outside it, and the alignment abruptly breaks down at the boundary.
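To make the inversion case concrete, here is a toy Python example with made-up sequences in which B carries the middle block of A inverted. Probes from the flanking blocks match B as written, while the probe from the middle block matches only as its reverse complement, which is the kind of signal that tells you where the alignment will break. The helper name `probe_strand` is hypothetical, and matching is exact for simplicity.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Opposite strand written 5' to 3' (assumes only A, C, G, T)."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_strand(probe, target):
    """'+' if the probe matches as written, '-' if only its reverse
    complement matches, None if neither does."""
    if probe in target:
        return "+"
    if reverse_complement(probe) in target:
        return "-"
    return None


# Made-up blocks: B is A with the middle block inverted.
left, middle, right = "ATGGCGTTCA", "GGATCTTACG", "CCTAGAAGTC"
seq_a = left + middle + right
seq_b = left + reverse_complement(middle) + right

strands = [probe_strand(block, seq_b) for block in (left, middle, right)]
print(strands)  # ['+', '-', '+']
```

The switch from '+' to '-' and back marks the two boundaries of the inverted segment.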
John
John M. asked
2. Orientation
I don't know how important this is, but I wonder if there is a right-to-left orientation that needs to be accounted for.
Do we need to run the comparison against the target in both directions before we can determine a match? Is there a "directional" process that needs to be run?