Announcement

Collapse
No announcement yet.

Searching with mismatches

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #21
    Doh! the old DNA sequence matching problem.

    Have you tried the longest substring algo out of thic post of Gary Beene's?

    http://www.powerbasic.com/support/pb...ad.php?t=41543

    Comment


    • #22
      Peter, your algo doesn't seem excessively slow. Here are 5-letter strings. You can change the match lengths for more or less matches--I used 28 and 9 in this example:
      Code:
      #COMPILE EXE
      DEFLNG a-z
      
      FUNCTION PBMAIN () AS LONG
         LOCAL searchFragPtr, queryFragPtr AS BYTE PTR, printThin AS LONG
      
          CLS: PRINT "Sequence scan"
      
          'Make arbitrary TGACU data strings
          query$ = SPACE$(155 * 40)
          search$ = SPACE$(120 * 40)
          FOR ii = 1 TO 155 * 40
             ASC(query$, ii) = RND(1, 5)
             IF ii <= 120 * 40 THEN ASC(search$, ii) = RND(1, 5)
          NEXT
          REPLACE ANY CHR$(1,2,3,4,5) WITH "TGACU" IN query$
          REPLACE ANY CHR$(1,2,3,4,5) WITH "TGACU" IN search$
          
          REM specifying what match is required, ie at least 21 characters out of a text window of 30
          match_length = 28: mismatch_target = 9: REM maximum number of tolerated mismatches, ie. 30 minus 21
      
          REM scan individual characters within the window
          FOR a = 1 TO LEN(query$) - match_length
      
              query_fragment$ = MID$(query$, a, match_length)
              queryFragPtr = STRPTR(query_fragment$)
      
              FOR b = 1 TO LEN(search$) - match_length
      
                  search_fragment$ = MID$(search$, b, match_length)
                  searchFragPtr = STRPTR(search_fragment$)
      
                  REM Initalise counters
                  mismatch_total = 0: scan_total = 0: match_found = 0: REM to keep running totals
      
                  REM Scan subfragments from search$ and query$
                  FOR c = 1 TO match_length
      
                      INCR scan_total
                      'You have a choice here of ASC or even faster, index pointers.
                      IF @searchFragPtr[c-1] <> @queryFragPtr[c-1]  THEN INCR mismatch_total
      '               IF ASC(query_fragment$, c) <> ASC(search_fragment$, c) THEN INCR mismatch_total
      '               IF UCASE$(mid$(query_fragment$, c, 1)) <> UCASE$(MID$(search_fragment$, c, 1)) THEN INCR mismatch_total
      
                      REM early exits if too many mismatches or match total gone beyond maximum needed
                      IF mismatch_total > mismatch_target THEN EXIT FOR
                      IF scan_total - mismatch_total >= match_length - mismatch_target THEN match_found = 1: EXIT FOR
      
                  NEXT c
      
                  REM Found a match
                  IF match_found THEN GOSUB Report_match
      
              NEXT b
      
          NEXT a
      
          PRINT: PRINT "The end"
          WAITKEY$
      
          EXIT FUNCTION
      
      Report_match:
      
          INCR printThin
          IF (printThin AND 0) = 0 THEN    'only print 1 of every 512 matches
             PRINT: PRINT "Match found at"; a; "in the query sequence and"; b; "in the search sequence"
             PRINT query_fragment$
      
             FOR d = 1 TO match_length
                 IF UCASE$(MID$(query_fragment$, d, 1)) = UCASE$(MID$(search_fragment$, d, 1)) THEN PRINT "|"; ELSE PRINT " ";
             NEXT d
      
             PRINT: PRINT search_fragment$
                            
          END IF
      
      
      RETURN
      
      END FUNCTION

      Comment


      • #23
        I haven't thought this all the way through, but if you were to substitute as follows:
        0 for the unknown
        1 for the A
        2 for the C
        4 for the G
        8 for the T
        wouldn't a simple(or perhaps not so simple) summing algorithm suffice for finding matches?
        You would have to convert your string of 0s,1s,2s,4s,and 8s to longs to do your matching, but since the compiler is geared to longs it may give you an increase in speed.
        Rod
        In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

        Comment


        • #24
          Just for fun, I thought I'd take a shot at it. Turns out my powder is all wet. {sigh}. There is a serious error in logic in my Find_Matches Sub but dogged if I can find it.{rrrggghhh}. It generates nearly as many matches as the string is long until 8 or 9 then drops off to 1 match per sequence.

          '
          Code:
          'PBWIN 9.02 - WinApi 05/2008 - XP Pro SP3
          #Dim All 
          #Compile Exe  
          #Optimize SPEED
          '#Debug Display On'off for production code
           
          #Include "WIN32API.INC"
          '
          Global hdlg As Dword                
          Global Start_Matching, Match_Count() As Long
          Global Sequence, DNA_Pieces(), Look_For()  As String
          ' 
          Macro Common_Locals 'Macro easier than retyping and maintains coding consistency
            Local idd, Stile, r, Row, col, hght, wd, Longest,ctr, ln, ln1, i, i1 As Long
            Local Match_Len As Long
            Local tmr As Double
            Local  u, l, s, tmp() As String
          End Macro  
          '
          Sub Generate_Random_Sequence
            common_locals                                                                   
             tmr = Timer
             ln = 1024 * 100 '1024 * 10 'million chars   
             Sequence$ = Space$(ln) 'create big string
             ln1 = 4 'number of dna pieces
             Dim DNA_Pieces(1 To ln1)
             DNA_Pieces(1) = "A"
             DNA_Pieces(2) = "C"
             DNA_Pieces(3) = "G"
             DNA_Pieces(4) = "T"
          '   DNA_Pieces(5) = "?"                
          '
             Randomize Timer
          '         
            For ctr = 1 To ln    'fill in the string with random dna chars   
                r = Rnd(1, ln1)                      
          '      If r = 5 Then 'don't want too many unknowns                   
          '         r = Rnd(1, ln1)
          ''         Control Set Text hdlg, idd, Using$(u$, ctr, Timer - tmr):Control ReDraw hdlg, idd   
          '         If r = ln1 Then '1 in 125?
          '            r = Rnd(1, ln1)   
          '         End If
          '      End If
                Mid$(Sequence$, ctr) = DNA_Pieces(r)
            Next ctr 
          '
          '
            Match_Len = 20 ' longest sequence to match
            Dim Look_For$(1 To Match_Len), Match_Count(1 To Match_Len)                           
            For ctr = 1 To Match_Len  'now create strings to search for     
                Look_For$(ctr) = Mid$(Sequence$, 1, ctr)
            Next ctr 
          '
          End Sub                   
          '
          [B]Sub Find_Matches[/B]
          [B] common_locals                                                                   [/B]
          [B]'  [/B]
          [B] Start_Matching = 6 'minmum number of chars to match[/B]
          [B]  tmr = Timer[/B]
          [B]'[/B]
          [B] For ctr = Start_Matching To UBound(Look_For$())[/B]
          [B]   i = 1  'start at 1st char[/B]
          [B]Fm10:  [/B]
          [B]    If InStr(i, Sequence$, Look_For$(ctr)) Then 'got a match[/B]
          [B]      Incr Match_Count(ctr) 'count it[/B]
          [B]      Incr i 'start looking further on in the Sequence$[/B]
          [B]      GoTo FM10  'look for another[/B]
          [B]    End If[/B]
          [B] Next ctr[/B]
          [B]'[/B]
          [B]End Sub[/B]
          '
          '
          Function PBMain
            Common_Locals
          '            
            tmr = Timer
               Call Generate_Random_Sequence   
          '     
               Find_Matches
          '     
            s$ = Left$(Sequence$, 75) & $CrLf & $CrLf 
            For ctr = Start_Matching To UBound(Look_For$())
                s$ = s$ & Using$("##)  ", ctr) & Look_For$(ctr) & Using$(" #, ", Match_Count(ctr)) & $CrLf 
            Next ctr 
            ? s$,,Using$(" Searched #, sequence in  ,.## seconds",Len(Sequence$), Timer - tmr )
           
          End Function  'Applikation beenden
          '
          A currently set (6 char match), it only take a couple seonds to run if any of the foolhardy want to try to look into the (obviously) twisted logic of my mind.

          ==================================
          Any fool can
          criticize, condemn, and complain,
          and most fools do.
          Ben Franklin
          ==================================
          Last edited by Gösta H. Lovgren-2; 30 Oct 2009, 10:54 PM.
          It's a pretty day. I hope you enjoy it.

          Gösta

          JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
          LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

          Comment


          • #25
            Yes, DNA sequence representation is often on the form 1, 2, 4 and 8, as it allows undetermined bases to be represented. With the following code:

            1 : A
            2 : C
            4 : G
            8 : T

            Then a purine, R (G or A) would be represented as 5 (0101 in binary), an unknown base as 15 (1111) and a missing base as 0. Matching is then done by "AND"-ing the numbers, so C (0010) doesn't match R (0101), ie. equals zero, but a G does (0101 AND 0100 = 0100).

            In practice, this type of coding is not usually neessary for large scale DNA sequence scanning and if optimisation was required then 2 bits (0, 1, 2, 3) would do apart from the problem of representing an unknown base.

            One method that would seem to avoid the scale-up problem is to pre-classify the seach sequence into an array. If the search window length was 20, I suppose you could make a 20 dimensional matrix, which would hold the location of each occurrence of a specific sequence. Therefore the sequence AAAACCCCGGGGTTTTAAAA starting at position 4378 in the search sequence would lead to the value 4378 being assigned to the array at position:

            lookup_array(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0 , 0) = 4378

            If the sequence was RAAACCCCGGGGTTTTAAAA, then

            lookup_array(3, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0 , 0) would also be assigned to 4378

            When searching for a sequence, the 20 base window in the query sequence would be converted to the matrix coordinates and the matrix checked to see whether it contains a location. Ultimately, the search time would be proportional to the length of search string only, which does I suppose solve the original scaling problem.

            However, it doesn't allow searching for imperfect matches

            Obviously there would have to some way to record multiple positions with the same 20 base sequence.

            And of course the storage space would be horrible for a sensible search length.

            I used to program in a language with what were called sparse arrays where this type of solution would be feasible. In fact the equivalent statement in this language:

            DIM Lookup_array(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)

            would assign (almost) no memory or storage space until values were actually inserted into the array. The array storage treates the data as a chain, and each data entry also records the positions of the next and previous elements as they are entered.

            Maybe somebody has written a sparse array function in Powerbasic.

            Anyway, the kids are screaming and the wife is angry so I'd better stop!

            Peter

            Comment


            • #26
              Maybe somebody has written a sparse array function in Powerbasic.
              If they did, they used a User Defined Type since PB only allows 8 dimensions.
              Rod
              In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

              Comment


              • #27
                Isn't it amazing, all the suggestions and ideas you get, when you explain what the actual application requirement is?

                That is, when you ask for help with the WHAT rather than help implementing some predetermined HOW?
                Michael Mattias
                Tal Systems (retired)
                Port Washington WI USA
                [email protected]
                http://www.talsystems.com

                Comment


                • #28
                  Okay, fooling around some more. Created a billion+ character random string of the 4 dna letters (ACGT) then did a search of of it using 6 to 20 char lengths for matches. Results:

                  Searched 1,073,741,824 character sequence in 148.45 seconds
                  AATATGCTGGAGACTCTGGAACATGCTTCTCCAAAAAATCTTGTTGAGCCTTCTGTAGTATAACGAGAAATGGGC
                  6) AATATG found 262,238 matches Last at posn 1,073,741,261
                  7) AATATGC found 65,758 matches Last at posn 1,073,741,261
                  8) AATATGCT found 16,611 matches Last at posn 1,073,714,408
                  9) AATATGCTG found 4,209 matches Last at posn 1,073,714,408
                  10) AATATGCTGG found 1,062 matches Last at posn 1,073,714,408
                  11) AATATGCTGGA found 260 matches Last at posn 1,069,293,115
                  12) AATATGCTGGAG found 64 matches Last at posn 1,069,293,115
                  13) AATATGCTGGAGA found 17 matches Last at posn 1,060,172,682
                  14) AATATGCTGGAGAC found 8 matches Last at posn 946,011,588
                  15) AATATGCTGGAGACT found 2 matches Last at posn 420,603,986
                  16) AATATGCTGGAGACTC found 1 matches Last at posn 336,995,481
                  17) AATATGCTGGAGACTCT found 1 matches Last at posn 336,995,481
                  18) AATATGCTGGAGACTCTG found 0 matches Last at posn 0
                  19) AATATGCTGGAGACTCTGG found 0 matches Last at posn 0
                  20) AATATGCTGGAGACTCTGGA found 0 matches Last at posn 0
                  The first line is a left string of the first 60 chars or so of the searched sequence, just for show.

                  I'm well aware dna searching is much more sophisticated than this but the search time ain't too shabby. Imagine what the speed could be with pointers.

                  '
                  Code:
                  'PBWIN 9.02 - WinApi 05/2008 - XP Pro SP3
                  #Dim All 
                  #Compile Exe  
                  #Optimize SPEED
                  '#Debug Display On'off for production code
                   
                  #Include "WIN32API.INC"
                  '
                  Global hdlg As Dword                
                  Global Start_Matching, Match_Count() As Long
                  Global Sequence, DNA_Pieces(), Look_For()  As String
                  ' 
                  Macro Common_Locals 'Macro easier than retyping and maintains coding consistency
                    Local idd, Stile, r, Row, col, hght, wd, Longest,ctr, ln, ln1, i, i1 As Long
                    Local Match_Len As Long
                    Local tmr As Double
                    Local  u, l, s, tmp() As String
                  End Macro  
                  '
                  Sub Generate_Random_Sequence
                    common_locals                                                                   
                     tmr = Timer
                     ln = 1024 * 1024 * 1024 'billion chars   
                     Sequence$ = Space$(ln) 'create big string
                     ln1 = 4 'number of dna pieces
                     Dim DNA_Pieces(1 To ln1)
                     DNA_Pieces(1) = "A"
                     DNA_Pieces(2) = "C"
                     DNA_Pieces(3) = "G"
                     DNA_Pieces(4) = "T"
                  '   DNA_Pieces(5) = "?"                
                  '
                     Randomize Timer
                  '         
                    For ctr = 1 To ln    'fill in the string with random dna chars   
                        r = Rnd(1, ln1)                      
                  '      If r = 5 Then 'don't want too many unknowns so get another random number                  
                  '         r = Rnd(1, ln1)
                  '         If r = ln1 Then '1 in 125?
                  '            r = Rnd(1, ln1)   
                  '         End If
                  '      End If
                        Mid$(Sequence$, ctr) = DNA_Pieces(r)
                    Next ctr 
                  '
                  '
                    Match_Len = 20 ' longest sequence to match
                  '
                    Dim Look_For$(1 To Match_Len), Match_Count(1 To Match_Len, 1)                           
                  '
                    For ctr = 1 To Match_Len  'now create fragment strings to search for     
                        r = Rnd(1, ln - Match_Len) 'find random starting point for fragment                     
                        r = 1 'nah start at beginning of string
                        Look_For$(ctr) = Mid$(Sequence$, r, ctr)                      
                    Next ctr 
                  '
                  End Sub                   
                  '
                  Sub Find_Matches
                    common_locals                                                                   
                  '  
                    Start_Matching = 6 'minmum number of chars to match
                     tmr = Timer
                  '
                    For ctr = Start_Matching To UBound(Look_For$()) 'check each length for match
                      i = 2  'start at 2nd char so there 
                  Fm10:                                      
                      i = InStr(i, Sequence$, Look_For$(ctr))
                       If i Then 'got a match
                         Incr Match_Count(ctr, 0) 'count it
                         Match_Count(ctr, 1) = i 'where found
                         Incr i 'start looking further on in the Sequence$  *** finds too many matches
                         GoTo FM10  'look for another
                       End If
                    Next ctr
                  '
                  End Sub
                  '
                  '
                  Function PBMain
                    Common_Locals
                  '            
                    tmr = Timer
                       Call Generate_Random_Sequence   
                  '     
                       Find_Matches
                  '     
                    s$ = Left$(Sequence$, 75) & $CrLf & $CrLf 
                    For ctr = Start_Matching To UBound(Look_For$())
                        s$ = s$ & Using$("##)  ", ctr) & _
                                  Using$("\                  \  ", Look_For$(ctr)) & _
                                  Using$(" found #, matches", Match_Count(ctr, 0)) & _
                                  Using$(" Last at posn  #, ", Match_Count(ctr, 1)) & _
                                  $CrLf 
                   
                    Next ctr 
                    l$ = Using$(" Searched #, character sequence in  ,.## seconds",Len(Sequence$), Timer - tmr )
                    ? s$,,l$
                    ClipBoard set text l$ & $CrLf & s$    
                  End Function  'Applikation beenden
                  '
                  =================================================
                  "We grow great by dreams.
                  All big men are dreamers.
                  They see things in the soft haze of a spring day
                  or in the red fire of a long winter's evening.
                  Some of us let these great dreams die,
                  but others nourish and protect them."
                  Woodrow Wilson
                  =================================================
                  Last edited by Gösta H. Lovgren-2; 1 Nov 2009, 11:12 AM.
                  It's a pretty day. I hope you enjoy it.

                  Gösta

                  JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                  LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                  Comment


                  • #29
                    JFF, I converted John's code to PBWin 9.02 for any who are interested:

                    '
                    Code:
                     John Gleason converted to PBWin 9.02 post 22
                    #Compile Exe
                    DefLng a-z
                    '
                    Macro ptf = Print #1, 
                    '
                    Function PBMain () As Long
                       Local searchFragPtr, queryFragPtr As Byte Ptr, printThin As Long
                    '    CLS: Print "Sequence scan"
                       Local tmr As Double, s As String
                       tmr = Timer
                    '    CLS: Print "Sequence scan"
                       Open "z:\Report22.txt" For Output As #1 '******** change z:\ to your preference
                        ptf  "Sequence scan post 22" & $CrLf & $CrLf 
                    
                        'Make arbitrary TGACU data strings
                        query$ = Space$(155 * 40)
                        search$ = Space$(120 * 40)
                        For ii = 1 To 155 * 40
                           Asc(query$, ii) = Rnd(1, 5)
                           If ii <= 120 * 40 Then Asc(search$, ii) = Rnd(1, 5)
                        Next
                        Replace Any Chr$(1,2,3,4,5) With "TGACU" In query$
                        Replace Any Chr$(1,2,3,4,5) With "TGACU" In search$
                        
                        ' specifying what match Is required, ie At least 21 characters Out Of a Text Window Of 30
                        match_length = 28: mismatch_target = 9: ' maximum number Of tolerated mismatches, ie. 30 minus 21
                        ' Scan individual characters within the Window
                        For a = 1 To Len(query$) - match_length
                            query_fragment$ = Mid$(query$, a, match_length)
                            queryFragPtr = StrPtr(query_fragment$)
                            For b = 1 To Len(search$) - match_length
                                search_fragment$ = Mid$(search$, b, match_length)
                                searchFragPtr = StrPtr(search_fragment$)
                                ' Initalise counters
                                mismatch_total = 0: scan_total = 0: match_found = 0: ' To keep running totals
                                ' Scan subfragments From search$ And query$
                                For c = 1 To match_length
                                    Incr scan_total
                                    'You have a choice here of ASC or even faster, index pointers.
                                    If @searchFragPtr[c-1] <> @queryFragPtr[c-1]  Then Incr mismatch_total
                    '               IF ASC(query_fragment$, c) <> ASC(search_fragment$, c) THEN INCR mismatch_total
                    '               IF UCASE$(mid$(query_fragment$, c, 1)) <> UCASE$(MID$(search_fragment$, c, 1)) THEN INCR mismatch_total
                                    ' early exits If too many mismatches Or match total gone beyond maximum needed
                                    If mismatch_total > mismatch_target Then Exit For
                                    If scan_total - mismatch_total >= match_length - mismatch_target Then match_found = 1: Exit For
                                Next c
                                ' Found a match
                                If match_found Then GoSub Report_match
                            Next b
                        Next a
                        s$ =  $CrLf & Using$("The end took #.# seconds", Timer -tmr)
                        ptf s$
                        Close #1 
                    '    WAITKEY$  
                        ? s$
                        Exit Function
                    Report_match:
                        Incr printThin
                        If (printThin And 0) = 0 Then    'only print 1 of every 512 matches
                           Ptf $CrLf & "Match found at"; a; "in the query sequence and"; b; "in the search sequence"
                           ptf query_fragment$
                           For d = 1 To match_length
                               If UCase$(Mid$(query_fragment$, d, 1)) = UCase$(Mid$(search_fragment$, d, 1)) Then ptf "|"; Else ptf " ";
                           Next d
                           ptf $CrLf & search_fragment$
                                          
                        End If
                    
                    Return
                    End Function
                    '
                    Note you'll have to change the drive designation (z:\) to suit yourself. Z is a Memory disk drive I use to keep from from cluttering up my hard drive(s) withj junk (lotsa luck there, baby).

                    ================================================
                    "A single, seemingly powerless person who dares
                    to cry out the word of truth
                    and to stand behind it
                    with all of his person and all of his life,
                    ready to pay a high price, has, surprisingly,
                    greater power, though formally disenfranchised,
                    than do thousands of anonymous voters."
                    Vaclav Havel
                    ================================================
                    It's a pretty day. I hope you enjoy it.

                    Gösta

                    JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                    LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                    Comment


                    • #30
                      Thanks Gosta, you were up early this morning!
                      I ran the code, certainly fast but it doesn't address the possibility of allowing a proportion of mis-matches as far as I could see (ie. finding a fragment where 20 of the 30 bases match, don't have to be consecutive).

                      I've converted my original code to handle long DNA sequences, and used the pointers suggested by John Gleason. I also converted it to PBWIN. Here is the code:

                      Code:
                      #COMPILE EXE
                      DEFLNG a-z
                      
                      FUNCTION PBMAIN () AS LONG
                      
                          REM Use the faster string pointer method
                          LOCAL searchFragPtr, queryFragPtr AS BYTE PTR
                          
                          search_length = 20000
                          report$ = "Match lengths 20-25, search sequence" + STR$(search_length) + $CRLF + $CRLF
                      
                          REM Make a random sequence using bit-orientated representation
                          query$ = SPACE$(search_length): RANDOMIZE TIMER
                          FOR a = 1 TO search_length: MID$(query$, a, 1) = MID$("1248", RND(1, 4), 1): NEXT a
                      
                          FOR match_length = 20 TO 25
                      
                              search$ = SPACE$(500)
                              FOR a = 1 TO 500: MID$(search$, a, 1) = MID$("1248", RND(1, 4), 1): NEXT a
                      
                              mismatch_target = match_length / 5: REM allow 20% of search string to mismatch
                      
                              REM scan individual characters within the window
                              FOR a = 1 TO LEN(query$) - match_length
                      
                                  query_fragment$ = MID$(query$, a, match_length)
                                  queryFragPtr = STRPTR(query_fragment$)
                                  
                                  FOR b = 1 TO LEN(search$) - match_length
                      
                                      search_fragment$ = MID$(search$, b, match_length)
                                      searchFragPtr = STRPTR(search_fragment$)
                      
                                      REM Initalise counters
                                      mismatch_total = 0: scan_total = 0: match_found = 0: REM to keep running totals
                      
                                      REM Scan subfragments from search$ and query$
                                      FOR c = 1 TO match_length
                      
                                          INCR scan_total
                                          IF @searchFragPtr[c-1] <> @queryFragPtr[c-1]  THEN INCR mismatch_total: REM Strictly speaking these values should be "ANDed"
                      
                                          REM early exit if too many mismatches or match total gone beyond maximum needed
                                          IF mismatch_total > mismatch_target THEN EXIT FOR
                                          IF scan_total - mismatch_total >= match_length - mismatch_target THEN match_found = 1: EXIT FOR
                      
                                      NEXT c
                      
                                      REM Found a match
                                      IF match_found THEN GOSUB Report_match
                      
                                  NEXT b
                      
                              NEXT a
                      
                          NEXT match_length
                      
                          report$ = report$ + "The End!" + $CRLF: MSGBOX report$, %MB_ICONINFORMATION, "Search Results"
                      
                      
                          EXIT FUNCTION
                      
                      Report_match:
                          report$ = report$ + "Query length" + STR$(match_length) + ", mismatches allowed" + STR$(match_length - mismatch_target) + $CRLF
                          report$ = report$ + "Match at" + STR$(a) + " in query sequence and" + STR$(b) + " in search sequence" + $CRLF
                          
                          report$ = report$ + query_fragment$ + $CRLF + search_fragment$ + $CRLF + $CRLF
                      
                      RETURN
                      
                      END FUNCTION
                      It still runs at a fairly leisurely rate. Running through 5 different query lengths (between 20 and 25) with 20% of positions allowed to mismatch, it takes around 10 seconds to scan a 20,000 base search sequence. The number of hits varies between runs, so hope the message box fits on the screen.

                      I don't think I'll try your 1 billion-long sequence at this stage!!

                      Peter
                      Last edited by Peter Simmonds; 2 Nov 2009, 06:18 AM.

                      Comment


                      • #31
                        Peter, by using a divide and conquer technique of repeated small array rather then one-character matching, I *think* the speed could be maybe doubled or more and the time-complexity possibly reduced from O(n^2). It'd be pretty complicated code tho. The questions are: How big does it need to go max string size? and, is say 2 or 3x over the pointer version enough of a speed increase?

                        Comment


                        • #32
                          Hi

                          Yes, well one way would be to convert 8 bases from the search sequence (represented as 4 bit values, ie. 1, 2, 4 and 8) into an integer, and "AND" it with 8 bases from the query sequence. You could then count how many bits are left in the product. I suppose you could use a quad integer for a 16 base search, and of course split a longer search "word" into several integers and test them sequentially.

                          The number of bits left would represent the number of common bases, which would still allow mismatches to be incorporated.

                          Is this what you meant?

                          Peter

                          Comment


                          • #33
                            One more question: Is it usually only TGAC allowed in the strings? Then only 2 bits/letter can be used, and this should be much more efficient.

                            Comment


                            • #34
                              >>Is this what you meant?

                              My idea here is a little different: make 8-base (maybe more, depending) LONG arrays from the query segment, each 65K for 8 bases--which is 2*8 or 16 bits. The number of arrays needed would be query_length / 8, then pre-calculate each array member's # of matches for each subscript. No counting needs to be done, rather just addition of the 3 or 4 member values for up to a 32 length search. And, if my thinking is correct (biiig assumption), it should take 8 bases at a time. Now that might make it 8x faster, but I think it may still be n² time.

                              Comment


                              • #35
                                Originally posted by Peter Simmonds View Post
                                Thanks Gosta, you were up early this morning!
                                I ran the code, certainly fast but it doesn't address the possibility of allowing a proportion of mis-matches as far as I could see (ie. finding a fragment where 20 of the 30 bases match, don't have to be consecutive).
                                What I don't uinderstand(so far, and among many other things as well) is the "don't have to be consecutive". Here's a result from John's Code:

                                Sequence scan John Gleason
                                Match found at 1449 in the query sequence and 4322 in the search sequence
                                CUUUGAGACACGTTTTUUAGAGCATCTA
                                -|-||-|---|-|||-||||||-|||||
                                UUCUGCGTGUCATTTUUUAGAGAATCTA
                                Match found at 1450 in the query sequence and 4323 in the search sequence
                                UUUGAGACACGTTTTUUAGAGCATCTAT
                                |-||-|---|-|||-||||||-|||||-
                                UCUGCGTGUCATTTUUUAGAGAATCTAG
                                Match found at 1451 in the query sequence and 4324 in the search sequence
                                UUGAGACACGTTTTUUAGAGCATCTATC
                                -||-|---|-|||-||||||-|||||-|
                                CUGCGTGUCATTTUUUAGAGAATCTAGC
                                Match found at 1452 in the query sequence and 4325 in the search sequence
                                UGAGACACGTTTTUUAGAGCATCTATCT
                                ||-|---|-|||-||||||-|||||-|-
                                UGCGTGUCATTTUUUAGAGAATCTAGCG
                                The end took 10.2 seconds
                                ************
                                in each case it appears (I haven't counted) that 19 of 29 components match. Is that what you are looking for?

                                I don't think I'll try your 1 billion-long sequence at this stage!!

                                Peter
                                I just put that toegther for fun and boredom. My curiosity is to what constitutes a DNA match in language or terms I can understand.

                                It would be neat if we could develop a different search algo here. John, Paul and quite a few other speed freaks here are very very good at that. Me I ain't so hot on algos but I do provide stimulus once in awhile. {grin} Ain't much on speed either, more of a brute force guy (what we fishermen used to call a plugger).

                                =============================
                                "Bachelors have consciences,
                                married men have wives."
                                Samuel Boswell (1709-)
                                =============================
                                Last edited by Gösta H. Lovgren-2; 2 Nov 2009, 10:32 AM.
                                It's a pretty day. I hope you enjoy it.

                                Gösta

                                JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                Comment


                                • #36
                                  You could use 2 bits, but the problem would be interpreting the results of an AND operation (eg. if C was represented as 11 and G as 01 then a C:C match would leave 2 bits in the answer, but G:G would produce only one). This would prevent the number of matches from being calculated. It would also has the largely theoretical disadvantage of being unable to match unresolved bases. Using one bit for each of the four bases allows every combination of base to be coded and correctly scanned. If the sequence has a purine, coded as 0101 (A or G), this will match G (0100) and A (0001) but not C (0010) when "ANDed".

                                  I've read the last message few times and I can't quite see what you're doing. Can you explain it again?

                                  Thanks

                                  Peter

                                  Comment


                                  • #37
                                    Yes, I'll have to work up a little code snippet to show you what I mean, and it's slightly possible, now I'm thinking, that it may even be n(ln(n)) time. btw, it's almost just like your multi-dim array technique, but just 1 dim used several times

                                    Comment


                                    • #38
                                      Gosta

                                      Yes, search with mismatches means that a match is specified by a search window length (30 in some of the examples above) and an allowed number of mismatches (typically 20% for DNA searches between somewhat divergent genome sequences). Because each and any of the mismatches can be anywhere in the (30 base) search window, it immediately rules out INSTR, REGRPL and all the built search methods in Powerbasic. Counting the number of matches between a 30 base search fragment and a 30 base query fragment adds a rather long third loop into what is already a likely inefficient pair of loops that slowly increment through the search string and the query string, and which is why everything is horribly slow!

                                      Try the code I posted this morning to see what I mean!

                                      Comment


                                      • #39
                                        I'll show code as I'm progressing. Maybe it'll help explain it.
                                        Code:
                                        #COMPILE EXE
                                        #DIM ALL
                                        
                                        FUNCTION PBMAIN () AS LONG
                                        
                                            queryStr = "GTACGCGATACTGAGCTCGAATCGTATTAGCA"
                                            searcStr = "TCGATGCATACGAGTCCTGAGCTAGCTCGAATCGTATTAGCAGCTCCGGCCTATCGATCGTAGCCATGCTCTAGCTCCATCTAGGCTAGCCATACGCGCTATAGCCTATATACCGCTATTGCCGCTCGAAATCGCTCGCTATAGCCT"
                                            '1)Make query arrays: 4 total, 16-bits, each takes 8 bases from queryStr in order
                                            DIM qArr1(&h0ffff) AS LONG   'may be faster as BYTE, so test both
                                            FOR ii = 0 TO &h0ffff        'load arr1 with GTACGCGA (1st 8 bases) frequencies. A=00; C=01; G=10; T=11 binary
                                                                         'therefore GTACGCGA = &b01011000110011000. Start from AAAAAAAA and go to TTTTTTTT
                                                                         'calculating all permutations and match numbers in between. eg. qArr(&b01011000110011000) = 8
                                                                         'because GTACGCGA = &b01011000110011000 so all 8 bases match.
                                                                                
                                        
                                        END FUNCTION

                                        Comment


                                        • #40
                                          Peter, if you have some minutes, I could use a function.
                                          Code:
                                          #COMPILE EXE
                                          #DIM ALL
                                          
                                          FUNCTION PBMAIN () AS LONG
                                              DIM baseLet8, temp8 AS STRING
                                              
                                              queryStr = "GTACGCGATACTGAGCTCGAATCGTATTAGCA"
                                              searcStr = "TCGATGCATACGAGTCCTGAGCTAGCTCGAATCGTATTAGCAGCTCCGGCCTATCGATCGTAGCCATGCTCTAGCTCCATCTAGGCTAGCCATACGCGCTATAGCCTATATACCGCTATTGCCGCTCGAAATCGCTCGCTATAGCCT"
                                              bases8 = "AAAAAAAA"
                                              
                                              '1)Make query arrays: 4 total, 16-bits, each takes 8 bases from queryStr in order
                                              DIM qArr1(&h0ffff) AS LONG   'may be faster as BYTE, so test both
                                              query8 = "GTACGCGA"
                                              FOR ii = 0 TO &h0ffff        'load arr1 with GTACGCGA (1st 8 bases) frequencies. A=00; C=01; G=10; T=11 binary
                                                                           'therefore GTACGCGA = &b01011000110011000. Start from AAAAAAAA and go to TTTTTTTT
                                                                           'calculating all permutations and match numbers in between. eg. qArr(&b01011000110011000) = 8
                                                                           'because GTACGCGA = &b01011000110011000 so all 8 bases match.
                                                 temp8 = bases8
                                                 REPLACE "A" WITH "00" IN temp8
                                                 REPLACE "C" WITH "01" IN temp8
                                                 REPLACE "G" WITH "10" IN temp8
                                                 REPLACE "T" WITH "11" IN temp8
                                                 
                                                 'now calculate how many matches there are between GTACGCGA and say, AAAAAAAA
                                                 matches = matchesIn8(query8, bases8)
                                                 matches = 2 'for GTACGCGA and say, AAAAAAAA. A little help! LOL I don't have the function that does this yet.
                                                             'I just can see by looking at it that it = 2
                                                             
                                                 'and therefore the 1st member
                                                 qArr1(VAL("&b" & temp8)) = matches 'this is like qArr1(AAAAAAAA)
                                                 
                                          
                                          END FUNCTION
                                          
                                          FUNCTION matchesIn8(query8 AS STRING, bases8 AS STRING) AS LONG
                                                'Peter, can you fill this in? It'd save me a chunk o time :D
                                          END FUNCTION

                                          Comment

                                          Working...
                                          X