Announcement

Collapse
No announcement yet.

DNA Part Deux

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DNA Part Deux

    Continued from http://www.powerbasic.com/support/pb...ad.php?t=41843

    Thread was getting way too long. I created a template we can all use to test various methods/optimizations using consistent and real data. To test a different method, just create a new sub using the real life data and variables set upi in PBMain and that's all that will have to be posted here so others can see/test.

    '
    Code:
    '
    '
    ' *******************************************************
    '
    ' Download and unzip Data.zip at [URL]http://www.swedesdock.com/powerbasic/Data.zip[/URL] (6 meg) 
    '  It currently contains 4 different DNA sequences. 
    '
    ' PBMain Anyone can use To  test methods/optimizations/etceteraetceteraetcetera
    ' using Globally consistent criteria/
    '
    ' All that has To be done Is Create a Sub/Function and use the parameters within the Sub 
    ' that are set up In "Sub Reset_Start_Values" and there will be consistent reporting results.
    '
    '
    ' *******************************************************
    '
    '
    
    'Notes to myself
    ' (Post 38) - "typically 20% for DNA searches between somewhat divergent genome sequences). "
     'thoughts
    ' 1) maybe check only 10 with 2 mismatches. maybe faster. if get a hit then check longer sequence
    ' 2) maybe use value arrays? (123)
    '>>> 3) if first char mismatch, move to next - cannot start with a mismatch? -- No good
    '***************************
    'fragment 28 long with 9 unknowms
    'we know in this example there must be at least 1 3 letter matching sequence  in a matching fragment (28 \ 9)
    'ex 1111111111111111111111111111
    'ex 1101101101101101101101101101 '0 spaced as far apart as possible
    'ex   1  2  3  4  5  6  7  8  9 unknowns
    '*************
    'fragment 30 long with 6 (20%) unknowns
    'we know in this example there must be at least 1 4 letter matching sequence in a matching fragment (30 \ 6)
    'ex 111111111111111111111111111111
    '   111101111011110111101111011110
    'ex     1    2    3    4    5    6
    ' *******************************************************
    'reminder - Registering speeds up nearly 30%
    '  #Register None   
    '  Register b As Long   'used second most
    '  Register c As Long   'used most
    ' *******************************************************
    '**************************
    #Compile Exe
    '#Debug Display On' increases time as much as 50% longer when on 
    DefLng a-z
    '
    Global f_File() As String, _
          f_Title() As String, _
          f_Source() As String, _
          f_Name() As String, _
          f_Sequence() As String, _
          f_Length() As Long
    '
    ''consistency through routines
    Global Type_Run, query, Search As String 
    Global Search_Length, query_Length, Match_Length, Mismatches_Allowed As Long 
    Global Total_Compares, Total_Matches, Mismatch_Total, scan_total, Match_Found  As Long
    Global q1, q2, Early_Exit_Flag As Long
    Global start_time, tmr, tmr1, tmr2 As Single
    '**************************
    Macro pf =  Print #1, 
    '**************************
    '                         
    ' *******************************************************
    'http://www.powerbasic.com/support/pbforums/showthread.php?t=41843&page=4
    Sub y_Post_80_Code
      #Register None   
        Register c As Long
        Register scan_total As Long
    '
        Local searchPtr, queryPtr, a_Ptr, b_Ptr As Byte Ptr
        queryPtr = StrPtr(query$)
        searchPtr = StrPtr(Search$)
        Total_Matches = 0
    ''   -------------------------------------------------------------------------------------------------
        Rem Scan individual characters within the Window
        For a = 1 To Len(query$) - match_Length + 1
            a_Ptr = queryPtr + a
       
            For b = 1 To Len(Search$) - match_Length + 1
                Rem Initalise counters
                Mismatch_Total = 0: scan_total = 0: Match_Found = 0: Rem To keep running totals
                b_Ptr = searchPtr + b
                Rem Scan subfragments From Search$ And query$
                For c = 0 To match_Length - 1
                    Incr Total_Compares 
                    Incr scan_total
                    If ((@b_Ptr[c] - 48) And (@a_Ptr[c] - 48)) = 0 Then Incr Mismatch_Total
    '                If Early_Exit_Flag = 1 And c = 0 And Mismatch_Total = 0 Then Exit For
                    Rem early Exit If too many mismatches Or match total gone beyond maximum needed
                    If Mismatch_Total > Mismatches_Allowed Then Match_Found = 0: Exit For
                    If scan_total - Mismatch_Total >= Matches_Needed Then Match_Found = 1
                Next c
                Rem Found a match
                If Match_Found Then Incr Total_Matches
            Next b
        Next a
    End Sub
    ' *******************************************************
    ' *******************************************************
    ' Data at [URL]http://www.swedesdock.com/powerbasic/Data.zip[/URL] (6 meg)
    Sub z_Get_Fragments
      tmr! = Timer
      ln = 4
      Dim f_File(1 To ln) As String, _
          f_Title(1 To ln) As String, _
          f_Source(1 To ln) As String, _
          f_Name(1 To ln) As String, _
          f_Sequence(1 To ln) As String, _
          f_Length(1 To ln) As Long
      
       f_File$(1) = "C:\Only_My_Programs\DNA\Data\ID_E_Coli.txt"
       f_File$(2) = "C:\Only_My_Programs\DNA\Data\Id_46442158.txt"
       f_File$(3) = "C:\Only_My_Programs\DNA\Data\ID_126212636.txt"
       f_File$(4) = "C:\Only_My_Programs\DNA\Data\ID_126213112.txt"
     
       For ctr = 1 To 4
          z_Parse_Dna_Data_Files(f_File$(ctr), f_Title$(ctr), f_Source$(ctr), f_Name$(ctr), f_Sequence$(ctr), f_Length(ctr))
       Next ctr
      ln = 10
        s$ = Left$(EC_Sequence$, ln) & " - "  & EC_Title$ & " - " & EC_Source$ & " - " & EC_Name$ & " - " & Using$(" #, ", EC_Length) & $CrLf & $CrLf 
      For ctr = LBound(f_File$()) To UBound(f_File$())  
        s$ = s$ & Left$(f_Sequence$(ctr), ln) & " - "  & f_Title$(ctr) & " - " & f_Source$(ctr) & " - " & f_Name$(ctr) & " - " & Using$("   #, ", f_Length(ctr)) & $CrLf & $CrLf 
      Next ctr    
    '   ? s$,,Using$("Done #.## seconds ", Timer - tmr!)
    End Sub
    '
    ' *******************************************************
    '
    Function z_Parse_DNA_Data_Files( _
                          pFileName As String, _
                          pTitle As String, _
                          pSource As String, _
                          pName As String, _
                          pSequence As String, _
                          pSequence_Length As Long) As Long
    '                      Remove_Line_Numbering_Flag As Long, _ 
    
    '*** File structure
    '\Title\ E Coli strand \End Title\
    '\Source\ [URL]http://www.virus-evolution.org/Downloads/[/URL] \End Source\
    '\Name\ E Coli \End Name\
    'GCTTTTCATTCTGACTGCAACGG .......
    '
    ' or files that have line numbers
    '\Title\ Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.2, whole genome shotgun sequence \End Title\
    '\Source\ [URL]http://www.ncbi.nlm.nih.gov/nuccore/126212636[/URL] \End Source\
    '\Name\ GenBank: AAVQ01000002.1 \End Name\
    '        1 aataaaaagg gggccccaat tatcaatttt cctccaagaa tacaatgaac accttgtcat
    '       61 taataatata tttcctttga ttatgggcaa tttttgttca gaggggggtt tccatttcca
    '...       
    '*** End File structure
      ErrClear
      fle$  = pFileName
      Open fle$ For Binary As #1
         ln = Lof(#1)
         fil$ = Space$(ln) 'make big string
         Get #1, 1, Fil$   'fill it with file contents
         Close #1
    ''' get data
        s$ = "\Title\" : i = Len(s$) + 1 : 'start of Title 
         s1$ = "\End Title\": i1 = InStr(fil$, s1$) - Len(s$) - 1 : 'end of Title 
         pTitle$ = Trim$(Mid$(fil$, i, i1)) ' get title
    '     ? left$(pTitle$, 100),,"1"
     '
        s$ = "\Source\": i = InStr(fil$, s$) + Len(s$):' find source of data
         s1$ = "\End Source\": i1 = InStr(fil$, s1$) - i - 1
         pSource$ = Trim$(Mid$(fil$, i, i1))
    '
        s$ = "\Name\": i = InStr(fil$, s$) + Len(s$): 'find start of sequence name
         s1$ = "\End Name\": i1 = InStr(fil$, s1$) - i - 1
         pName$ = Trim$(Mid$(fil$, i, i1)) ' get title
    '     
         s$ = "\End Name\": i = InStr(fil$, s$) + Len(s$) 'beginning of sequence
        pSequence$ = Mid$(fil$, i + 1) 
    ''' got data
    '***                                
          pSequence$ = Remove$(pSequence$, Any "1234567890")
          pSequence$ = Remove$(pSequence$, $CrLf) 
          pSequence$ = Remove$(pSequence$, $Cr) 'JIC different format
          pSequence$ = Remove$(pSequence$, $Lf) 'JIC
          pSequence$ = Remove$(pSequence$, " ") 
    '***
         pSequence_Length = Len(pSequence)
         pSequence$ = UCase$(pSequence$) 'all same case
         Replace Any "ACGT" With "1248" In pSequence$ 'use numbers
         If Err Then
           ? fle$ & $CrLf & $CrLf & Error$(Err),, Using$(" Error # in ", Err) & FuncName$
         End If
    '***     
        Function = Err
    End Function 
    '
    ' *******************************************************
    '      
    Function y_PDpost118 () As Long
    #Register None
     
        Local searchPtr, queryPtr, a_Ptr, b_Ptr As Byte Ptr
        matches_needed = match_length - Mismatches_Allowed
        queryPtr = StrPtr(query$)
        For a = 0 To query_length-1: @queryPtr[a] = @queryPtr[a] -48:Next
        query$ = query$ +  String$(32,Chr$(16))   'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid
        queryPtr = StrPtr(query$)  'update string pointer because it changed on the above line
        searchPtr = StrPtr(search$)
        For a = 0 To search_length-1: @searchPtr[a] = @searchPtr[a] - 48: Next
        search$ = search$ +  String$(32,Chr$(16)) 'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid
        searchPtr = StrPtr(search$)  'update string pointer because it changed on the above line
            total_matches = 0
            'generate the masks used for non-multiples of 8 sequence lengths
            Mask0&& = &hffffffffffffffff
            Mask1&& = &hffffffffffffffff
            Mask2&& = &hffffffffffffffff
            Mask3&& = &hffffffffffffffff
            Select Case match_length
                Case 1 To 8
                    Shift Left mask0&&, 8 * match_length
                Case 9 To 16
                    Mask0&& = &h0
                    Shift Left mask1&&,8 * (match_length -8)
                Case 17 To 24
                    Mask0&& = &h0
                    Mask1&& = &h0
                    Shift Left mask2&&,8 * (match_length -16)
                Case 25 To 32
                    Mask0&& = &h0
                    Mask1&& = &h0
                    Mask2&& = &h0
                    Shift Left mask3&&,8 * (match_length -24)
            End Select
            alimit = Len(query$) - match_length -32   'take off 32 as there's 32 bytes of padding at the end
            For b = 1 To Len(search$) - match_length -32 'take off 32 as there's 32 bytes of padding at the end
                b_Ptr = searchPtr + b
                !mov esi,b_ptr    'get pointer into register ready for the loop
                !movq mm0,[esi]         'get 32 bytes to compare for this attempt
                !movq mm1,[esi+8]
                !movq mm2,[esi+16]
                !movq mm3,[esi+24]
             '   !por mm0,mask0&&        'mask off the bytes that aren't to be included
                !por mm1,mask1&&
                !por mm2,mask2&&
                !por mm3,mask3&&
                !mov edi,queryPtr       'get pointer to other string
                !mov edx,Mismatches_Allowed
                ' FOR a = LEN(query$) - match_length to 1 step -1
                !mov ecx, alimit
         
    #Align 16
                aloop:
                Incr Total_Compares '<<<<<<< GHL added this
                !movq mm4,[edi+ecx]         'get 32 bytes to check
                !movq mm5,[edi+ecx+8]
                !movq mm6,[edi+ecx+16]
                !movq mm7,[edi+ecx+24]
                !pand mm4,mm0               'do the 32 byte ANDs
                !pxor mm0,mm0               'clear mm0 after use so we have a zero register
                !pand mm5,mm1
                !pand mm6,mm2
                !pand mm7,mm3
                !pcmpeqb mm4,mm0            'do the 32 bytes worth of compare with zero
                !pcmpeqb mm5,mm0
                !pcmpeqb mm6,mm0
                !pcmpeqb mm7,mm0
                !paddb mm4,mm5              'add up how many mis-matched
                !paddb mm4,mm6
                !paddb mm4,mm7
                !PSADBW mm4,mm0
                !movq mm0,[esi]             'reload the low 8 bytes
                !PExtRW eax,mm4,0           'get the answer into a CPU register, it's negative so -4 means 4 mismatches
                !Add al,dl 'Mismatches_Allowed     'if this overflows, condition is met and carry is set
                !adc total_matches,0        'add carry into total to count matches
                ' NEXT a
                !dec ecx                     'a loop
                !jnz aloop
                !emms                        'restore FPU state after using MMX registers before using FPU again
                
            Next b
    End Function
    '
    ' *******************************************************
    '
    Sub z_Update_Results
      'keep a record of results
       Open "Testing_Results02.txt" For Append As #1 
       s_Title$ = "Time Match  Compares   Comparing   Using                    Searching in                 For"
    '   Print #1, s_Title$ 'only need to print once        
       pf " "
       s$ =       "#.## ##  ###,###,###  (#) and (#)  \                    \    #) \                      \ #) \                      \ "
       Print #1, Using$(s$,  _
                   Timer - Start_Time!, _  
                        Total_Matches, _
                              Total_Compares, _
                                           q1,   q2,  Type_Run$,             q1, f_Name$(q1),            q2, f_Name$(q2))
       Close #1                                             
    '
    End Sub
    '
    ' *******************************************************
    '
    Macro Reset_Start_Values
       search_Length = 6000
       query_Length = 5000
       Match_Length = 22 
       Mismatches_Allowed = 5 
       Total_Matches = 0
       Total_Compares = 0
      'using beginning of sequences to test 
       Search$ = Left$(f_Sequence$(q2), search_Length) '
       query$ =  Left$(f_Sequence$(q1), query_Length)   '
       Start_Time! = Timer
    End Macro              
    '
    ' *******************************************************
    '
    Function PBMain () As Long
        Randomize 1        'get same start each time
        Call z_Get_Fragments 'gets samples to work with
    ' *******************************************************
       q1 = 1 ' 1 = whole e coli sequence
       q2 = 1 'will start as sequence 2 below
    ' *******************************************************
     tmr! = Timer
     While q2 < UBound(f_Sequence$()) 'Rem this if only want to run one test
       Incr q2
       If q2 > UBound(f_Sequence$()) Then  'JIC
          ? Using$("# entered  # is Maxxed out", q2, UBound(f_Sequence$())),, "Bad file number"
          Exit Function
       End If    
      '
       Reset_Start_Values 'standardize searches through all tests
        Type_Run$ = "Post 80 - Peter S. ,"  :  Call y_Post_80_Code: z_Update_Results
      '  
       Reset_Start_Values 'standardize searches through all tests
      '    Early_Exit_Flag = 1: Type_Run$ = "Post 80 - early exit,": Call y_Post_80_Code: z_Update_Results: Early_Exit_Flag = 0
      '
       Reset_Start_Values 'standardize searches through all tests
        Type_Run$ = "PDpost118 - with Total_Compares": Call y_PDpost118: z_Update_Results
      '  
     Wend  'Rem this if only want to run one test
    ' *******************************************************
       
       ? Using$("Done #.## seconds ", Timer - Tmr!),, FuncName$
    '
    End Function
    '
    Sample comparison. Note While Early Exit takes less time an makes fewer compare, it doesn't find any matches so that theory can be tossedm or at least re-examined.

    Time Match Compares Comparing Using Searching in For
    3.38 2 236,308,770 (1) and (2) Post 80 - Peter S. 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
    2.69 0 178,553,581 (1) and (2) Post 80 - early exi 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
    ============================
    "I love criticism
    as it's unqualified praise."
    Noel Coward
    ============================
    Last edited by Gösta H. Lovgren-2; 19 Nov 2009, 09:28 PM.
    It's a pretty day. I hope you enjoy it.

    Gösta

    JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
    LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

  • #2
    Gösta:
    See post 128 in the other thread.
    In the above test you are missing 10958 comparisons.
    Rod
    In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

    Comment


    • #3
      Gösta,
      try this one. It's identical in almost every way to the original post 118 in the other thread.
      There is only 1 significant change (twice). I mistakenly subtracted 48 from each characters in the 2 strings counting from 1 to length, it should have been from 0 to length-1 when pointers are used.
      That's now fixed but it doesn't alter the result in any way as there were no matches at the first characters.

      As far as I can tell this code gives the same results as your code.

      Paul.
      Code:
      FUNCTION PDpost118 () AS LONG
      #REGISTER NONE
       
          LOCAL searchPtr, queryPtr, a_Ptr, b_Ptr AS BYTE PTR
          matches_needed = match_length - Mismatches_Allowed
          queryPtr = STRPTR(query$)
      
          FOR a = 0 TO query_length-1: @queryPtr[a] = @queryPtr[a] -48:NEXT
      
          query$ = query$ +  STRING$(32,CHR$(16))   'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid
          queryPtr = STRPTR(query$)  'update string pointer because it changed on the above line
      
          searchPtr = STRPTR(search$)
          FOR a = 0 TO search_length-1: @searchPtr[a] = @searchPtr[a] - 48: NEXT
      
          search$ = search$ +  STRING$(32,CHR$(16)) 'add a 32 byte tail to stop overrun when 32 bytes are taken and less bytes are valid
          searchPtr = STRPTR(search$)  'update string pointer because it changed on the above line
      
              total_matches = 0
      
              'generate the masks used for non-multiples of 8 sequence lengths
              Mask0&& = &hffffffffffffffff
              Mask1&& = &hffffffffffffffff
              Mask2&& = &hffffffffffffffff
              Mask3&& = &hffffffffffffffff
      
              SELECT CASE match_length
                  CASE 1 TO 8
                      SHIFT LEFT mask0&&, 8 * match_length
      
                  CASE 9 TO 16
                      Mask0&& = &h0
                      SHIFT LEFT mask1&&,8 * (match_length -8)
      
                  CASE 17 TO 24
                      Mask0&& = &h0
                      Mask1&& = &h0
                      SHIFT LEFT mask2&&,8 * (match_length -16)
      
                  CASE 25 TO 32
                      Mask0&& = &h0
                      Mask1&& = &h0
                      Mask2&& = &h0
                      SHIFT LEFT mask3&&,8 * (match_length -24)
      
              END SELECT
      
              alimit = LEN(query$) - match_length -32   'take off 32 as there's 32 bytes of padding at the end
      
              FOR b = 1 TO LEN(search$) - match_length -32 'take off 32 as there's 32 bytes of padding at the end
      
                  b_Ptr = searchPtr + b
      
                  !mov esi,b_ptr    'get pointer into register ready for the loop
                  !movq mm0,[esi]         'get 32 bytes to compare for this attempt
                  !movq mm1,[esi+8]
                  !movq mm2,[esi+16]
                  !movq mm3,[esi+24]
      
               '   !por mm0,mask0&&        'mask off the bytes that aren't to be included
                  !por mm1,mask1&&
                  !por mm2,mask2&&
                  !por mm3,mask3&&
      
                  !mov edi,queryPtr       'get pointer to other string
      
                  !mov edx,Mismatches_Allowed
      
                  ' FOR a = LEN(query$) - match_length to 1 step -1
                  !mov ecx, alimit
           
      #ALIGN 16
                  aloop:
      
                  !movq mm4,[edi+ecx]         'get 32 bytes to check
                  !movq mm5,[edi+ecx+8]
                  !movq mm6,[edi+ecx+16]
                  !movq mm7,[edi+ecx+24]
      
                  !pand mm4,mm0               'do the 32 byte ANDs
                  !pxor mm0,mm0               'clear mm0 after use so we have a zero register
                  !pand mm5,mm1
                  !pand mm6,mm2
                  !pand mm7,mm3
      
                  !pcmpeqb mm4,mm0            'do the 32 bytes worth of compare with zero
                  !pcmpeqb mm5,mm0
                  !pcmpeqb mm6,mm0
                  !pcmpeqb mm7,mm0
      
                  !paddb mm4,mm5              'add up how many mis-matched
                  !paddb mm4,mm6
                  !paddb mm4,mm7
                  !psadbw mm4,mm0
                  !movq mm0,[esi]             'reload the low 8 bytes
      
                  !pextrw eax,mm4,0           'get the answer into a CPU register, it's negative so -4 means 4 mismatches
      
                  !add al,dl 'Mismatches_Allowed     'if this overflows, condition is met and carry is set
      
                  !adc total_matches,0        'add carry into total to count matches
      
                  ' NEXT a
                  !dec ecx                     'a loop
                  !jnz aloop
                  !emms                        'restore FPU state after using MMX registers before using FPU again
                  
              NEXT b
      
      END FUNCTION

      Comment


      • #4
        Originally posted by Rodney Hicks View Post
        Gösta:
        See post 128 in the other thread.
        In the above test you are missing 10958 comparisons.
        Good catch, Rod. I added + 1 to each of the above loops: Results:

        Time Match Compares Comparing Using Searching in For
        3.38 2 236,308,770 (1) and (2) Post 80 - Peter S. 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
        2.69 0 178,553,581 (1) and (2) Post 80 - early exi 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
        3.36 2 236,398,402 (1) and (2) Post 80 - With + 1 1) Ecoli Sequence 2) GenBank: AACQ01000024.1


        Only I come up with almost 90,000 additional compares. (.2 of a sec longer). No additional matches though.

        ======================================
        "People demand freedom of speech
        to make up for the freedom of thought
        which they avoid."
        Soren Aabye Kierkegaard (1813-1855)
        ======================================
        It's a pretty day. I hope you enjoy it.

        Gösta

        JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
        LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

        Comment


        • #5
          And the clear winner is:

          Originally posted by Paul Dixon View Post
          Gösta,
          try this one. It's identical in almost every way to the original post 118 in the other thread.
          And the clear winner is (so far):

          Time Match Compares Comparing Using Searching in For
          3.38 2 236,308,770 (1) and (2) Post 80 - Peter S. 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          2.69 0 178,553,581 (1) and (2) Post 80 - early exi 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          3.36 2 236,398,402 (1) and (2) Post 80 - With + 1 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          0.34 2 0 (1) and (2) PDpost118 - no Total_Compares 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          0.35 2 29,758,484 (1) and (2) PDpost118 Incr Total_Compares 1) Ecoli Sequence 2) GenBank: AACQ01000024.1

          You didn't have a "Total_Compares" in your code (see first run) so I added it here:
          Code:
            
          #Align 16
                      aloop:
                      Incr Total_Compares '[COLOR=red]<<<<<<< GHL added this[/COLOR]
          for the second run(s). It oonly add 1/100 of a sec to routine. I don't know if I have it in the right place though. It only records 1/10 as many as Post80code though it finds the identical number of matches in each of the 3 sequences tested:

          3.36 2 236,398,402 (1) and (2) Post 80 - Peter S. , 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          0.35 2 29,758,484 (1) and (2) PDpost118 - with Total 1) Ecoli Sequence 2) GenBank: AACQ01000024.1
          3.39 8 237,357,877 (1) and (3) Post 80 - Peter S. , 1) Ecoli Sequence 3) GenBank: AAVQ01000002.1
          0.35 8 29,758,484 (1) and (3) PDpost118 - with Total 1) Ecoli Sequence 3) GenBank: AAVQ01000002.1
          3.39 8 237,248,346 (1) and (4) Post 80 - Peter S. , 1) Ecoli Sequence 4) Gene ID:4850976
          0.34 8 29,758,484 (1) and (4) PDpost118 - with Total 1) Ecoli Sequence 4) Gene ID:4850976

          Something else says to me I have the TC in the wrong place as TC varies with the sequence tested (2, 3 or 4) in post80 runs as is to be expected, but not in post118. I don't even know if TC is relevant or not as long as the answers match regardless of sequences tested. It does convey how much work is being done though.

          Code above (#1) amended to allow multiple runs:

          Now all anyone has to do is add a call to his Sub/Function in PBMain to compare it to other codes/tests.

          ========================================
          None are more hopelessly enslaved than
          those who falsely believe they are free.
          ~ Johann Wolfgang von Goethe
          ========================================
          Last edited by Gösta H. Lovgren-2; 19 Nov 2009, 09:27 PM.
          It's a pretty day. I hope you enjoy it.

          Gösta

          JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
          LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

          Comment


          • #6
            Gösta,
            you have 3 loops, a, b and c.
            Your program increments total_compares on each iteration of the inner most, c loop so you're counting the number of individual characters that are compared.

            In my code there is no c loop, it has been replaced by a single block compare of all of the bytes for this attempt so the total_compares no longer has any meaning. In effect, my code does match_Length compares each time but if you were to use total_compares += match_Length in place of Incr Total_Compares then my code would always give a higher compare count because yours drops out early when too many mismatches are found whereas my code does all 28 or 22 or whatever compares every time.
            Although it appears I'm then doing more work, it's faster to do it this way.

            As a side effect, if you ever need to do many scans on the same data, first with a match_length of 22 then 23 then 29 (or any set of match_lengths) then my method just needs a couple of extra compares at the end of the matching to allow a single scan to produce match counts for every match_length without needing to scan the entire data again.


            Paul.

            Comment


            • #7
              Originally posted by Paul Dixon View Post
              Gösta,
              you have 3 loops, a, b and c.
              Your program increments total_compares on each iteration of the inner most, c loop so you're counting the number of individual characters that are compared.

              In my code there is no c loop, it has been replaced by a single block compare of all of the bytes for this attempt so the total_compares no longer has any meaning. In effect, my code does match_Length compares each time but if you were to use total_compares += match_Length in place of Incr Total_Compares then my code would always give a higher compare count because yours drops out early when too many mismatches are found whereas my code does all 28 or 22 or whatever compares every time.
              I figured it was something like that Paul. As I said I what matters is the final product (right answers). I do however like to show how much work is being done. What I'll do, just for fun, is make it TC + 22 (or whatever) which will put it over a billion. Wow!. In tenths of a second. Double Wow!!

              A few short weeks ago, an example of a DNA comparison routine was posted here that took over 5 1/2 minutes to run (338 seconds). The routine has been a standard for years. The same comparison now takes tenths of a second.

              Boys, who could not be impressed by that?

              =============================================
              People change and forget to tell each other.
              (Lillian Hellman)
              =============================================

              Later:
              Code:
              ...#Align 16
                          aloop:
                          Total_Compares = Total_Compares + Match_Length '<<<<<<< GHL added this
              ...
               
              Yields:
               
              Time Match  Compares   Comparing   Using                     Searching in                 For
              3.39  8  237,248,346  (1) and (4)  Post 80 - Peter S. ,      1) Ecoli Sequence           4) Gene ID:4850976          
              0.36  8  654,686,648  (1) and (4)  PDpost118 - with Total    1) Ecoli Sequence           4) Gene ID:4850976
              Last edited by Gösta H. Lovgren-2; 20 Nov 2009, 09:06 AM.
              It's a pretty day. I hope you enjoy it.

              Gösta

              JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
              LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

              Comment


              • #8
                if you want to do like-for-like measurements of the work done then I'd say the sensible thing is not to count every character comparison but to count each 22 character sequence comparison, after all it's the short sequence that either matches or fails to match.
                If you move your INCR Total_Compares to immediatelly after the FOR B= instead of the FOR C = and on my code put !INC Total_Compares immediately after the aloop:

                Both will then return the same amount of compares done.

                Paul.
                Edit:
                the above suggestion is probably not worthwhile after all. This is meant to be an optimised search so instead of incrementing a count every time around the loop we should just calculate the number of comparisons upfront and save time:

                Total_comparisons = (search_Length - Match_Length) * (query_Length - Match_Length )
                Last edited by Paul Dixon; 20 Nov 2009, 09:27 AM.

                Comment


                • #9
                  Originally posted by Paul Dixon View Post
                  if you want to do like-for-like measurements of the work done then I'd say the sensible thing is not to count every character comparison but to count each 22 character sequence comparison, after all it's the short sequence that either matches or fails to match.

                  Paul.
                  I don't know that I agree. We already know the human "work done" is comparing two strings, x number of characters at a time. What I'm interested in is how much work the computer does. It's a marvel to me it can do a half billion things in 3/10 second.

                  And I think it's really cool a standard testing technique in a DNA lab for years has now been speeded up by a factor of at least 40 by some none DNA guys just fooling around in their spare time. And some are only "hobbyists" no less. {laughing}
                  It's a pretty day. I hope you enjoy it.

                  Gösta

                  JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                  LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                  Comment


                  • #10
                    Gösta ,
                    a standard testing technique in a DNA lab for years has now been speeded up by a factor of at least 40
                    Now we just have to hope that Peter is involved in virus research for some benevolent reason and not to create a biological weapon which will wipe us all out 40 times sooner that it would have done without our help!

                    Paul.

                    Comment


                    • #11
                      Not really meaning to put a damper on the rejoicing, you folks have earned it, but... when this was first being discussed, there was the mention of 'unknowns'. Is the code that has been developed capable of handling these creatures and still provide an accurate result? I know that the instances of unknowns could be very small, and by and of themselves are not important, they still have to be considered when handling the data.
                      There may be other considerations regarding these creatures as well.

                      I do hope a lot more testing is done before anyone relies too heavily on the code, but you folks have done well.

                      I guess we know now why PB doesn't have a mismatch feature in the INSTR function(Peter's first post on this subject.)
                      Rod
                      In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

                      Comment


                      • #12
                        Rodney,
                        to me this wasn't a genetics problem, it was a programming problem so I can only assume that Peter knows what he wants and that this code does it.
                        Since all of the programs compare bit patterns of 1,2,4,8 for A,G,C,T then a 15 will match ANY and a zero will match NONE. That should take care of any unknowns as a decision will have to be made before the search as to whether an unknown will match or not. If any other approach was required for unknowns then I'd assume Peter would have mentioned it.

                        Paul

                        Comment


                        • #13
                          Latecomer question

                          I've been away from the PB forums for a long time, and I came back today to catch up... I read the preceeding thread and this one with great interest, since I love pattern-matching problems.

                          Had I been involved at the start, I would have been reluctant to join in on solving the programming/pattern/speed problem due to my lack of understanding of the problem domain. I have questions that I haven't see addressed at all. Unfortunately, they spring from my having just enough information to get me into trouble... I know I could be totally wrong about genetics, but until I understand the basics, I feel very uncomfortable trying to develop a solution.

                          So let me try to clear my concerns by proffering what little I know, then asking if each concern is valid.

                          1. Alignment
                          One difficulty I have with the string comparisons is that the actual DNA strands might not be perfectly aligned, so we need to first find a way to align them before we can judge if they compare. And the way to do the alignment is to find comparable substrings...

                          Doesn't the "compare for DNA match" process depend upon (and occur after) a "compare for alignment" process? (One step further, how do we know the DNA is even comparable? Is it necessary to "compare for same species" first? I know, these last 2 questions are outside the domain of programming solutions...)

                          It seems to me that the programming solution has to be more robust in order to first find the proper alignment... or perhaps, it needs to be continually testing various alignments to see which match up "better"... Or, do we just run the same code for some number of iterations, each time starting a little further into the target string each time?


                          2. Orientation
                          I don't know how important this is, but I wonder if there is a right-to-left orientation that needs to be accounted for.

                          Do we need to run the comparison against the target in both directions before we can determine a match? Is there a "directional" process that needs to be run?


                          3. Variability
                          It is my understanding that, for many reasons, DNA strands can vary in length. One possible reason is the "stuttering gene". (I read about this in an article that discussed research being done on schitzophrenia.) In essence, after several generations, a particular gene in the DNA of a descendant can repeat. Thus where the ancestor had one G, a descendant several generations later might have GGG in the "same" place in the DNA. (I don't recall if "G" is subject to stuttering; I'm using it here as a generic symbol.)

                          So in a comparison of AAGT against AAGGGT, we might not want to say that these substrings don't match. (I'm using these short substrings as examples for this discussion; I'm sure that in real DNA, there are many more factors.)

                          How much variability is acceptable? How is it measured?


                          4. Proximity
                          It is my (probably wrong) understanding that long sequences of genes can be interrupted by other gene sequences. I don't remember exactly what the interruptions are called or are caused by, but I seem to recall that it has something to do with mutation (or viral invasion?).

                          If I'm recalling even a little bit correctly, then these "interruptions" would require "wildcard" matching, where a sequence of AAGT and a neighbor of TTAC might be adjacent in one string but separated in another. (Again, I use short sequences only for ease in this discussion.)

                          If "interruptions" occur, then how do we know "how far away" to look for the next match? In other words:

                          Looking for: AAGTTTAC
                          in: CCACAAGTCCATATGTTAC ...

                          +++++++++
                          It seems to me that the programming problem takes issues 1, 2, and 3 as "given" - we are presuming that the strings are aligned at the "proper starting point", they are being read in the proper direction, and that variations will be detected.


                          But to my (admittedly limited) thinking, wouldn't the presence of variations (and/or proximity problems) throw off the subsequent comparisions? Doesn't the solution have to account for that?

                          In order to account for Variability, shouldn't it have a "placeholder/look-ahead/resume" strategy?


                          I know and admit the limits of my knowledge of DNA and genetics, but I do have some sense that in terms of pattern matching, we may be missing an important strategic question.

                          -John

                          Comment


                          • #14
                            Excellent questions, John. One of the reasons I moved the thread over here was to maybe get into some more nuanced searching. I haven't posted anything yet as there are still a few things running around in my head to sort out.

                            What I was hoping we could get Peter to do, as he's the only DNA experienced guy here, is point to several actual dna sequences, say 10, or 100 or ... And that 2 or 3 or ?? were known to be related. Father/mother/daughter/son or something. All he would have to do is tell us the "rules" of a (near) match.

                            Then we could, use that as a platform/contest/experiment to see if we couldn't come up with new/different/faster matching solutions. After all, IIRC how this thread started, Peter's standard well established for a decade or more routine was improved by a factor of 100? (from 335 secs to 5 secs or maybe .5 sec) just by a little fooling around by a few amateurs. We did that only because we didn't know the routine couldn't be improved. And the routine users knew it couldn't be improved so didn't waste time trying and went on to other stuff.

                            ======================================
                            "The secret of success is
                            to know something nobody else knows."
                            Aristotle Onassis (1906-1975)
                            ======================================
                            It's a pretty day. I hope you enjoy it.

                            Gösta

                            JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                            LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                            Comment


                            • #15
                              John,

                              1. Alignment
                              I don't see any alignment problem as the seached DNA is being scanned for all possible chunks of DNA of a given size from the query DNA. There is no attempt to align large blocks of data, just to count how many small chunks match reasonably well.


                              2. Orientation
                              If this is a concern then just reverse one of the strings and do the search again.


                              3. Variability
                              4. Proximity
                              Again, I don't see these as a problem as the search is for a maximum of 32 characters, not for long strings of DNA.


                              Maybe this code is used as a preliminary search to find good candidates to be looked at in more detail.
                              That more detailed look might then take into account the points you mention, but only on a tiny fraction of the data which gets through the initial screening.

                              Paul.

                              Comment


                              • #16
                                Paul,

                                Excellent! That helps me put the coding effort into the bigger context, so I don't have to worry if the result is relevant. (I'll bet you couldn't tell that I'm more of a Systems Analyst than a coder, could you?)

                                Thanks for the clarification!
                                -John


                                ADDED:
                                Found a report that contains info about the points I asked about above. They're not the focus (buried in the narrative) but what a fascinating read!
                                http://www.plosone.org/article/info:...l.pone.0007866
                                (Yes that link looks odd with the icon in the middle, but it clicks through just fine...)
                                Last edited by John Montenigro; 4 Dec 2009, 11:57 AM. Reason: added note about report

                                Comment


                                • #17
                                  Been playing in the water another way (or 2). Someone pointed me to a source of DNA samples so I dl'ed four but I wasn't having much luck with them. Remember the "unknowns" at the start of this? We (at least I) were assuming an unknown was just a different ACGT but then I remembered Peter posting other letters so I explored that avenue.

                                  Along the lines eliminating obvious mismatches I came up with this:
                                  '
                                  Code:
                                  ' Original data (4 sequences) may be downloaded from:
                                  'http://www.SwedesDock.com/powerbasic/DNA/DNA_Data.zip
                                  '   The file format is:
                                  '  \Title\ E Coli strand \End Title\
                                  '  \Source\ [URL]http://www.virus-evolution.org/Downloads/[/URL] \End Source\
                                  '  \Name\ Name used at download site \End Name\
                                  '  GCTTTTCATTCTGACTGCAACGG ....... 'actual sequence
                                  '****************************************************
                                  '  The program starts by loading (4) sequences into a UDT array - Sequences(1 to 4)
                                  '     *see Type Sample below for description.
                                  '
                                  'Questions that occur to me.
                                  '  1} Is there any known beginning/end to a DNA Strand? IOW if one is to search 
                                  '  2} 
                                  '  3} 
                                  '  4} 
                                  '  5} 
                                  '  6} 
                                  '  7} 
                                  '  8} 
                                  '  9} 
                                  ' 10} 
                                  '
                                  '
                                  '**************************
                                  
                                  '
                                  'PBWIN 9.01 - WinApi 05/2008 - XP Pro SP2
                                  '
                                  #Compile Exe 
                                  #Optimize SPEED 'Fly baby, fly
                                  '#Debug Display On 'off in production code
                                  DefLng a-z
                                  '**************************
                                    #Include "WIN32API.INC"
                                  '************************** constants set here
                                   $Data_Location = "C:\Only_My_Programs\DNA\Data\" 'assign your local location
                                  '************************** Globals here
                                  Global Letter_Name() As String
                                  '************************** Types here
                                  Type Sample
                                    Title As String * %Max_Path
                                    Source As String * %Max_Path
                                    Nam As String * %Max_Path
                                    Sequence As String * 1000 * 1000 * 10 '10 million should be plenty
                                    Seq_Length As Long 
                                  End Type   
                                    Global Sequences() As sample
                                  '************************** Macros here
                                  Macro pf =  Print #1, 
                                  '**************************
                                  '
                                  ' *******************************************************
                                  '
                                  Sub GHL_Byte_Array_Test
                                    #Register None   
                                    Register b As Long   'used second most
                                    Register c As Long   'used most
                                  '
                                  'DNA_Source_Strings.Inc - for testing consistency
                                  '
                                  ' *******************************************************
                                  '
                                  '  Get_Strings
                                    'find longest matching sequence in first 1000 chars (long improbable match)
                                    'doesn't improve speed as I thought it might
                                  ''**
                                  '   For ctr = 1000 To 1 Step -1
                                  '       s$ = Mid$(Query$, 1, ctr) 'get snippet to look for
                                  '       i = InStr(Search$, s$)    'look for it
                                  '       If i Then 'eureka!                     
                                  '          Eureka = i
                                  '          query$ = Mid$(query$, ctr) + Left$(query$, ctr-1) 'move first match to beginning and put not matching portion on end to preserve length
                                  '          search$ = Mid$(Search$, i) + Left$(Search$, i-1) 'move first match to beginning and put not matching portion on end to preserve length
                                  '          Exit For
                                  '       End If
                                  '   Next ctr
                                  ' '**
                                     
                                   Mid$(query$, 1, 28) = Search$ 'ensure some matches
                                    
                                      ln = 50
                                      Query_Check$ = Space$(5) & Left$(query$, ln)
                                        Search_Check$ = Space$(5) & Left$(Search$, ln)
                                   'now replace with values
                                      Replace Any "ACGT" With "1248" In Query$
                                      Replace Any "ACGT" With "1248" In Search$
                                   '
                                      q_Check$ = Space$(5) & Left$(Query$, ln)
                                        s_Check$ = Space$(5) & Left$(Search$, ln)
                                   '
                                      Query_length =  Len(Query$)  
                                      search_length = Len(Search$)
                                      match_length = 28 '22
                                      mismatch_target = 9 '5
                                      matches_needed = match_length - mismatch_target
                                      MisMatches_Allowed = mismatch_target
                                    '
                                  
                                  '
                                  ' *******************************************************
                                  ' At this point strings are identical to previous tests
                                     start_time! = Timer
                                     ' the time to assign byte arrays is insignificant, 1 or 2 hundredths of a second overall
                                      ln = Search_Length
                                      If ln < query_length Then ln = query_length 'find longest for arrays so no over-runs
                                       ln = ln * 2 ' make arrays plenty long JIC - no penalties involved
                                     Dim s(1 To ln) As Byte
                                       For a = 1 To search_length: s(a) = Val(Mid$(search$, a, 1)): Next a
                                          For b = a To ln: s(b) = 9: Next b 'ensure no matches with q()
                                     Dim q(1 To ln) As Byte
                                       For a = 1 To query_length: q(a) = Val(Mid$(query$, a, 1)): Next a
                                          For b = a To ln: q(b) = 0: Next b 'ensure no matches with s()
                                  '        
                                   '**********************   Testing section
                                  '  Test_Print = 1     
                                  '**  
                                     If Test_Print Then 
                                         'for checking in display
                                         Dim lmm(50)
                                        lm$ = String$(50, "-")
                                  '*   
                                        For ctr = 1 To 50
                                  '*   
                                           If s(ctr) = q(ctr) Then 
                                              Mid$(lm$, ctr) = "|"
                                              lmm(ctr) = ctr
                                           End If   
                                  '*   
                                        Next ln 
                                  '*       
                                        lm$ = Space$(5) & lm$ 
                                        
                                       Open "Testing.txt" For Output As #1 
                                        pf Time$                                    
                                        pf Query_Check$: pf q_Check$
                                        pf lm$
                                        pf S_Check$:  pf Search_Check$      
                                        pf "Matches at ";
                                  '*   
                                       For ctr = 1 To UBound(lmm())
                                         If lmm(ctr) Then pf Str$(lmm(ctr));
                                       Next ctr
                                  '*               
                                       pf " "
                                      
                                    End If 
                                  '**  
                                  '************* comparing here
                                     start_time! = Timer
                                     For a = 1 To search_length - match_length   'piece to search                                    
                                       If Test_Print Then  Pf Using$(" ####}  # # = ##  Total Matches ##,###", a+1, q(a+1), s(a+1), q(a+1) = s(a+1), qs_TTl )
                                  '**    
                                       For b = 0 To query_length - match_length ' 
                                         Reset mm_ctr, scan_total, Matched_flag, Match_ctr 'set to 0
                                         ab = a + b  ' save an add in inner loop (Thanks, Paul)
                                         If Test_Print Then pf  "  Compares  (ab+c)  a  b  c  q() s()   Score   MisMatches  Matched  "
                                  '**  
                                         For c = 0 To match_length - 1
                                     '      abc = ab + c ' save an add in inner loop - actually adds .3 second
                                           If mm_ctr  > MisMatches_Allowed Then  Exit For  'more mismatches than allowed
                                           Incr Compares_Made  'count total checks  'time taken is insignificant, 1 or 2 hundredths of a second overall
                                           Incr scan_total
                                            
                                           'If q(abc) <> s(abc) Then Incr mm_ctr Else Incr match_ctr 'adds 3 tenths to routine this way
                                           If q(ab + c) <> s(ab + c) Then Incr mm_ctr Else Incr match_ctr 'no match for this letter but count matches
                                             
                                           If scan_total - mm_ctr => Matches_Needed Then Matched_Flag = 1
                                                                       '"  Compares  (ab+c)  a  b  c  q() s()   Score   MisMatches  Matched  "
                                           If Test_Print Then pf  Using$("###,###,###}   ##  ## ## ##   #   #      ##        ##        ##", _
                                                                         Compares_Made, ab+c, _
                                                                                             a, _
                                                                                                b, _
                                                                                                   c, _
                                                                                                       q(ab+c), _
                                                                                                            s(ab+c), _
                                                                                                                   q(ab+c)=s(ab+c), _
                                                                                                                             mm_ctr, _
                                                                                                                                       Match_ctr)
                                             
                                        Next c
                                   '**
                                         If Test_Print Then pf " "
                                         Total_Matches = Total_Matches + Matched_Flag
                                         If Matched_Flag = 0 Then Incr Not_Matched   'only adds couple tenths
                                       Next b
                                  '**                           
                                       If Test_Print And a = 3 Then Reset Test_Print  'enough don't need 100's millions lines
                                     Next a                                      
                                  '**   
                                     Close #1
                                  '
                                  ' *******************************************************
                                  '
                                   Type_Run$ = Space$(5) & "Using Byte Arrays 1st 28 chars same in each string"
                                     Open "DNA_Testing_Results01.txt" For Append As #1 
                                        'Print #1, Query_Check$: Print #1, q_Check$
                                        'Print #1, Search_Check$: Print #1, s_Check$
                                        Print #1, Type_Run$
                                        Row$ = "###,###,###  ##.## Seconds  QL = ##,###  SL = ##,### | found #, Matches | #, Not Matched | MatchLen = ## |  Mismatches allowed = ## |"
                                        Print #1, Using$(Row$, _
                                                Compares_Made, _
                                                             Timer - start_time!,_
                                                                            query_length, _
                                                                                          search_length, _
                                                                                                      Total_Matches, _              
                                                                                                                         Not_Matched, _
                                                                                                                                         match_length, _
                                                                                                                                                           MisMatches_Allowed)
                                        Print #1, " " : Close #1 
                                    ? "The end"
                                  End Sub                                               
                                  '
                                  ' *******************************************************
                                  '
                                  Sub GHL_Count_Letters
                                   Local tmr As Single
                                   Local ltrs() As Long
                                   Tmr = Timer
                                   '
                                    ub = UBound(Sequences()) 'easier
                                    lb = LBound(Sequences())
                                    ReDim ltrs(1 To ub, 255) 'hold letter count
                                  '
                                  '******
                                    While Sequence_Counter < UB
                                      Incr Sequence_Counter
                                     '
                                      q$ = Trim$(Sequences(Sequence_Counter).Sequence) 
                                      lngth = Len(q$) 
                                  '****
                                      ' count letters
                                       For ctr = 1 To lngth
                                         a = Asc(Mid$(q$, ctr, 1))
                                         Incr ltrs(Sequence_Counter, a)
                                       Next ctr 
                                  '*****
                                    Wend
                                  '****   
                                     'now create the report 
                                    Open "Counting_Letters.txt" For Output As #1
                                    n = 120 'max line length for report
                                  '****
                                    For ctr = lb To ub
                                       pf , Using$("Sequence (#) = #, Letters" , ctr, Sequences(ctr).Seq_Length)
                                       pf Left$(Sequences(ctr).Title, n) 
                                       pf Left$(Sequences(ctr).Source, n) 
                                       pf Left$(Sequences(ctr).Nam, n) 
                                       pf
                                    Next ctr
                                  '****   
                                    'now make header
                                    u$ = " \                 \  "
                                    u1$ = "  ##.##% "
                                    u2$ = "   (#)   "    
                                    u3$ = "     --  "
                                    u4$ = "      *  "
                                    pf Using$(u$, " ") & Space$(2); 'spacer
                                  '****
                                    For ctr = lb To ub
                                       pf Using$(u2$, ctr); 
                                    Next ctr
                                  '****  
                                    pf " "
                                  '***** 
                                   For a = 0 To 255 'letters
                                  '***** 
                                      If Len(Letter_Name$(a)) > 1 Then           
                                         If a = 0 Then s$ = "-" Else s$ = Chr$(a) 'keep chr$(0) out of text
                                         pf s$ & Using$(u$, Letter_Name$(a));
                                  '***** 
                                        For b = lb To ub 'number of sequences
                                  '***** 
                                            If Ltrs(b, a) Then 'not zero
                                              pct! = Ltrs(b, a) / Sequences(b).Seq_Length * 100 
                                  '***** 
                                               If pct! > .005 Then
                                                  pf Using$(u1$, pct!);
                                                 Else 
                                                  pf u4$; 
                                               End If   
                                  '***** 
                                              Else 
                                               pf u3$;
                                            End If 
                                  '***** 
                                        Next b
                                  '***** 
                                        pf " "
                                      End If
                                  '***** 
                                   Next a
                                  '*****
                                    pf
                                    pf , Using$(" Report took #.# seconds to run", Timer - tmr!)
                                    pf "Note - " & u4$  & " means there was at least one occurence of the letter. "
                                    Close #1
                                  End Sub 'Sub GHL_Count_Letters
                                  '
                                  ' *******************************************************
                                  '  Possible letters in Sequences
                                  Sub z_Assign_Letter_Names
                                   ReDim Letter_Name$(255) '255 so as not to get surprised by errant character somewhere down the line                 
                                     For ctr = LBound(Letter_Name$()) To UBound(Letter_Name$())
                                        'Letter_Name$(ctr) = "Weird Unknown"
                                     Next ctr                     
                                     'Acceptable results
                                      Letter_Name$(0)        = "0000 Gap"
                                      Letter_Name$(Asc("A")) = "0001 Adenosine"
                                      Letter_Name$(Asc("B")) = "1110 Not-A"
                                      Letter_Name$(Asc("C")) = "0010 Cytidine"
                                      Letter_Name$(Asc("D")) = "1101 Not-C"
                                      Letter_Name$(Asc("G")) = "0100 Guanosine"
                                      Letter_Name$(Asc("H")) = "1011 Not-G"
                                      Letter_Name$(Asc("K")) = "1100 Keto"
                                      Letter_Name$(Asc("M")) = "0011 Meta"
                                      Letter_Name$(Asc("N")) = "1111 Any / unknown"
                                      Letter_Name$(Asc("R")) = "0101 Purine"
                                      Letter_Name$(Asc("S")) = "0110 Strong"
                                      Letter_Name$(Asc("T")) = "1000 Thymidine"
                                      Letter_Name$(Asc("V")) = "0111 Not-T"
                                      Letter_Name$(Asc("W")) = "1001 Weak"
                                      Letter_Name$(Asc("Y")) = "1010 Pyrimidine"
                                  End Sub
                                  '
                                  ' *******************************************************
                                  '
                                  Sub z_Assign_Fragments
                                    Dim Files$(1 To 4)
                                     Files$(1) = $Data_Location & "ID_E_Coli.txt"   ' 5,528,444 letters With 6,641 Not ACGT's
                                     Files$(2) = $Data_Location & "Id_46442158.txt" '   244,073 letters With 101 Not ACGT's
                                     Files$(3) = $Data_Location & "ID_126212636.txt"' 1,327,701 letters With All ACGT's
                                     Files$(4) = $Data_Location & "ID_126213112.txt"' 2,182,446 letters With All ACGT's
                                     ReDim Sequences(1 To UBound(Files$())) 'set sequences
                                    ReDim Errant_Chars(255) As Long
                                    ErrClear   
                                    Reset File_counter
                                    While File_counter < UBound(Files$()) 'Get files 1 at a time
                                        Incr File_counter 
                                        fle$  =  Files$(File_counter)
                                      '  fle$  =  Files$(3)
                                        Open fle$ For Binary As #1
                                           ln = Lof(#1)
                                           fil$ = Space$(ln)
                                           Get #1, 1, Fil$
                                           Close #1
                                         If Err Then ? "error" & Str$(Err),,Error$(Err):Exit Sub 
                                      ''' get data
                                          s$ = "\Title\" : i = Len(s$) + 1 : 'start of Title 
                                           s1$ = "\End Title\": i1 = InStr(fil$, s1$) - Len(s$) - 1 : 'end of Title 
                                           Sequences(File_counter).Title = Mid$(fil$, i, i1) ' get title
                                       '
                                          s$ = "\Source\": i = InStr(fil$, s$) + Len(s$):
                                           s1$ = "\End Source\": i1 = InStr(fil$, s1$) - i - 1
                                           Sequences(File_counter).Source = Mid$(fil$, i, i1)
                                      '
                                          s$ = "\Name\": i = InStr(fil$, s$) + Len(s$):
                                           s1$ = "\End Name\": i1 = InStr(fil$, s1$) - i - 1
                                           Sequences(File_counter).Nam = Mid$(fil$, i, i1) ' get name
                                      '     
                                           s$ = "\End Name\": i = InStr(fil$, s$) + Len(s$)
                                           s1$ = Mid$(fil$, i + 1) 'rest is sequence
                                  '
                                             'now make consistent between different formats
                                            s1$ = Remove$(s1$, Any $CrLf) 'line feeds
                                            s1$ = Remove$(s1$, Any "0123456789 ") 'numbers or spaces
                                            s1$ = UCase$(s1$)' all upper case 
                                            s1$ = Trim$(s1$)
                                             'Now test for errant data
                                            tmr = Timer
                                            ReDim Errant_Chars(255) 'erase
                                            Reset Err_Char_Ctr
                                            For ctr = 1 To Len(s1$)
                                               a = Asc(Mid$(s1$, ctr, 1))
                                                If a <> 65 And _' A
                                                   a <> 67 And _' C
                                                   a <> 71 And _' G
                                                   a <> 84 Then  'T
                                                 '? Mid$(s1$, ctr - 2, 100),, Using$("Error at #, with #", ctr, a)             
                                                 Incr Errant_Chars(a)'Keep Track
                                                 Incr Err_Char_Ctr
                                                End If 
                                            Next ctr 
                                          Sequences(File_counter).Sequence = s1$ 'rest is sequence
                                          Sequences(File_counter).Seq_Length = Len(s1$)
                                  '    ?   "Title: " & Trim$(Sequences(File_counter).title) & $CrLf & _
                                  '       "Source: " & Trim$(Sequences(File_counter).Source) & $CrLf & _
                                  '         "Name: " & Trim$(Sequences(File_counter).Nam) & $CrLf & _
                                  '       Left$(Sequences(File_counter).Sequence, 500)& $CrLf  & _
                                  '       Using$(" #, long", Len(Trim$(Sequences(File_counter).Sequence))) & $CrLf &  _
                                  '       Using$("  # Errant Characters", Err_Char_Ctr) _
                                  '        ,,fle$ & Using$(" #, bytes in #.## seconds", ln, Timer - tmr)
                                  '       s$ = fle$ & $CrLf 
                                  ''       If Err_Char_Ctr Then
                                  '          For ctr = 0 To 255       
                                  '             If Errant_Chars(ctr) > 0 Then
                                  '               s$ = s$ & Using$(" ### \ \  #, ", ctr, Chr$(ctr), Errant_Chars(ctr)) & $CrLf  
                                  '             End If
                                  '          Next ctr
                                  '          ClipBoard Set Text s$ & $CrLf 
                                  '          ? s$,, fle$
                                  ''       End If                 
                                  
                                     Open $Data_Location & "Cleaned_Data.txt" For Binary As #1
                                         Put #1, 1, Sequences()
                                         Close #1
                                      cb$ = cb$ & Using$("##) ", File_counter) & Files$(File_counter) & $CrLf & _
                                                  Using$("    #,###,### letters with #, Not ACGT's", Sequences(File_counter).Seq_Length, Err_Char_Ctr) & $CrLf & _
                                                  " Title: " & Trim$(Sequences(File_counter).Title) & $CrLf & _
                                                  "Source: " & Trim$(Sequences(File_counter).Source) & $CrLf & _
                                                  " Named: " & Trim$(Sequences(File_counter).Nam) & $CrLf & $CrLf 
                                     Wend
                                     ClipBoard Set Text cb$
                                  End Sub 
                                  '
                                  Sub z_Get_Cleaned_Data
                                     ReDim Sequences(1 To 4) 'see Assign Fragments for number of sequences used
                                     Open $Data_Location & "Cleaned_Data.txt" For Binary As #1
                                         Get #1, 1, Sequences()
                                         Close #1
                                  End Sub
                                  '
                                  ' *******************************************************
                                  '
                                  Function PBMain () As Long
                                   'setting up
                                    Local tmr As Single
                                    tmr = Timer
                                     Call z_Assign_Letter_Names
                                  '   Call Assign_Fragments 'used to create Clean Data. save 2.5 secs each run using Cleaned Data
                                      z_Get_Cleaned_Data 'quicker after fragments already set up and cleaned
                                  '
                                  ' *******************************************************
                                  '
                                  ' call code tests here
                                     Call GHL_Count_Letters
                                    ? "finished" 
                                  End Function
                                  '
                                  (See next post for report generated.)

                                  If any want to play along, you can dl the same sequences used here from the link at the top of the code. Also you can use this code as a shell and just drop in your own subs/functions using the Sequences UDT array as data. That way only the sub has to be posted here. Just prefix it with your initials.
                                  It's a pretty day. I hope you enjoy it.

                                  Gösta

                                  JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                  LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                  Comment


                                  • #18
                                    And here's the report generated above: (It really sucks this editor doesn't line up the data, even using courier font. Opps, changed from [ quote ] to code and it does.)
                                    Code:
                                     
                                                  Sequence (1) = 5,528,444 Letters
                                     E Coli strand                                                                                                          
                                     [URL]http://www.virus-evolution.org/Downloads/[/URL]                                                                              
                                     Ecoli Sequence                                                                                                         
                                                  Sequence (2) = 244,073 Letters
                                     Candida albicans SC5314 chromosome 7 Ctg19-10262, whole genome shotgun sequence                                        
                                     [URL]http://www.ncbi.nlm.nih.gov/nuccore/46442158[/URL]                                                                           
                                     GenBank: AACQ01000024.1                                                                                                
                                                  Sequence (3) = 1,327,701 Letters
                                     Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.2, whole genome shotgun sequence                                      
                                     [URL]http://www.ncbi.nlm.nih.gov/nuccore/126212636[/URL]                                                                          
                                     GenBank: AAVQ01000002.1                                                                                                
                                                  Sequence (4) = 2,182,446 Letters
                                     Pichia stipitis CBS 6054 chromosome 1 PICSTchr_1.1, whole genome shotgun sequence                                      
                                     [URL]http://www.ncbi.nlm.nih.gov/nuccore/126213112[/URL]                                                                          
                                     Gene ID:4850976                                                                                                        
                                                               (1)      (2)      (3)      (4)    
                                    - 0000 Gap                  --       --       --       --   
                                    A 0001 Adenosine         24.78%   33.09%   29.41%   29.38%  
                                    B 1110 Not-A                 *       --       --       --   
                                    C 0010 Cytidine          25.21%   17.09%   20.51%   20.45%  
                                    D 1101 Not-C                 *       --       --       --   
                                    G 0100 Guanosine         25.17%   17.25%   20.72%   20.71%  
                                    H 1011 Not-G                 *       --       --       --   
                                    K 1100 Keto                  *       --       --       --   
                                    M 0011 Meta                  *       --       --       --   
                                    N 1111 Any / unknown      0.08%    0.04%      --       --   
                                    R 0101 Purine             0.01%      --       --       --   
                                    S 0110 Strong             0.01%      --       --       --   
                                    T 1000 Thymidine         24.72%   32.52%   29.36%   29.46%  
                                    V 0111 Not-T                 *       --       --       --   
                                    W 1001 Weak                  *       --       --       --   
                                    Y 1010 Pyrimidine         0.02%      --       --       --   
                                                   Report took 1.7 seconds to run
                                    Note -       *   means there was at least one occurence of the letter.
                                    Note the similarity of occurences in 3 & 4 (which is to be expected given their names). These are all, I think, variations of the Ecoli virus.

                                    Also Dr. Oz had a really neat explanatory simulation of a "retrovirus" insinuating itself into a person's DNA on his show (today?). Pretty cool for an ignorant like myself.

                                    What would be neat would be to get a copy of Thomas Jefferson's dna and try to match it to Sally Hemmings. A fun exercise.

                                    ========================================
                                    "A husband is what is left of the lover
                                    after the nerve has been extracted."
                                    Helen Rowland (1876-1950)
                                    ========================================
                                    Last edited by Gösta H. Lovgren-2; 3 Dec 2009, 09:32 PM. Reason: Boundary alignment
                                    It's a pretty day. I hope you enjoy it.

                                    Gösta

                                    JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                    LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                                    Comment


                                    • #19
                                      Hello All,

                                      First a bit about me: I'm a long-time lurker and a first-time poster on these forums. I'm also a molecular biologist and biochemist who uses pattern-matching tools of the type being discussed here. I use PB/CC to do stochastic simulations of a complicated gene regulatory network. I'm not a very accomplished programmer but good enough to do what I need; I'm definitely an amateur, in the terms somebody used here, in that I'm the only one who needs my program to run.

                                      Pattern-matching routines of the type being discussed here are a mainstay of a large and well-developed field called "bioinformatics". Somebody very sensibly asked here what types of uses these could be put to; here are my two cents about a tentative answer to that question. A couple of resources to help get towards an answer are the following.

                                      First is a book by my long-time colleague David Mount, described at



                                      The first review on the Amazon site gives some indication of the scope of this book. It will also give you a feeling for the types of issues that the field addresses. I wouldn't be too concerned about the posts moaning about the book being unclear; in my (likely biased) viewpoint, these may well represent the reactions of people who find the topic difficult, but are of the "It Can't Be My Fault" persuasion (this being the Age of Entitlement after all).

                                      Second is a public web site, run from a server at the National Institutes of Health (NIH), which is the main source of funding in the US for biomedical research:

                                      The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.


                                      Poking around with these tools will give those interested a sense of the types of comparisons that users of pattern-matching tools like to be able to make.

                                      One of the links at this site, "protein blast", leads to issues dealing with a comparable but more complex set of pattern-matching problems than the one this thread has addressed. This has to do with another component of cells, namely proteins. Like DNA, these can be thought of as linear strings of symbols. The symbols are called "amino acids", and there are twenty of them, rather than the 4 "bases" of DNA. Proteins are generally shorter than DNA, typically in the range of 100-2000 amino acids in length. That might be thought to simplify the pattern-matching problem. Compensating for that, however, is the larger number of amino acids, and the fact, harder to deal with, that certain of them are usually thought to be "similar" though not identical. [For the aficionado, this means that certain pairs or groups of amino acids have chemically similar behavior, so that changing one to a similar one might be (but isn't invariably) innocuous.] There are different conventions as to what constitutes "similarity", as well. David's book discusses this at length.

                                      One fairly typical use of pattern-matching for proteins has to do with trying to discern evolutionary relationships among different organisms. Suppose that you are dealing with a particular protein from a bunch of related species. During the course of evolution, this protein is likely to have changed; the longer ago the organisms diverged (we're talking time scale of millions of years here), the more different the protein will typically be. So pattern-matching programs can be used to compare them and give some kind of alignment which will help answer this question. An example is described at http://www.clustal.org/

                                      I don't know enough to say technically how these pattern-matching programs work. David Mount's book will give a much more authoritative description. The bottom line is that this is a very large and well-developed field. If slick programming can make the tools 10 or 100x faster, it would find very wide and welcome application.

                                      Cheers
                                      John Little

                                      Comment


                                      • #20
                                        Welcome!

                                        Hey, John,

                                        Welcome to the foreground!

                                        That's one cool first post! Thanks for all that background info - it'll keep me busy reading thru the holidays.

                                        -John

                                        Comment

                                        Working...
                                        X