Announcement

Collapse
No announcement yet.

Searching with mismatches

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Searching with mismatches

    sorry if this has been covered before.

    If I had a long text string (eg. 5000 characters), and I wanted to find out whether there was a partially matching sequence of characters in a second string of 5000 characters, are there any ways of speeding up the matching process. The basic search method, I imagine, would involve scanning every combination of consecutive letters in the two strings until a matching run was found, for example, a required match of 20 letters from 30. This is horribly time consuming, and the time taken goes up proportional to the square of the string lengths.

    It occurred to me that if INSTR has an additional parameter that specified the number of mismatches that would be tolerated, it would be very easy to cut one string into 30 character partly overlapping fragments (search_fragment$), and simply test each one against the other string (search_string$):

    eg.

    match_pos = INSTR(search_string$, search_fragment$, 10)

    where 10 is the number of mismatches allowed, and match_pos would be assigned the first matching position.

    INSTR obviously doesn't work in this way, and none of the matching functions tolerate mismatches as would be required here.

    I realise this is a simple problem that must have been solved elegantly before (probably in the 1950s or 1960s!), so any thoughts on quicker strategies would be useful.

    Thanks

    Peter

  • #2
    Might this thread help?

    http://www.powerbasic.com/support/pb...on+subsequence
    <b>George W. Bleck</b>
    <img src='http://www.blecktech.com/myemail.gif'>

    Comment


    • #3
      other considerations;

      REGEXPR might be best
      VERIFY([start&,] MainString, MatchString)

      (using ANY option)
      TALLY(MainString, [ANY] MatchString)
      INSTR([Pos&,] Main$, [ANY] Match$)

      Suffix tree; http://en.wikipedia.org/wiki/Suffix_tree
      Suffix array; http://en.wikipedia.org/wiki/Suffix_array
      Burrows–Wheeler transform; value
      stanthemanstan~gmail
      Dead Theory Walking
      Range Trie Tree
      HLib ~ Free Data Container Lib ~ Arrays, Lists, Stacks, Queues, Deques, Trees, Hashes

      Comment


      • #4
        How many matching sequences are you looking for, all of them or just the first one?

        If you want 'em all, then there is a good chance that quite a few single characters will have lots of matches, so you may want to specify a mimimum length.

        How can 30 equally sized overlapping fragments in a string of 5000 characters contain all the possible candidates for a substring match?

        Comment


        • #5
          Add one more option... "or pretty close"... and you understand why Google(r) pays its lexer coders well.
          Michael Mattias
          Tal Systems (retired)
          Port Washington WI USA
          [email protected]
          http://www.talsystems.com

          Comment


          • #6
            Ideally it should find all of them and record the one with the highest number of matches

            The idea is to find matches of, for example, 20 matches from 30 aligned letters in the two strings. The problem is that the matched letters need not be consecutive, so its not really a case of finding minimum length matches as you would in REGEXPR.

            Ill try to look at the examples of tree searching methods, but this look a bit beyond me!

            Thanks for your help

            Peter

            Comment


            • #7
              Originally posted by Peter Simmonds View Post
              The problem is that the matched letters need not be consecutive...
              Peter
              I'm not understanding that. If the letters are not consecutive, how can they match?

              Example code would be appreciated. Preferably compileable. Which would probably spawn a dozen different answers from half as many sources. (Everybody having more than one way to skin a cat.)

              =========================================
              A vacant mind invites dangerous inmates.
              (Nicholas Hilliard)
              =========================================
              It's a pretty day. I hope you enjoy it.

              Gösta

              JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
              LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

              Comment


              • #8
                Peter, do you mean that, for example, ABC would match with xAxBxCx, where x represents zero or more charcters which are not A, B or C?

                Comment


                • #9
                  Hi

                  sorry if this was unclear. For the hypothetical example of the 20 matches from 30 numbers or letters, the following would be a match:

                  Code:
                  1234567890123456789012345678901234567890123456780
                        ***   * * ****  * * *********  ***      *
                  3456217893563357789033315778901234523890455287744
                  since there is a region where enough numbers are the same within a 30 letter window.

                  Principle is that the matches don't have to be consecutive so the minimum length methods cannot be used.

                  I could post code for the slow way of doing this, but it really is rubbish when scaled up!

                  Peter

                  Comment


                  • #10
                    Pseudo code:
                    Code:
                    IF LEN(searchme$)<=LEN(matchme$) THEN
                      FOR c=1 TO LEN(searchme$)-29
                        mismatch=0
                        match=0
                        FOR cc=c TO c+29
                          IF MID$(searchme$,cc,1)<>MID$(matchme$,cc,1) THEN
                            INCR mismatch
                            IF mismatch=11 THEN EXIT FOR
                          ELSE
                            INCR match
                          END IF
                        NEXT cc
                        IF mismatch>10 THEN ITERATE FOR
                        'at this point save your 'c' value or do whatever you want to do with a
                        'matched string
                      NEXT c
                    END IF
                    Rod
                    In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

                    Comment


                    • #11
                      Originally posted by Peter Simmonds View Post
                      Hi

                      sorry if this was unclear. For the hypothetical example of the 20 matches from 30 numbers or letters, the following would be a match:

                      Code:
                      1234567890123456789012345678901234567890123456780
                            ***   * * ****  * * *********  ***      *
                      3456217893563357789033315778901234523890455287744
                      since there is a region where enough numbers are the same within a 30 letter window.

                      Principle is that the matches don't have to be consecutive so the minimum length methods cannot be used.

                      I could post code for the slow way of doing this, but it really is rubbish when scaled up!

                      Peter
                      Post the code for the "slow" way. I, for one, still don't see what you are getting at. Probably because I am a "slow" myself. Do the stars represent where there is a match in the two strings?

                      Okay, I printed in a text editor and think I see what you mean.

                      1234567890123456789012345678901234567890123456780
                      3456217893563357789033315778901234523890455287744
                      ------***-------****------*********--***------*--




                      Still code would be a better place to start.

                      ======================================
                      "Few things are harder to put up with
                      than a good example."
                      Mark Twain (1835-1910)
                      ======================================
                      Last edited by Gösta H. Lovgren-2; 30 Oct 2009, 08:07 AM.
                      It's a pretty day. I hope you enjoy it.

                      Gösta

                      JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                      LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

                      Comment


                      • #12
                        agrep might be useful in this situation.
                        Erich Schulman (KT4VOL/KTN4CA)
                        Go Big Orange

                        Comment


                        • #13
                          Something like this?
                          Code:
                          PBCC5 program
                          #COMPILE EXE
                          
                          
                          #REGISTER NONE
                          FUNCTION PBMAIN () AS LONG
                          
                          'create a test string
                          FOR r& = 1 TO 5000
                              a$=a$+CHR$(RND(97,122))
                          NEXT
                          
                          'create a similar string with a mismatches
                          b$=a$
                          FOR r& = 1 TO 3000
                              MID$(b$,RND(1,LEN(b$)),1)=CHR$(RND(97,122))
                          NEXT
                          k&= GoodMatch(a$,b$,30,6)
                          PRINT k&
                          PRINT MID$(a$,k&,30)
                          PRINT MID$(b$,k&,30)
                          
                          k&= GoodMatch(a$,b$,30,7)
                          PRINT k&
                          PRINT MID$(a$,k&,30)
                          PRINT MID$(b$,k&,30)
                          
                          k&= GoodMatch(a$,b$,30,8)
                          PRINT k&
                          PRINT MID$(a$,k&,30)
                          PRINT MID$(b$,k&,30)
                          
                          k&= GoodMatch(a$,b$,30,9)
                          PRINT k&
                          PRINT MID$(a$,k&,30)
                          PRINT MID$(b$,k&,30)
                          
                          k&= GoodMatch(a$,b$,30,10)
                          PRINT k&
                          PRINT MID$(a$,k&,30)
                          PRINT MID$(b$,k&,30)
                          
                          
                          
                          WAITKEY$
                          END FUNCTION
                          
                          
                          FUNCTION GoodMatch(StringToSearch AS STRING, WhatToSearchFor AS STRING, SegmentSize AS LONG,MismatchesAllowed AS LONG) AS LONG
                              
                              FOR r& = 1 TO SegmentSize
                                  IF MID$(StringToSearch,r&,1) = MID$(WhatToSearchFor,r&,1) THEN
                                      MatchCount& = MatchCount& + 1
                                  END IF
                              NEXT
                              
                              FOR r& = SegmentSize+ 1 TO LEN(StringToSearch)
                                  IF MID$(StringToSearch,r&,1) = MID$(WhatToSearchFor,r&,1) THEN
                                      MatchCount& = MatchCount& + 1
                                  END IF
                                  IF MID$(StringToSearch,r&-SegmentSize,1) = MID$(WhatToSearchFor,r&-SegmentSize,1) THEN
                                      MatchCount& = MatchCount& - 1
                          
                                  END IF
                          
                                  IF SegmentSize - MatchCount& < MismatchesAllowed THEN
                                      FUNCTION = r&-SegmentSize
                                      EXIT FUNCTION
                                  END IF
                              NEXT
                                  
                              FUNCTION = 0
                              
                          END FUNCTION

                          Comment


                          • #14
                            Paul

                            It's more complicated than that, the program you wrote assumes that the two text strings are aligned, whereas what it really needs to do is find a 30/20 match anywhere in the StringToSearch and WhatToSearchFor strings, for example there may be match between a window starting at position 500 in StringToSearch and at 2544 in WhatToSearchFor.

                            Unless I am stupidly missing something in your program which is always possible!

                            Peter

                            Comment


                            • #15
                              The slow way to do it!

                              As requested, here is the simple way to find the match, but you'll notice the three nested loops which makes the search time proportion to:

                              the length of search string * length of the query string * required match length

                              which means that it scales very poorly for large searches.

                              Code:
                              #COMPILE EXE
                              DEFLNG a-z
                              
                              FUNCTION PBMAIN () AS LONG
                              
                                  CLS: PRINT "Sequence scan"
                              
                                  REM Just some example text from the helpfile
                                  query$ = "If the ANY keyword is included, MatchStr  specifies a list of single characters to be searched for individually, a match on any one of which will cause the"
                                  search$ = "starting with its first character (or the character specified by start). EXTRACT$ returns single characters of MainStr, "
                              
                                  REM specifying what match is required, ie at least 21 characters out of a text window of 30
                                  match_length = 30: mismatch_target = 9: REM maximum number of tolerated mismatches, ie. 30 minus 21
                              
                                  REM scan individual characters within the window
                                  FOR a = 1 TO LEN(query$) - match_length
                              
                                      query_fragment$ = MID$(query$, a, match_length)
                              
                                      FOR b = 1 TO LEN(search$) - match_length
                              
                                          search_fragment$ = MID$(search$, b, match_length)
                              
                                          REM Initalise counters
                                          mismatch_total = 0: scan_total = 0: match_found = 0: REM to keep running totals
                              
                                          REM Scan subfragments from search$ and query$
                                          FOR c = 1 TO match_length
                              
                                              INCR scan_total
                                              IF UCASE$(MID$(query_fragment$, c, 1)) <> UCASE$(MID$(search_fragment$, c, 1)) THEN INCR mismatch_total
                              
                                              REM early exits if too many mismatches or match total gone beyond maximum needed
                                              IF mismatch_total > mismatch_target THEN EXIT FOR
                                              IF scan_total - mismatch_total >= match_length - mismatch_target THEN match_found = 1: EXIT FOR
                              
                                          NEXT c
                                                  
                                          REM Found a match
                                          IF match_found THEN GOSUB Report_match
                              
                                      NEXT b
                              
                                  NEXT a
                              
                                  PRINT: PRINT "The end"
                                  WAITKEY$
                              
                                  EXIT FUNCTION
                              
                              Report_match:
                              
                                  PRINT: PRINT "Match found at"; a; "in the query sequence and"; b; "in the search sequence"
                                  PRINT query_fragment$
                              
                                  FOR d = 1 TO match_length
                                      IF UCASE$(MID$(query_fragment$, d, 1)) = UCASE$(MID$(search_fragment$, d, 1)) THEN PRINT "|"; ELSE PRINT " ";
                                  NEXT d
                              
                                  PRINT: PRINT search_fragment$
                                  
                                  WAITKEY$
                              
                              RETURN
                              
                              END FUNCTION
                              (Written in PBCC 4.0)

                              Peter
                              Last edited by Peter Simmonds; 30 Oct 2009, 11:44 AM. Reason: Error in REM statement

                              Comment


                              • #16
                                A faster "slow" way, but I can't see a better "not slow" way right now.
                                Code:
                                #COMPILE EXE
                                DEFLNG a-z
                                
                                FUNCTION PBMAIN () AS LONG
                                   LOCAL searchFragPtr, queryFragPtr AS BYTE PTR, printThin AS LONG
                                
                                    CLS: PRINT "Sequence scan"
                                
                                    REM Just some example text from the helpfile
                                    'Make all capital from the beginning and save a lot of calculating time.
                                    query$ = REPEAT$(50, UCASE$("If the ANY keyword is included, MatchStr  specifies a list of single characters to be searched for individually, a match on any one of which will cause the"))
                                    search$ = REPEAT$(50, UCASE$("starting with its first character (or the character specified by start). EXTRACT$ returns single characters of MainStr, "))
                                
                                    REM specifying what match is required, ie at least 21 characters out of a text window of 30
                                    match_length = 30: mismatch_target = 9: REM maximum number of tolerated mismatches, ie. 30 minus 21
                                
                                    REM scan individual characters within the window
                                    FOR a = 1 TO LEN(query$) - match_length
                                
                                        query_fragment$ = MID$(query$, a, match_length)
                                        queryFragPtr = STRPTR(query_fragment$)
                                
                                        FOR b = 1 TO LEN(search$) - match_length
                                
                                            search_fragment$ = MID$(search$, b, match_length)
                                            searchFragPtr = STRPTR(search_fragment$)
                                
                                            REM Initalise counters
                                            mismatch_total = 0: scan_total = 0: match_found = 0: REM to keep running totals
                                
                                            REM Scan subfragments from search$ and query$
                                            FOR c = 1 TO match_length
                                
                                                INCR scan_total
                                                'You have a choice here of ASC or even faster, index pointers.
                                                IF @searchFragPtr[c-1] <> @queryFragPtr[c-1]  THEN INCR mismatch_total
                                '               IF ASC(query_fragment$, c) <> ASC(search_fragment$, c) THEN INCR mismatch_total
                                
                                                REM early exits if too many mismatches or match total gone beyond maximum needed
                                                IF mismatch_total > mismatch_target THEN EXIT FOR
                                                IF scan_total - mismatch_total >= match_length - mismatch_target THEN match_found = 1: EXIT FOR
                                
                                            NEXT c
                                
                                            REM Found a match
                                            IF match_found THEN GOSUB Report_match
                                
                                        NEXT b
                                
                                    NEXT a
                                
                                    PRINT: PRINT "The end"
                                    WAITKEY$
                                
                                    EXIT FUNCTION
                                
                                Report_match:
                                
                                    INCR printThin
                                    IF (printThin AND 31) = 0 THEN    'only print 1 of every 32 matches
                                       PRINT: PRINT "Match found at"; a; "in the query sequence and"; b; "in the search sequence"
                                       PRINT query_fragment$
                                
                                       FOR d = 1 TO match_length
                                           IF UCASE$(MID$(query_fragment$, d, 1)) = UCASE$(MID$(search_fragment$, d, 1)) THEN PRINT "|"; ELSE PRINT " ";
                                       NEXT d
                                
                                       PRINT: PRINT search_fragment$
                                       WAITKEY$
                                    END IF
                                
                                
                                RETURN
                                
                                END FUNCTION

                                Comment


                                • #17
                                  btw, it is usually difficult to break out of the O(n²) time complexity prison. Ya just don't know tho, a lot of ideas arise from a lot o' gray matter here.

                                  Comment


                                  • #18
                                    How many distinct characters are there in either string? 10 number characters, Upper case letters, lower case letters, punctuation characters, etc.?
                                    You might try a density function to verify that there are 21 characters in the match string segment that match the search string segment before checking for matching positions.
                                    Rod
                                    In some future era, dark matter and dark energy will only be found in Astronomy's Dark Ages.

                                    Comment


                                    • #19
                                      Hi

                                      I used a text example in the posted program but actually the real implementation is for DNA sequence matching where there are only 4 different characters (G, A, T and C and perhaps a fifth for undetermined). This might simplify the problem although a general algorithm might be better in the long run.

                                      Peter

                                      Comment


                                      • #20
                                        This might simplify the problem although a general algorithm might be better in the long run.
                                        Better for whom? Optimization is always application-specific.

                                        I'm not a pattern-match guy, but I am cerrtain limiting the target character set to this small subset of the ANSI characters can vastly simplify the solution, but it may require an entirely fresh approach to take advantage of this previously-unstated fact.

                                        MCM
                                        Michael Mattias
                                        Tal Systems (retired)
                                        Port Washington WI USA
                                        [email protected]
                                        http://www.talsystems.com

                                        Comment

                                        Working...
                                        X