sorry if this has been covered before.
If I had a long text string (eg. 5000 characters), and I wanted to find out whether there was a partially matching sequence of characters in a second string of 5000 characters, are there any ways of speeding up the matching process. The basic search method, I imagine, would involve scanning every combination of consecutive letters in the two strings until a matching run was found, for example, a required match of 20 letters from 30. This is horribly time consuming, and the time taken goes up proportional to the square of the string lengths.
It occurred to me that if INSTR has an additional parameter that specified the number of mismatches that would be tolerated, it would be very easy to cut one string into 30 character partly overlapping fragments (search_fragment$), and simply test each one against the other string (search_string$):
eg.
match_pos = INSTR(search_string$, search_fragment$, 10)
where 10 is the number of mismatches allowed, and match_pos would be assigned the first matching position.
INSTR obviously doesn't work in this way, and none of the matching functions tolerate mismatches as would be required here.
I realise this is a simple problem that must have been solved elegantly before (probably in the 1950s or 1960s!), so any thoughts on quicker strategies would be useful.
Thanks
Peter
If I had a long text string (eg. 5000 characters), and I wanted to find out whether there was a partially matching sequence of characters in a second string of 5000 characters, are there any ways of speeding up the matching process. The basic search method, I imagine, would involve scanning every combination of consecutive letters in the two strings until a matching run was found, for example, a required match of 20 letters from 30. This is horribly time consuming, and the time taken goes up proportional to the square of the string lengths.
It occurred to me that if INSTR has an additional parameter that specified the number of mismatches that would be tolerated, it would be very easy to cut one string into 30 character partly overlapping fragments (search_fragment$), and simply test each one against the other string (search_string$):
eg.
match_pos = INSTR(search_string$, search_fragment$, 10)
where 10 is the number of mismatches allowed, and match_pos would be assigned the first matching position.
INSTR obviously doesn't work in this way, and none of the matching functions tolerate mismatches as would be required here.
I realise this is a simple problem that must have been solved elegantly before (probably in the 1950s or 1960s!), so any thoughts on quicker strategies would be useful.
Thanks
Peter
Comment