Announcement

Collapse

New Sub-Forum

In an effort to help make sure there are appropriate categories for topics of discussion that are happening, there is now a sub-forum for databases and database programming under Special Interest groups. Please direct questions, etc., about this topic to that sub-forum moving forward. Thank you.
See more
See less

Searching for Algorithm for Fast Lookup in String-Data

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Searching for Algorithm for Fast Lookup in String-Data

    I'm working on a project where I need to check for the occurrence of a string in tables with several thousand values. For speed reasons I would like to include this information directly into the code. In order not to start from scratch I'm looking for code able to do that.

    The first Idea that occurred to me was to define a string literal containing the table and to search the table with InStr. As far as it can be told now, I'll only need the information whether strToSearch is contained or not. I think InStr will produce fast code for this kind of search. I use a database elsewhere in the project (Tsunami-DB TrmPro.DLL), but I think it's use would be far slower than the simple string search directly in the RAM:
    Code:
    $searchedString=";0050;0051;0052;0053;0054;0055;0061;0062;0063;0064;0065;0066;0070;0071;0072;0073;0080;0081;0082;0083;0084;0101;0111;0112;0113;0114;0115;0118;0119;0121;0122;0123;0124;0125;0131;0132;0139;0141;0142;0151;0152;0153;0159;016;0201;0202;0203;0204;0205;0206;0207;0211;0212;0213;0214;022;0231;0232;0233;0234;0235;0239;0242;0243;0291;0292;0293;0294;0295;0296;0299;0301;0302;0309;031;0321;0329;0332;0339;034;0351;0352;0353;0359;036;0371;0372;0379;0393;0394;0396;0397;0398;0399;0401;0402;0403;0404;0405;0406;0407;0411;0412;0419;042;043;0441;0442;0443;0444;0449;045;046;0471;0472;0473;0474;0475;0476;0479;0480;0489;0491;0492;0493;0499;050;0511;0519;0521;0522;0523;0524;0525;0529;0532;0539;0581;0589;059;0601;0602;0609;0611;0612;0613;0619;062;0631;0639;064;0650;0651;0652;066;067;"
    
    strToSearch = ";0061;"
    
    IF InStr(1, $searchedString, strToSearch) > 0 THEN
      MsgBox "Data found"
    ELSE
      MsgBox "Data not found"
    END IF
    This will not work, as the complier restricts the input to 256 characters per line.


    A workaround could look as shown in the next sample. But: I do not like this idea - this will generate code that needs to be executed at least once when my .DLL is loaded. Considering the size of my tables (look further below) I expect this initialization to be time consuming.
    Code:
    searchedString=";0050;0051;0052;0053;0054;0055;0061;0062;0063;0064;0065;0066;0070;0071;0072;0073;0080;0081;0082;0083;0084;0101;0111;0112;0113;0114;0115;0118;0119;0121;0122;0123;0124;0125;0131;0132;0139;0141;"
    searchedString=searchedString&"0142;0151;0152;0153;0159;016;0201;0202;0203;0204;0205;0206;0207;0211;0212;0213;0214;022;0231;0232;0233;0234;0235;0239;0242;0243;0291;0292;0293;0294;0295;0296;0299;0301;0302;0309;"
    searchedString=searchedString&"031;0321;0329;0332;0339;034;0351;0352;0353;0359;036;0371;0372;0379;0393;0394;0396;0397;0398;0399;0401;0402;0403;0404;0405;0406;0407;0411;0412;0419;042;043;0441;0442;0443;0444;0449;"
    searchedString=searchedString&"045;046;0471;0472;0473;0474;0475;0476;0479;0480;0489;0491;0492;0493;0499;050;0511;0519;0521;0522;0523;0524;0525;0529;0532;0539;0581;0589;059;0601;0602;0609;0611;0612;0613;0619;062;"
    searchedString=searchedString&"0631;0639;064;0650;0651;0652;066;067;"
    
    strToSearch = ";0061;"
    
    IF InStr(1, searchedString, strToSearch) > 0 THEN
      MsgBox "Data found"
    ELSE
      MsgBox "Data not found"
    END IF
    I showed here a sample only. In reality 'searchedString' will be 10'000-20'000 characters and I have several different 'searchedString', totaling to approximately 40'000 characters, but quite different in size (from 200 to the said 10'000-20'000 characters). The input data might change in future and become larger than 64k. Right now I do not expect this to happen for the next 2 years or so and could life with a 64k limit.

    I think I'm not the first one with such a problem. Who can refer me to a possible solution?

    Walter

  • #2
    > 256 characters per line.

    Use the line continuation character (the underscore) to get around that.
    Code:
    $MyString = "Hello (etc etc etc)" + _
                "World (etc.)
    The compiler will build the long string at compile time.

    [ADDED] You might also consider storing the long string in a stand-alone file, and loading it from the disk into a global variable when your DLL initializes.

    -- Eric
    Last edited by Eric Pearson; 17 Oct 2007, 07:40 AM.
    "Not my circus, not my monkeys."

    Comment


    • #3
      Is the string data sorted?
      Roy Cline

      Comment


      • #4
        @Eric
        Thank you for the idea - never thought about using the line continuation character in string literal, but will test that.
        Storing as standalone file seems not very practical, as the structure of 'searchedString' is not fixed. It would force me to store the length of each 'searchedString' along with the stringdata and to use an intelligent and time consuming algorithm to load the data. Idea is feasible, but does not seem cute and simple enough.

        @Roy
        It would be no problem to sort the data in any desired manner. I already transfered the data into an MS-Access table and will have to treat the different sections of this table before I can transform them into my PB code.

        When using a sorting algorithm, it must be taken into consideration, that the strings are different in length: 3 to 5 characters.

        Walter

        Comment


        • #5
          If it is feasible to sort the data and use binary look up.

          If data changes then this may or may not be the quickest.

          If data is different each time then a lineral search is proable
          the way to go (i.e. instring).
          Roy Cline

          Comment


          • #6
            @Eric
            There seems to b a limit for the length of string literals (assumed: 256 characters). I got the message 'Invalid String Length' when I tried to compile it.

            Code:
            #COMPILE EXE
            #DIM ALL
            
            $Lexcendo    = "0050;0051;0052;0053;0054;0055;0061;0062;0063;0064;0065;0066;0070;0071;0072;0073;0080;0081;0082;" & _
                           "0083;0084;0101;0111;0112;0113;0114;0115;0118;0119;0121;0122;0123;0124;0125;0131;0132;0139;0141;" & _
                           "0142;0151;0152;0153;0159;016;0201;0202;0203;0204;0205;0206;0207;0211;0212;0213;0214;022;0231;0232;" 
            
            FUNCTION PBMAIN () AS LONG
            
            DIM ix AS INTEGER
              MSGBOX "Start"
              FOR ix=1 TO 10
                IF INSTR(1, $Lexcendo, ";9972;" THEN
                END IF
              NEXT ix
              MSGBOX "Stop"
            END FUNCTION
            Last edited by Walter Schütz; 17 Oct 2007, 09:20 AM.

            Comment


            • #7
              @Roy
              Data only changes between releases of my DLL. I decided to use the literal search as I assumed it will be quite fast:
              In the worst case it will need to compare 10'000 Bytes for my worst (average for 20'000 character string) 'searchedString'. Binary search would need approx. 12-13 iterations to find the solution.
              But most of the 'searchedString' are much shorter than these 20'000 characters. I think for short and medium strings a search by InStr will very fast (if it gets translated to optimzed assembly statemnets – what assume for PB)

              Beside that: this will not solve the initialization problem. Initialization must be very effective as well: sometimes the string is searched just once, in other situations several 10'000 times for one loading of the DLL. So initialization needs to be effective as well.

              Comment


              • #8
                Walter,

                "String equates are individually limited to 255 characters. " (PB Help file)

                So use a normal string assignment, say in PBMAIN, e.g.: Lexcendo$ = "....
                Rick Angell

                Comment


                • #9
                  >and loading it from the disk into a global variable when your DLL initializes.

                  If the string is constant (as are most data tables) , I think this application is screaming for a User-Defined resource to eliminate the external disk file....

                  User-Defined Resource Demo January 26 2003

                  (Oh look: a function named "ResourceAsString". That sounds promising)

                  MCM
                  Michael Mattias
                  Tal Systems Inc. (retired)
                  Racine WI USA
                  [email protected]
                  http://www.talsystems.com

                  Comment


                  • #10
                    Binary Search

                    Yep a binary search would be best IF the data can be sorted. You also need to be careful about variable length strings when sorting.

                    Unless the data needed to be hidden I would maintain a file of string values already in sorted order and load them into an array at program startup.

                    A binary search would likely yield a hit in maybe only 15 comparisons of a 50,000 item list (if I did my Log base 2 calcs right).

                    If you want the theory here are some links:

                    Wikipedia on binary search (with pseudo code)
                    http://en.wikipedia.org/wiki/Binary_search

                    Log base 2 conversion for the math
                    http://www2.sims.berkeley.edu/course...Base2Logs.html
                    Mark Strickland, CISSP, CEH
                    SimplyBASICsecurity.com

                    Comment


                    • #11
                      > A binary search .....If you want the theory ..pseudo code..

                      Well, theory and pseudo-code are fine, but If you want actual working PowerBASIC code you could get it right here at Binary Search of an array February 14 2000, July 15 2003

                      MCM
                      Michael Mattias
                      Tal Systems Inc. (retired)
                      Racine WI USA
                      [email protected]
                      http://www.talsystems.com

                      Comment


                      • #12
                        Walter

                        Is the actual searchedString data only numbers, as you illustrated, or is it alpha numeric with some regular pattern or relationship that would lend itself to a simple T/F bit array? This is to say that if the actual "strings" in any searchedString represent the available values from a fixed possible world of values there may be other ways to get at the result desired directly.
                        Rick Angell

                        Comment


                        • #13
                          That demo data, while portrayed as character strings, sure look numeric.

                          Maybe you could either...
                          1. Load the data table once, converting strings to numbers, save in GLOBAL or STATIC variable and use numbers for future comparisons
                          OR
                          2. When you build your data tables as files (used either externally or as a resource), do the numeric conversion and sorting at that time so you don't have to do it all at runtime. (I do this. I figure as long as I have to create a file to use as a data table resource anyway, I may as well do this part at that time).

                          But the real key to performance here is .. using numbers instead of strings if at all possible.


                          MCM
                          Michael Mattias
                          Tal Systems Inc. (retired)
                          Racine WI USA
                          [email protected]
                          http://www.talsystems.com

                          Comment


                          • #14
                            Walter, if a little asm isn't too worrisome, you can easily include a compiled string of large size into your code (I've squeezed over 10MB in there) by using:

                            !db

                            That's data in BYTE format that will be stored at compile time.

                            Example:
                            Your original string was:
                            Code:
                            $searchedString=";0050;0051;0052;0053;0054;0055;0061;0062;0063;0064;0065;0066;0070;0071;0072;0073;0080;0081;0082;0083;0084;0101;0111;0112;0113;0114;0115;0118;0119;0121;0122;0123;0124;0125;0131;0132;0139;0141;0142;0151;0152;0153;0159;016;0201;0202;0203;0204;0205;0206;0207;0211;0212;0213;0214;022;0231;0232;0233;0234;0235;0239;0242;0243;0291;0292;0293;0294;0295;0296;0299;0301;0302;0309;031;0321;0329;0332;0339;034;0351;0352;0353;0359;036;0371;0372;0379;0393;0394;0396;0397;0398;0399;0401;0402;0403;0404;0405;0406;0407;0411;0412;0419;042;043;0441;0442;0443;0444;0449;045;046;0471;0472;0473;0474;0475;0476;0479;0480;0489;0491;0492;0493;0499;050;0511;0519;0521;0522;0523;0524;0525;0529;0532;0539;0581;0589;059;0601;0602;0609;0611;0612;0613;0619;062;0631;0639;064;0650;0651;0652;066;067;"
                            This would be (partially) coded using db as follows:

                            Code:
                            searchStringData:
                            !db 59, 48,48,53,48, 59, 48,48,53,49, 59, 48,48,53,50, 59, 48,48,53,51
                            !db 59, 48,48,53,52, 59, 48,48,53,53, 59, 48,48,54,49, 59, 48,48,54,50
                            !db 59, 48,48,54,51, 59, 48,48,54,52, 59, 48,48,54,53, 59, 48,48,54,54
                            '... continued to as many lines as needed up to 64000+ lines, 250+ chrs long each.
                            You can think of it similar to using CHR$(59, 48,48,53,48, 59,...) but it is calculated at compile time, not run time.

                            To use it, you could do this:
                            Code:
                            LOCAL ssdPtr AS STRING PTR * 20000  '<< however many characters your string actually has 
                            ssdPtr = CODEPTR(searchStringData)  'points to first byte of your !db data 
                            x = INSTR(@ssdPtr, ";0061;")        'search for your substring
                            If you're interested, I can post a full working example.
                            Last edited by John Gleason; 17 Oct 2007, 12:35 PM. Reason: removed dupe

                            Comment


                            • #15
                              Yes. numeric values are key to speed if possible.

                              What I had in mind, Michael, was if the relationship of Walter's actual data must be in a fixed "world" range of values, was a (1) a bit array for each table (2) an algo that can translate the search "string" to an index value directly used to check the bit in the bit array for the table. That is the minimum memory usage. A similar concept is a numeric array for each table that does the same basic thing, but in that case the search "string" is the index you want to check in that array directly. In both of these variatiions the numeric array still represents T/F versus possible values, so initialization is required. I can think of a third way here to use the bit array concept but only select and read a data string ... lots of ways, but we need more specifics. If either of these are possible it ought to be possible to eliminate "searching" altogether and just directly check the T/F status.

                              However at a minimum, if the string data could be related to numeric values then even a numeric array search would be an improvement as well.
                              Rick Angell

                              Comment


                              • #16
                                Walter --

                                > Storing as standalone file seems not very practical,
                                > as the structure of 'searchedString' is not fixed.

                                That doesn't matter, you can still store the string in a disk file. OPEN FOR BINARY... GET$ into a string variable... CLOSE. Voila, cute and simple, the variable contains exactly the string you need. Using a separate file would also allow you to update the "definition file" without updating the DLL, which may or may not be important to you.

                                > There seems to b a limit for the length of string literals

                                Oops, I forgot about that. As Richard suggested, use a string variable instead of a string equate and it should work fine. But I'd still prefer the load-from-file angle.

                                Of course the basic premise is using INSTR for the search. It would certainly be possible, as others have suggested, to avoid the huge string and use a more sophisticated (and presumably faster) search technique. If super-tuned runtime speed isn't an issue, I'd say keep it simple and use INSTR. The other end of the scale (warp speed) involves ASM programming.

                                If you are thinking "loading the data from the disk can't possibly be as fast as embedding it in the DLL", remember that the DLL has to be loaded from disk too. One way or another, you'll be loading your string from the disk.

                                MCM --

                                > I think this application is screaming for a User-Defined resource

                                In a professional-distrib circumstance, I agree. For an in-house utility, it's more effort than it's worth. I know, you like "style points" but I like getting the job done in a situation-appropriate way.

                                -- Eric
                                "Not my circus, not my monkeys."

                                Comment


                                • #17
                                  I developed a prototype to gain a clearer view about the speed. These are the results:

                                  String search INSTR:
                                  • 0.158 ms to initialize it with 13k of character data (2700 items)
                                  • 0.075 ms to search it worst case (=strSearch not contained)
                                  • 0.036 ms to search medium case
                                  • 0.0 ms to search for the best case



                                  Binary Search (prototype)
                                  • 0.11 ms to initialize a simple Array of String with 13k of character data (2700 items)
                                  • 0.0006 ms to search one item


                                  I was very astonished how much faster the binary search is! I expected the optimized code of today's processors, to scan such small amounts of data much faster.

                                  Comment


                                  • #18
                                    Thank you for all these ideas. I like Michael's idea with the user defined resources for its manageability. To distribute only a few files is very important in this project. John Gleason's idea with inline assembly data will result in no additional files as well.

                                    What I want to avoid is loading the initialization data twice. This speaks against my workaround sample and again favor's the inline assembly or loading from an independent file. I think this speaks against Michael's Resource based solution – but I would need to do a test on that.

                                    The most stunning idea was to look at this data as numbers instead of strings. I work with this kind of data as strings for so many years (and everyone else who uses this data does as well), that I did not even took this into consideration. But: I think with some brainwork it might be done and lead to a tremendously enhanced solution. There are two types of data: 4 digits (-> INTEGER) and one character followed by 4 digits (-> LONG). This sounds simple to convert, but sometimes there are valid codes with only 3 digits or one character followed by 2 or three digits. This makes the whole matter more complicated. But: I'll definitively take this into consideration.
                                    If I manage to solve this, even Richard Angell's idea with the T/F array might be possible and useful for the large search strings. T/F is not for all tables a good idea: there are 28 'searchedString's and 21 of those have less than 100 values – here the T/F idea does not make sense.

                                    There is a strong argument that speaks against INSTR search, even if I do not need warp speed: due to the nature of my searches, most of them will fail. And INSTR search produces the worst runtime results exactly for this situation. This is clearly in favor of binary search.


                                    I'll try to represent my data as numbers in order to speed up the search. I think John Gleason solution with inline assembly to initialize the tables combined with a binary search will produce the best results taking developing time also into regard. This is valid for searches on strings or numbers.
                                    Later I could optimize it with the T/F idea for large tables.

                                    Thank you for all your valuable input!

                                    Comment


                                    • #19
                                      @John
                                      I would greatly appreciate your full working example!

                                      Comment


                                      • #20
                                        Code:
                                        #COMPILE EXE
                                        #DIM ALL
                                        
                                        FUNCTION PBMAIN () AS LONG             'fully working !db example
                                        
                                           LOCAL ssdPtr AS STRING PTR * 60  '<< up to as many characters your string actually has. Here I take all 60
                                           LOCAL x AS LONG
                                        
                                           'here's your !db (data in byte format) section. It is a lot like simply using
                                           'CHR$(59, 48,48,53,48, 59, 48,48,53,49, 59,...) but data gets included at compile time
                                           'and is incorporated right into your DLL or EXE.
                                           'Since it is often hundreds of lines, I generally write a little program to create
                                           'these !db lines from a file containing the data you want to convert into a string.
                                        
                                         GOTO passDBdata  'don't let program read into your !db section or you'll get an illegal operation or other unusual effect.
                                         searchStringData:
                                         !db 59, 48,48,53,48, 59, 48,48,53,49, 59, 48,48,53,50, 59, 48,48,53,51, 59, 48,48,53,52, 59, 48,48,53,53, 59, 48,48,54,49, 59, 48,48,54,50
                                         !db 59, 48,48,54,51, 59, 48,48,54,52, 59, 48,48,54,53, 59, 48,48,54,54
                                         '... continued to as many lines as needed up to 64000+ lines, and can be 250+ chrs long each.
                                         passDBdata:
                                        
                                           ssdPtr = CODEPTR(searchStringData)  'points to first byte of your !db data
                                           x = INSTR(@ssdPtr, ";0061;")        'search for your substring
                                           ? "String found at byte " & FORMAT$(x)
                                           x = INSTR(@ssdPtr, ";0066")         'search for your substring
                                           ? "String found at byte " & FORMAT$(x)
                                           x = INSTR(@ssdPtr, "0026")          'search for your substring
                                           ? "String found at byte " & FORMAT$(x) & ", which is actually string not found."
                                        
                                           'Also, note you do not need to search from the beginning of searchStringData. You could
                                           'use something like ssdPtr2 = ssdPtr + 20, then the 21st character in the string would be
                                           'the first byte pointed to by ssdPtr2. Added: This would be useful in, for example, the previously
                                           'mentioned binary search technique. 
                                        
                                        END FUNCTION
                                        Last edited by John Gleason; 18 Oct 2007, 01:29 PM. Reason: added line in red

                                        Comment

                                        Working...
                                        X