Searching in strings > 32kb

  • Searching in strings > 32kb

    (I promise this is my last question.)
    I need to read a file that is normally between 10 and 12 MB and find a particular chunk in it. I can easily find it by looking for a string that is approx. 30 bytes long. The problem is that the file is > 32 KB and I'm using PB/DOS, so I can't read the whole file into a string like I would in PB/CC.
    I know I'm not the first person who's read a file > 32 KB, so I was hoping people might give me pointers on how to go about finding a needle in a large haystack when I can only search a tiny bit of the haystack at a time.
    Thanks,
    Wayne


    ------------------
    -

  • #2
    The basic technique for searching large files is quite straightforward. Let's say you want to find a "search string" that is 10 bytes long.
    • Open the file to search in binary mode, and read the first block, say 32750 bytes (32750 bytes is the maximum string length in PB/DOS).
    • Use INSTR() on that string; if found, process it accordingly.
    • If not found, move the file pointer back 9 bytes (the search string length less one), and load another 32750 bytes.
    • Repeat from step 2 above.
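
The steps above can be sketched as follows. This is only an illustration, written in Python rather than PB/DOS so that it runs on any modern machine; `find_in_file` and `block_size` are hypothetical names, and the default of 32750 simply mirrors the PB/DOS string limit mentioned above. Note that it returns a 0-based byte offset (or -1), not a PB-style 1-based position.

```python
def find_in_file(path, needle, block_size=32750):
    """Return the 0-based byte offset of the first match, or -1 if absent.

    The file is read in blocks of at most block_size bytes. Consecutive
    blocks overlap by len(needle) - 1 bytes so that a match straddling a
    block boundary is still seen whole in the next block.
    """
    if not needle or len(needle) > block_size:
        raise ValueError("needle must be non-empty and fit in one block")

    overlap = len(needle) - 1
    with open(path, "rb") as f:
        offset = 0  # file offset of the first byte of the current block
        while True:
            f.seek(offset)
            block = f.read(block_size)
            pos = block.find(needle)
            if pos >= 0:
                return offset + pos      # block-relative -> file-relative
            if len(block) < block_size:
                return -1                # short read: end of file reached
            # Step forward, backing up by the overlap ("length less one").
            offset += block_size - overlap
```

The short-read check is what ends the loop when nothing is found; without some end-of-file test the search would never terminate on a file that does not contain the string.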

    How does that sound?

    ------------------
    Lance
    PowerBASIC Support
    mailto:[email protected]


    • #3
      Lance, that sounds beautiful! And here's the code I've come up with. It looks sensational to me, but it refuses to work! It compiles fine, though.
      Code:
      $CPU 8086                 ' program works on any CPU
      $COMPILE EXE              ' compile to an EXE
      $INCLUDE "PB35.INC"       ' link library
        
      'Function fInstr - search a file of any size for a string
      FUNCTION fInstr(sFile AS STRING, sSearchTxt AS STRING) AS LONG
      ON ERROR RESUME NEXT
      IF DIR$(sFile, 39) = "" THEN
         FUNCTION = 0
         EXIT FUNCTION
      END IF
      DIM sChunk AS STRING * 32750
      DIM hFile AS LONG, lFilePtr AS LONG, lRes AS LONG, ByteHop AS LONG
      hFile = Freefile
      ByteHop = 32750 - LEN(sSearchTxt)
      OPEN sFile FOR BINARY ACCESS READ AS hFile
       lFilePtr = 1
       DO
         GET #1, lFilePtr, sChunk
         lRes = INSTR(1, sChunk, sSearchTxt)
         IF lRes > 0 THEN
            CLOSE hFile
            FUNCTION = lRes
            EXIT FUNCTION
         END IF
         lFilePtr = lFilePtr + ByteHop
       LOOP
      CLOSE hFile
      FUNCTION = 0
      END FUNCTION
       
      ON ERROR RESUME NEXT
      PRINT "Starting..."
      'Search for the word "BIG" in c:\temp\bigfile.txt
      PRINT "fInstr = " & STR$(fInstr("c:\temp\bigfile.txt", "BIG"))
      Any idea why that is failing? It doesn't even print "Starting..." which is weird


      ------------------
      -


      • #4
        Sorry, I cannot tell you... you said you would not ask any more questions!




        ------------------
        Lance
        PowerBASIC Support
        mailto:[email protected]


        • #5
          Well, ok, I made you wait long enough...

          I note that you used ON ERROR RESUME NEXT, but did no error testing. It is always a good idea to use $ERROR ALL ON to help the debugging effort.

          Anyway, I revised your code slightly... may not be 100% bulletproof, but it seems to work fine for me:
          Code:
          $CPU 8086                 ' program works on any CPU
          $COMPILE EXE              ' compile to an EXE
           
          CLS
          PRINT "Starting..."
          PRINT "fInstr = " & STR$(fInstr("F:\PBDLL60\WINAPI\WIN32API.INC", "DeleteObject"))
           
          ' ==================================================================
          FUNCTION fInstr(sFile AS STRING, sSearchTxt AS STRING) LOCAL AS LONG
          '   ON ERROR RESUME NEXT
          
              IF ISFALSE LEN(DIR$(sFile, 7)) THEN EXIT FUNCTION
          
              DIM sChunk AS STRING
              DIM Chunk AS INTEGER
              DIM hFile AS INTEGER
              DIM lFilePtr AS LONG
              DIM lRes AS INTEGER
          
              hFile = FREEFILE
              lFilePtr = PBVBINBASE
              OPEN sFile FOR BINARY ACCESS READ AS #hFile
          
              Chunk = FRE(-4)
              DO
                  Chunk = MIN(Chunk, LOF(hFile) - lFilePtr)
                  GET$ #hFile, Chunk, sChunk
                  lRes = INSTR(sChunk, sSearchTxt)
                  IF lRes > 0 THEN
                      CLOSE hFile
                      FUNCTION = lRes
                      EXIT FUNCTION
                  END IF
                  INCR lFilePtr, Chunk - LEN(sSearchTxt) + 1
                  SEEK #hFile, lFilePtr
              LOOP UNTIL Chunk < FRE(-4)
              CLOSE hFile
              FUNCTION = 0
          END FUNCTION

          ------------------
          Lance
          PowerBASIC Support
          mailto:[email protected]


          • #6
            Thanks for your time, Lance; this is really stumping me.
            Shouldn't this: FUNCTION = lRes
            be this: FUNCTION = lRes + Chunk?
            Otherwise you'll always return a value in the range of 0-32k.
            Apart from that, I think this is exactly what I need to get up and running, although I haven't done any speed tests yet.


            ------------------
            -


            • #7
              Well, if you want to return the actual byte position (in accordance with OPTION BINARY BASE) of the match, it should really be FUNCTION = lFilePtr + lRes - 1. My example above simply returned TRUE (non-zero) to indicate that the match was made.
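
A quick worked example of that arithmetic, sketched in Python for convenience. The helper name `absolute_position` is hypothetical, and the numbers assume 1-based file positions (binary base 1), a 32750-byte block, and a 10-byte search string, so the second block starts 32750 - 9 = 32741 bytes after the first.

```python
def absolute_position(block_start, match_in_block):
    """Convert a block-relative match position to a file position.

    block_start    -- file position of the first byte of the block (lFilePtr)
    match_in_block -- 1-based position reported by INSTR within the block (lRes)
    """
    # The first byte of the block already sits at block_start, so a match
    # at INSTR position 1 must map to block_start itself: hence the "- 1".
    return block_start + match_in_block - 1


first_block = 1                            # 1-based start of the file
second_block = first_block + 32750 - 9     # step = block size less the overlap
print(absolute_position(second_block, 5))  # match at the 5th byte of block 2
```

Adding `Chunk` instead, as proposed in the previous post, would be off in two ways: the block size is not the same as the block's starting position once the overlap is taken into account, and it ignores how many blocks have already been read.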

              You could probably optimize the code a little more by moving LOF() outside of the loop, and changing LOCK mode to LOCK READ WRITE (to allow PowerBASIC to use its internal buffering facilities), etc.


              ------------------
              Lance
              PowerBASIC Support
              mailto:[email protected]


              • #8
                FUNCTION = lFilePtr + lRes - 1 , yes that was what I was after
                I just used it to look for a string located at the very end of a 13mb text file, it found it virtually instantly! *stoked*
                Thanks again very much for your time and work Lance, I should be able to take the training wheels off my PBDOS now
                Havvagoodweekend!


                ------------------
                -


                • #9
                  Wayne:
                  If your problem isn't already solved, see Chapter 8 in Ethan
                  Winer's book, "BASIC Techniques and Utilities", Ziff-Davis Press,
                  1991. You can download a free copy of the book from a number
                  of web sites; try Ethan's site to begin with (www.ethanwiner.com).

                  This book deals with QuickBASIC, but the search functions given
                  will need little if any revision to run in PB.


                  ------------------


                  • #10
                    JFYI, Ethan Winer's book (circa 1995) can be downloaded from the PowerBASIC web site at http://www.powerbasic.com/files/pub/docs/WINER.ZIP

                    ------------------
                    Lance
                    PowerBASIC Support
                    mailto:[email protected]
