
Indexing Text Files on the Fly!


  • Indexing Text Files on the Fly!

    Hi,

    I have a client with 1000's of text-based files...

    My Question:
    How would one begin to create an index (or indexes) of keywords in the
    text files, without treating common words (the, and, is, to, etc.) as
    keywords, and either build it on the fly or update it periodically?

    Using an index on tens of thousands of text files would be much
    quicker than searching each file one by one.
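
A minimal sketch of the idea in the question, in Python rather than the poster's own tools: an inverted index that maps each keyword to the files containing it and skips common words. The stop-word list, the `tokenize`, `build_index` and `add_document` names, and the in-memory sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "is", "to", "a", "of", "in"}

WORD_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Yield lowercase words from text, skipping stop words."""
    for word in WORD_RE.findall(text.lower()):
        if word not in STOP_WORDS:
            yield word

def build_index(documents):
    """Map each keyword to the set of document names that contain it.

    `documents` is a dict of {name: text}; in practice the text would be
    read from the files on disk.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def add_document(index, name, text):
    """Update an existing index "on the fly" with one new document."""
    for word in tokenize(text):
        index[word].add(name)

docs = {
    "a.txt": "The cat is on the mat",
    "b.txt": "A dog and a cat",
}
index = build_index(docs)
add_document(index, "c.txt", "The dog went to the park")
print(sorted(index["cat"]))  # ['a.txt', 'b.txt']
print(sorted(index["dog"]))  # ['b.txt', 'c.txt']
print("the" in index)        # False
```

A lookup is then a single dictionary access instead of a pass over every file, which is the speed-up the question is after.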

    Any thoughts?

    Thanks
    MWM

    mwm

  • #2
    Import them into SQL Server!

    Honestly, I don't know, but I'm going to follow this thread because I'm curious.
    My first thought, though, would be to get the data out of the "thousands of text files". Does it need to stay in that format?

    cheers

    ------------------
    Paul Dwyer
    Network Engineer
    Aussie in Tokyo

    Comment


    • #3
      Hi Michael,

      This looks promising - http://www.dtsearch.com/



      [This message has been edited by Paul Noble (edited March 09, 2001).]
      Zippety Software, Home of the Lynx Project Explorer
      http://www.zippety.net
      My e-mail

      Comment


      • #4
        Thanks for the response!

        Here's part of my problem!

        1 Standalone tools, usually used for back-of-the-book indexes,
        allow indexers to work from programmed-page-numbered keys.
        The indexing is completely separate from the client's data.

        2 Embedding tools allow indexing codes to be embedded in the
        file, and allow the index locators to be updated as the
        text changes.

        3 Tagging tools allow indexing codes to be embedded in the files
        after the indexing is complete. The indexer inserts numbered
        dummy tags in the files, and then builds the index separately.

        4 Keywording is primarily hard-coded jumps, similar to HTML jumps,
        or it can be inserted as embedded coding and compiled into a list
        by your software.

        5 Weighted-text search tools, similar to the intelligence in agents
        or Microsoft's Office Assistant, usually involve building terminology
        sets for helping the intelligence work. (beyond my means and budget)

        6 Automated indexing software builds a concordance, or a word list,
        from processed files. Although you can claim these programs build
        indexes, the actual results are a list of words and phrases, sometimes
        useful in the beginning stages of building an index. Usability tests
        of these programming tools have shown that the word lists omit many
        key phrases, and cannot fine-tune terminology for easy retrieval,
        or build the needed hierarchies of ideas that you and I can do from
        scratch!
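
To make item 6 concrete, here is a rough sketch of what a concordance builder produces: every word with its count and the lines it appears on. The `build_concordance` name and the sample text are hypothetical, and this is not how any of the tools mentioned above work internally.

```python
import re
from collections import Counter, defaultdict

def build_concordance(text):
    """Return (counts, locations) for every word in text.

    counts maps word -> number of occurrences;
    locations maps word -> sorted list of 1-based line numbers.
    """
    counts = Counter()
    locations = defaultdict(set)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            counts[word] += 1
            locations[word].add(line_no)
    return counts, {w: sorted(lines) for w, lines in locations.items()}

sample = "Index the files.\nThe index lists words.\nWords point to files."
counts, locations = build_concordance(sample)
print(counts["index"], locations["index"])   # 2 [1, 2]
print(counts["files"], locations["files"])   # 2 [1, 3]
```

The result is a flat word list with locators. It carries no phrases and no hierarchy of ideas, which is the limitation described in item 6.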

        With all that being said.....

        Many of these tools are developed in-house to fit the programmer's needs,
        or whatever the task at hand is. However, creating such a tool or DLL would
        seem foolish if it can only be used for one job!

        Any other thoughts from you gurus?

        Thanks
        Mike


        mwm

        Comment


        • #5
          Michael;

          If askSam is close to what you need, AskPingi, a free open-source
          port to Linux written in C, may give you some ideas on how to start.

          Ralph

          ------------------

          Comment


          • #6
            Isys is a document indexer, search and retrieval program widely
            used with county governments to make meeting and hearing minutes
            searchable and easy for the public to access.
            It works well for us.
            http://www.isysdev.com/products/desktop.shtml
            for the desktop trial download.

            Try it out. Or, if you want help with writing an indexer for
            posting in the source code section, zap me an email.

            [email protected]




            ------------------

            Comment


            • #7
              Michael,

              -How big (in MB) would the total size of the files be?
              -How fast is the average computer in question?
              -Are the files always accessed over a network?
              --If so what kind of speed on the network?

              The reason I'm asking is that if you want it to be updated
              very often anyway, would it be better to create an intelligent
              and super fast search engine instead of an index? A search
              instead of an index would also place fewer limits on
              adding and changing the source documents.
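
As a rough illustration of the "search instead of index" approach, the sketch below scans every file on each query. The `search_files` name and the temporary demo folder are assumptions; this is not Colin's code, and a tuned search engine would do far more than a plain substring test.

```python
import os
import tempfile

def search_files(folder, needle):
    """Return sorted names of files in folder whose text contains needle.

    Case-insensitive substring match; every file is read on every search,
    so nothing has to be rebuilt when the documents change.
    """
    needle = needle.lower()
    hits = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            if needle in handle.read().lower():
                hits.append(name)
    return hits

# Small self-contained demo using a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("a.txt", "Meeting minutes"), ("b.txt", "Budget hearing")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    result = search_files(folder, "minutes")
    print(result)  # ['a.txt']
```

The trade-off is that query time grows with the total size of the files, which is why the questions above about size, machine speed and network speed matter.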

              If so I have lots of example code to start with, such as a
              dictionary parser that loads up 4,500 keywords and their
              definitions in no time flat.

              If this is completely the wrong approach as you said to start
              with, then sorry! Just a second look.

              Colin Schmidt

              ------------------
              Colin Schmidt & James Duffy, Praxis Enterprises, Canada
              [email protected]

              Comment


              • #8
                Thanks All, for the responses...

                Colin, can you email me an example of what you mean?

                Currently, I'm accessing about 4500 text files, a total of
                40+MB of data.

                Currently with PB (my first attempt at this) and the 4500
                files above, it takes about 20 minutes to create a word-index database.
                And this, of course, is without using a dictionary.

                Would a dictionary of specialized words speed things up?

                I just wanted to get some feedback on how some of you would
                approach the idea!

                I have a search-tool program that will read those 4500 Text
                files (40MB), Read, Write and Index them in 90 seconds. And of
                course the author of the program won't sell the source-code!
                He's probably using assembly!

                However, I'm only about 18 minutes away from him. So in my
                spare time, I will dabble with ideas until I find a better
                solution.


                Thanks again to all who responded!

                Mike
                [email protected]


                mwm

                Comment


                • #9
                  About how many total words do these 4500 files (40+ MB) contain?

                  Does the index need to be rebuilt each time the program runs, or can you build it once, maintain it on the fly, save it to disk when the program ends/reload next run, and rebuild "on demand?"

                  Will you search only for equality, or do you intend to search for words "starting with...?"

                  Depending on these environmental and application factors, something like a hash-total index might be useful.
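
The equality versus "starting with" question matters because it decides which structure fits. The sketch below uses a plain hash table as a stand-in (the post does not spell out what a "hash-total index" would look like) and a sorted list with binary search for prefix lookups. The word list and the `starts_with` name are made up for the example.

```python
import bisect

words = ["index", "indexer", "indices", "search", "searching", "sort"]

# Equality search: a hash table (Python's set/dict) answers in O(1) on average.
hashed = set(words)
print("search" in hashed)   # True
print("sear" in hashed)     # False: a hash cannot answer "starts with"

# Prefix search: keep the words sorted and binary-search the matching range.
sorted_words = sorted(words)

def starts_with(prefix):
    """Return all words in sorted_words that begin with prefix."""
    lo = bisect.bisect_left(sorted_words, prefix)
    # "\uffff" sorts after any character likely to appear in a word.
    hi = bisect.bisect_right(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]

print(starts_with("ind"))   # ['index', 'indexer', 'indices']
print(starts_with("sea"))   # ['search', 'searching']
```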

                  MCM
                  Michael Mattias
                  Tal Systems Inc. (retired)
                  Racine WI USA
                  [email protected]
                  http://www.talsystems.com

                  Comment


                  • #10
                    Mattias,

                    Most of these text files range between 10k and 30k, not very
                    large at all. These files will not change. They are fixed.
                    It's the number of files (4500) that slows me down...

                    ..4500 reads
                    ..4500 parsing
                    ..4500 writes

                    Gonna find a better way....

                    I wonder if this would be faster...

                    ..read x number of files into a buffer
                    ..buf reaches a certain limit
                    ..parse and write to disk
                    ..loop until all files have been read
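
The buffering loop above could be sketched like this. The `BUFFER_LIMIT` value, the `index_in_batches` name and the `flush` callback are assumptions for illustration; in the demo the "write to disk" step is replaced by a merge into an in-memory index, and whether batching is actually faster would have to be measured.

```python
import re
from collections import defaultdict

BUFFER_LIMIT = 64  # characters; tiny for the demo, an assumption to tune in practice

def index_in_batches(files, flush):
    """Accumulate (name, text) pairs and call flush() once per full batch.

    `files` is an iterable of (name, text); `flush` receives a partial
    index {word: set of names} and would normally write it to disk.
    Returns the number of flushes performed.
    """
    batch, size, flushes = [], 0, 0

    def process(items):
        partial = defaultdict(set)
        for name, text in items:
            for word in re.findall(r"[a-z0-9']+", text.lower()):
                partial[word].add(name)
        flush(partial)

    for name, text in files:
        batch.append((name, text))
        size += len(text)
        if size >= BUFFER_LIMIT:      # buffer reached its limit
            process(batch)            # parse and write
            flushes += 1
            batch, size = [], 0
    if batch:  # leftover files that did not fill a whole buffer
        process(batch)
        flushes += 1
    return flushes

merged = defaultdict(set)

def merge_into_memory(partial):
    # Stand-in for "write to disk": merge the batch into one in-memory index.
    for word, names in partial.items():
        merged[word] |= names

files = [(f"file{i}.txt", "alpha beta gamma delta " * 2) for i in range(5)]
flush_count = index_in_batches(files, merge_into_memory)
print(flush_count)             # 3
print(len(merged["alpha"]))    # 5
```

With 4500 files the point is to replace 4500 small writes with a handful of large ones; the reads still happen once per file.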

                    Thanks for the help guys!

                    dtSearch is almost $1000 for DLL control! (WOW)
                    $185 just for the program!

                    Search32 has a fast indexing set of two DLLs, but their requirements
                    are unreasonable. The price is $142.50, good for only 5 distributions.
                    You must purchase another 5 for $142.50, or you can purchase 1 at a
                    time for $37.05. Furthermore, when distributing the program, you
                    must include their program also! A major disappointment! Their
                    tech support is email only, from a Russian company out of Moscow. The
                    license agreement is confusing, but I just got this info from them.
                    You could get full use of the DLLs, but that's where the
                    price jumps to $1000, like dtSearch.

                    Isys Desktop - Although it looks and sounds good, I was unable to find
                    any price listing for this product out of Canada.

                    Some of the others I've seen I won't mention because they can't
                    do it any faster than my own coding!

                    Once again, thanks for the help!

                    Mike

                    mwm

                    Comment


                    • #11
                      Mike,

                      If the files don't change, then presumably you'd want to build a catalog of keywords once and for all, then use that to point any searches in the right place.
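
Since the files are fixed, the catalog only has to be built once and can then be saved and reloaded. A minimal sketch, assuming the catalog is a dictionary of keyword to file names and that JSON is an acceptable on-disk format; `save_index`, `load_index` and `catalog.json` are hypothetical names.

```python
import json
import os
import tempfile

def save_index(index, path):
    """Write {word: set of file names} to disk as JSON (sets become lists)."""
    serializable = {word: sorted(names) for word, names in index.items()}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(serializable, handle)

def load_index(path):
    """Read the catalog back, restoring the sets."""
    with open(path, "r", encoding="utf-8") as handle:
        return {word: set(names) for word, names in json.load(handle).items()}

catalog = {"budget": {"b.txt"}, "minutes": {"a.txt", "c.txt"}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "catalog.json")  # hypothetical file name
    save_index(catalog, path)
    reloaded = load_index(path)

print(reloaded == catalog)            # True
print(sorted(reloaded["minutes"]))    # ['a.txt', 'c.txt']
```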

                      This looks like it might take the grunt out of that particular job - http://www.rjcw.freeserve.co.uk/index.htm .

                      HTH -

                      Paul
                      Zippety Software, Home of the Lynx Project Explorer
                      http://www.zippety.net
                      My e-mail

                      Comment


                      • #12
                        Hi Mike,

                        You could use VB/ISAM and store each file as a record
                        or each paragraph in the record (up to 65k bytes each).
                        The add-on VTOOL contains a function called VtSearch
                        that will do full-text searches at over 2MB per second.
                        It uses a simple form of a query language that allows
                        for "and", "or", and "not" operators. In addition, you
                        can specify and filter hits that "Begin with", "Contain"
                        or are exactly "Equal" to the search criteria.
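
For readers unfamiliar with this kind of query, the sketch below shows how "and", "or" and "not", plus "Begin with", "Contain" and "Equal" filters, can be expressed as set operations over a keyword index. It is not VtSearch's API or query syntax, which is not shown in this thread; every name and the sample index are made up for illustration.

```python
# Hypothetical keyword index: word -> set of files containing it.
index = {
    "budget":   {"a.txt", "b.txt"},
    "hearing":  {"b.txt", "c.txt"},
    "minutes":  {"a.txt", "c.txt"},
    "minority": {"d.txt"},
}
all_files = set().union(*index.values())

def equal(term):
    """Files containing exactly this word."""
    return set(index.get(term, set()))

def begins_with(prefix):
    """Files containing any word that starts with prefix."""
    found = set()
    for word, names in index.items():
        if word.startswith(prefix):
            found |= names
    return found

def contains(fragment):
    """Files containing any word that includes fragment anywhere."""
    found = set()
    for word, names in index.items():
        if fragment in word:
            found |= names
    return found

print(sorted(equal("budget") & equal("hearing")))  # and: ['b.txt']
print(sorted(equal("budget") | equal("hearing")))  # or:  ['a.txt', 'b.txt', 'c.txt']
print(sorted(all_files - equal("minutes")))        # not: ['b.txt', 'd.txt']
print(sorted(begins_with("min")))                  # ['a.txt', 'c.txt', 'd.txt']
print(sorted(contains("ear")))                     # ['b.txt', 'c.txt']
```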

                        You can send me Email if you want to discuss it more off
                        line.

                        -Tony

                        ------------------

                        Comment


                        • #13
                          There is an article in Dr. Dobb's Journal about suffix arrays that
                          might be of use to you.
                          I don't always understand the articles but this looks like something
                          you might look at.
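
For anyone curious what a suffix array is, here is a small sketch, not taken from the article: sort the start positions of every suffix of the text, then binary-search that list to find all occurrences of a pattern. The naive construction below is an assumption chosen for brevity and would be far too slow for 40 MB; practical implementations use specialised algorithms.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order.

    Naive construction that compares whole suffixes; fine for a sketch only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, suffix_array, pattern):
    """Return sorted offsets where pattern occurs, using two binary searches."""
    n = len(pattern)

    def prefix_at(rank):
        start = suffix_array[rank]
        return text[start:start + n]

    # First suffix whose leading n characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose leading n characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana bandana"
sa = build_suffix_array(text)
print(find_all(text, sa, "ana"))  # [1, 3, 11]
print(find_all(text, sa, "ban"))  # [0, 7]
print(find_all(text, sa, "xyz"))  # []
```

Unlike a keyword index, this finds any substring, not just whole words, at the cost of an array as long as the text itself.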


                          ------------------
                          Three things are certain.
                          Death, Taxes and Data Loss.
                          Guess which one you have!!
                          Warped by the rain, Driven by the snow...

                          jimatluv2rescue.com

                          Comment
