
Indexing Text Files on the Fly!


  • Jim Padgett
    There is an article in Dr. Dobb's Journal about suffix arrays that
    might be of use to you.
    I don't always understand the articles but this looks like something
    you might look at.
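
    For readers unfamiliar with the term, here is a minimal sketch of a suffix array in Python (not taken from the article, and not PowerBASIC). The function names are hypothetical, and the construction shown is the naive one, which is far too slow for 40 MB of text; it only illustrates the idea that sorting every suffix lets a binary search find all occurrences of a substring.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order."""
    # Naive construction: sorts by comparing whole suffixes. Demo only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, suffix_array, pattern):
    """Return the sorted offsets at which pattern occurs in text."""
    m = len(pattern)
    # Suffixes that start with pattern form one contiguous run in the
    # sorted array, so two binary searches find the run's boundaries.
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])
```

    Unlike a word index, this finds arbitrary substrings, at the cost of an array with one entry per character of text.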

    Three things are certain.
    Death, Taxes and Data Loss.
    Guess which one you have!!


  • Tony Altwies
    Hi Mike,

    You could use VB/ISAM and store each file as a record
    or each paragraph in the record (up to 65k bytes each).
    The add-on VTOOL contains a function called VtSearch,
    that will do full text searches at over 2MB per second.
    It uses a simple form of a query language that allows
    for "and", "or", and "not" operators. In addition, you
    can specify and filter hits that "Begin with", "Contain"
    or are exactly "Equal" to the search criteria.
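
    To illustrate what "and", "or", and "not" operators do in a query, here is a rough Python sketch. It is not VtSearch or any part of the VB/ISAM API; the function `search` and the shape of `index` (a dict mapping each word to the set of files containing it) are assumptions made for the example. The query is evaluated strictly left to right, with no precedence or parentheses.

```python
def search(index, query):
    """Evaluate a flat query such as 'disk and index not tape'.

    index maps each word to the set of file names that contain it.
    """
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first word's hits so the index itself is never modified.
    result = set(index.get(tokens[0], set()))
    # The remaining tokens are read as (operator, word) pairs.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        hits = index.get(word, set())
        if op == "and":
            result &= hits   # keep files that contain both
        elif op == "or":
            result |= hits   # add files that contain the new word
        elif op == "not":
            result -= hits   # drop files that contain the new word
        else:
            raise ValueError(f"unknown operator: {op}")
    return result
```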

    You can send me Email if you want to discuss it more off




  • Paul Noble

    If the files don't change, then presumably you'd want to build a catalog of keywords once and for all, then use that to point any searches in the right place.
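
    A build-once catalog of this kind could be sketched as follows, in Python for illustration. The function names and the JSON file format are assumptions, not anything from the thread: the catalog is built a single time, saved to disk, and reloaded on later runs so a search never has to reopen the text files just to find which ones mention a word.

```python
import json
import re
from collections import defaultdict

def build_catalog(documents):
    """Map each lowercase word to the sorted list of documents containing it.

    documents is a dict of {file name: file text}.
    """
    catalog = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            catalog[word].add(name)
    # Sets are not JSON-serializable, so store sorted lists instead.
    return {word: sorted(names) for word, names in catalog.items()}

def save_catalog(catalog, path):
    """Write the catalog to disk once the build is finished."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(catalog, f)

def load_catalog(path):
    """Reload a previously saved catalog instead of rebuilding it."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

    A lookup is then `catalog.get("keyword", [])`, which returns the list of files to open.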

    This looks like it might take the grunt out of that particular job - .

    HTH -



  • Michael Meeks

    Most of these text files range between 10k and 30k, not very
    large at all. These files will not change. They are fixed.
    It's the number of files (4500) that slows me down...

    ..4500 reads
    ..4500 parsing
    ..4500 writes

    Gonna find a better way....

    I wonder if this would be faster... x number of files into a buffer
    ..buf reaches a certain limit
    ..parse and write to disk
    ..loop until all files have been read
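
    The buffering idea above could look roughly like this in Python (a sketch, not a measured improvement; whether it beats per-file processing depends on where the 20 minutes actually go). The names `read_files`, `parse_batch`, `index_in_batches` and the default `limit_chars` are all made up for the example.

```python
import re
from collections import defaultdict

def read_files(paths):
    """Yield (path, text) for each file; undecodable bytes are replaced."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            yield path, f.read()

def parse_batch(batch):
    """Build a partial index (word -> set of file names) for one buffer."""
    partial = defaultdict(set)
    for name, text in batch:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            partial[word].add(name)
    return dict(partial)

def index_in_batches(documents, limit_chars=4_000_000):
    """Collect (name, text) pairs until the buffer reaches limit_chars,
    then parse the whole buffer and yield its partial index."""
    buffer, size = [], 0
    for name, text in documents:
        buffer.append((name, text))
        size += len(text)
        if size >= limit_chars:
            yield parse_batch(buffer)
            buffer, size = [], 0
    if buffer:  # flush whatever is left after the last file
        yield parse_batch(buffer)
```

    The caller would loop over `index_in_batches(read_files(paths))` and write each partial index to disk, so the number of writes drops from one per file to one per buffer.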

    Thanks for the help guys!

    dtSearch is almost $1000 for the DLL control! (WOW)
    $185 just for the program!

    Search32 has a fast indexing set of two DLLs, but their requirements
    are unreasonable. The price is $142.50, good for only 5 distributions.
    For more, you must purchase another 5 for $142.50, or 1 at a
    time for $37.05. Furthermore, when distributing your program, you
    must include their program too! A major disappointment! Their
    tech support is email only; it's a Russian company out of Moscow. The
    license agreement is confusing, but I just got this info from them.
    You could get full use of the DLLs, but that's where the
    price jumps to $1000, like dtSearch.

    Isys Desktop - Although it looks and sounds good... unable to find
    any price listing on this product out of Canada.

    Some of the others I've seen I won't mention because they can't
    do it any faster than my own coding!

    Once again, thanks for the help!



  • Michael Mattias
    About how many total words do these 40 files contain?

    Does the index need to be rebuilt each time the program runs, or can you build it once, maintain it on the fly, save it to disk when the program ends/reload next run, and rebuild "on demand?"

    Will you search only for equality, or do you intend to search for words "starting with...?"

    Depending on these environmental and application factors, something like a hash-total index might be useful.
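
    The equality vs. "starting with" question matters because the two call for different structures. A small Python sketch under that reading (the names are hypothetical, and a plain hash set stands in for whatever "hash-total index" design is meant): a hash lookup answers exact matches, while a sorted word list plus binary search answers prefix matches.

```python
from bisect import bisect_left

def build_lookups(words):
    """Return (hash set for equality tests, sorted list for prefix tests)."""
    unique = set(words)
    return unique, sorted(unique)

def words_starting_with(sorted_words, prefix):
    """Return every word in sorted_words that begins with prefix."""
    # All matches sit next to each other, beginning at the first
    # position where prefix could be inserted in sorted order.
    start = bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    return sorted_words[start:end]
```

    If only equality searches are needed, the hash set alone is enough; a hash cannot answer "starting with" without scanning every key.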



  • Michael Meeks
    Thanks All, for the responses...

    Colin, can you email me an example of what you mean?

    Currently, I'm accessing about 4500 text files, a total of
    40+MB of data.

    Currently with PB (my first attempt at this) and the 4500
    files above, it takes about 20 minutes to create a word-index database.
    And this, of course, is without using a dictionary.

    Would a dictionary of specialized words speed things up?

    I just wanted to get some feedback on how some of you would
    approach the idea!

    I have a search-tool program that will read those 4500 Text
    files (40MB), Read, Write and Index them in 90 seconds. And of
    course the author of the program won't sell the source-code!
    He's probably using assembly!

    However, I'm only about 18 minutes away from him. So in my
    spare time, I will dabble with ideas until I find a better

    Thanks again to all who responded!

    [email protected]


  • Colin Schmidt

    -How big (in MB) would the total size of the files be?
    -How fast is the average computer in question?
    -Are the files always accessed over a network?
    --If so what kind of speed on the network?

    The reason I'm asking is that if you want it to be updated
    very often anyway, would it be better to create an intelligent
    and super fast search engine instead of an index? A search vs.
    an index would also be less limiting to the procedure of
    adding and changing the source documents.

    If so I have lots of example code to start with, such as a
    dictionary parser that loads up 4,500 keywords and their
    definitions in no time flat.

    If this is completely the wrong approach as you said to start
    with, then sorry! Just a second look.

    Colin Schmidt

    Colin Schmidt & James Duffy, Praxis Enterprises, Canada
    [email protected]


  • Lane Weast
    Isys is a document indexer, search and retrieval program widely
    used with county governments to make meeting and hearing minutes
    searchable and easy for the public to access.
    It works well for us.
    for the desktop trial download.

    Try it out. Or, if you want help with writing an indexer for
    posting in the source code section, zap me an email.

    [email protected]



  • Ralph Berger

    If Asksam is close to what you need, AskPingi, a free open-source
    port to Linux written in C, may give you some ideas on how to start.




  • Michael Meeks
    Thanks for the response!

    Here's part of my problem!

    1 Standalone tools, usually used for back-of-the-book indexes,
    allow indexers to work from programmed-page-numbered keys.
    The indexing is completely separate from the client's data.

    2 Embedding tools allow indexing codes to be embedded in the
    file, and allow the index locators to be updated as the
    text changes.

    3 Tagging tools allow indexing codes to be embedded in the files
    after the indexing is complete. The indexer inserts numbered
    dummy tags in the files, and then builds the index separately.

    4 Keywording is primarily hard-coded jumps, similar to HTML jumps,
    or it can be inserted as embedded coding and compiled into a list
    by your software.

    5 Weighted-text search tools, similar to the intelligence in agents
    or Microsoft's Office Assistant, usually involve building terminology
    sets for helping the intelligence work. (beyond my means and costs)

    6 Automated indexing software builds a concordance, or a word list,
    from processed files. Although you can claim these programs build
    indexes, the actual results are a list of words and phrases, sometimes
    useful in the beginning stages of building an index. Usability tests
    of these programming tools have shown that the word lists omit many
    key phrases, and cannot fine-tune terminology for easy retrieval,
    or build the needed hierarchies of ideas that you and I can do from

    With all that being said.....

    Many of these tools are developed in-house to fit the programmer's needs,
    or whatever the task is at hand. However, creating such a tool or dll would
    seem foolish if it can only be used for one job!

    Any other thoughts from you gurus?



  • Paul Noble
    Hi Michael,

    This looks promising -

    [This message has been edited by Paul Noble (edited March 09, 2001).]


  • Paul Dwyer
    Import them into SQL Server!

    Honestly I don't know but I'm going to follow this thread because I'm curious.
    My first thought, though, would be to get the data out of the "thousands of text files". Does it need to stay in that format?


    Paul Dwyer
    Network Engineer
    Aussie in Tokyo


  • Michael Meeks
    started a topic Indexing Text Files on the Fly!


    I have a client with 1000's of text-based files...

    My Question:
    How would one begin to create an index(es) of keywords in the text
    files, without treating common words (the, and, is, to, etc.) as keywords,
    and do it on the fly or update it periodically?

    Using an index on ten(s) of thousands of text files would be much
    quicker than searching each file 1 by 1.....
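
    The "skip the, and, is, to" part of the question is usually handled with a stop-word list applied while splitting text into words. A minimal Python sketch, where the `STOP_WORDS` set and the `keywords` function are illustrative assumptions (a real list would be longer and tuned to the client's files):

```python
import re

# Illustrative only: common words that should never become index keys.
STOP_WORDS = {"the", "and", "is", "to", "a", "of", "in", "it"}

def keywords(text, stop_words=STOP_WORDS):
    """Return the lowercase words of text, in order, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stop_words]
```

    Each word returned would then be recorded against the file it came from to form the index.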

    Any thoughts?