There is an article in Dr. Dobb's Journal about suffix arrays that
might be of use to you.
I don't always understand the articles, but this looks like something
you might look at.
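I can't vouch for the details, but the core idea is small enough to
sketch. A toy illustration in Python (not from the article, just to
show the shape of it): build a sorted array of suffix start positions
once, then binary-search it for any substring.

def build_suffix_array(text):
    # Sort every suffix start position by the suffix it begins.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    # Binary search for the first suffix >= pattern, then check it.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(pattern)

text = "indexing text files on the fly"
sa = build_suffix_array(text)
print(contains(text, sa, "files"))   # True
print(contains(text, sa, "disk"))    # False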
------------------
Three things are certain.
Death, Taxes and Data Loss.
Guess which one you have!!
Guest replied:
Hi Mike,
You could use VB/ISAM and store each file as a record
or each paragraph in the record (up to 65k bytes each).
The add-on VTOOL contains a function called VtSearch
that will do full text searches at over 2 MB per second.
It uses a simple form of a query language that allows
for "and", "or", and "not" operators. In addition, you
can specify and filter hits that "Begin with", "Contain",
or are exactly "Equal" to the search criteria.
You can send me email if you want to discuss it more
offline.
-Tony
------------------
Mike,
If the files don't change, then presumably you'd want to build a catalog of keywords once and for all, then use that to point any searches in the right place.
This looks like it might take the grunt work out of that particular job - http://www.rjcw.freeserve.co.uk/index.htm .
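Just to picture the build-once part, a rough sketch in Python (the
folder name and layout are made up; the real tool will have its own
format):

import os, re, pickle

def build_catalog(folder):
    catalog = {}   # word -> set of files containing it
    for name in os.listdir(folder):
        with open(os.path.join(folder, name)) as f:
            for word in set(re.findall(r"[a-z]+", f.read().lower())):
                catalog.setdefault(word, set()).add(name)
    return catalog

# Build once, save to disk; every later search is one dictionary
# lookup that points straight at the right files.
catalog = build_catalog("textfiles")
with open("catalog.idx", "wb") as f:
    pickle.dump(catalog, f)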
HTH -
Paul
Mattias,
Most of these text files range between 10k and 30k, not very
large at all. These files will not change. They are fixed.
It's the number of files (4500) that slows me down...
..4500 reads
..4500 parses
..4500 writes
Gonna find a better way....
I wonder if this would be faster (see the sketch after this list)...
..read x number of files into a buffer
..when the buffer reaches a certain limit,
..parse and write to disk
..loop until all files have been read
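Something like this, maybe (a rough sketch in Python; the parse/flush
routines and the 4 MB limit just stand in for my real code):

import os

BUF_LIMIT = 4 * 1024 * 1024   # flush after ~4 MB of buffered text

def index_in_batches(folder, parse, flush):
    buf, size = [], 0
    for name in os.listdir(folder):
        with open(os.path.join(folder, name)) as f:
            text = f.read()
        buf.append((name, text))
        size += len(text)
        if size >= BUF_LIMIT:     # buffer reached its limit:
            flush(parse(buf))     # one parse + one write per batch
            buf, size = [], 0
    if buf:                       # last partial batch
        flush(parse(buf))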
Thanks for the help guys!
dtSearch is almost $1,000 for the DLL control! (WOW)
$185 just for the program!
Search32 has a fast set of two indexing DLLs, but their requirements
are unreasonable. The price is $142.50, good for only 5 distributions.
You must purchase another 5 for $142.50, or you can purchase 1 at a
time for $37.05. Furthermore, when distributing the program, you
must include their program also! A major disappointment! Their
tech support is email only - a Russian company out of Moscow. The
license agreement is confusing, but I just got this info from them.
You could get full use of the DLLs, but then that's where the
price jumps to $1,000, like dtSearch.
Isys Desktop - although it looks and sounds good, I was unable to find
any price listing on this product out of Canada.
Some of the others I've seen I won't mention because they can't
do it any faster than my own coding!
Once again, thanks for the help!
Mike
About how many total words do these 40 MB of files contain?
Does the index need to be rebuilt each time the program runs, or can you build it once, maintain it on the fly, save it to disk when the program ends/reload next run, and rebuild "on demand?"
Will you search only for equality, or do you intend to search for words "starting with...?"
Depending on these environmental and application factors, something like a hash-total index might be useful.
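For instance, in the equality-only case you could store a 32-bit hash
of each word instead of the word itself. A rough illustration in
Python (the choice of hash here is arbitrary):

import zlib

index = {}   # 32-bit word hash -> set of file numbers

def add_word(word, file_no):
    h = zlib.crc32(word.lower().encode())   # store the hash, not the word
    index.setdefault(h, set()).add(file_no)

def lookup(word):
    # Equality search: hash the query and look it up.
    # ("Starting with..." searches would need the real words, kept sorted.)
    return index.get(zlib.crc32(word.lower().encode()), set())

add_word("Indexing", 1)
print(lookup("indexing"))   # {1}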
MCM
Thanks All, for the responses...
Colin, can you email me an example of what you mean?
Currently, I'm accessing about 4500 text files, a total of
40+MB of data.
Currently with PB (and this is my first attempt at this) and the 4500
files above, it takes about 20 minutes to create a word-index database.
And this, of course, is without using a dictionary.
Would a dictionary of specialized words speed things up?
I just wanted to get some feedback on how some of you would
approach the idea!
I have a search-tool program that will read those 4500 text
files (40MB) - read, write, and index them - in 90 seconds. And of
course the author of the program won't sell the source code!
He's probably using assembly!
However, I'm only about 18 minutes away from him. So in my
spare time, I will dabble with ideas until I find a better
solution.
Thanks again to all who responded!
Mike
[email protected]
Michael,
-How big (in MB) would the total size of the files be?
-How fast is the average computer in question?
-Are the files always accessed over a network?
--If so, what kind of speed on the network?
The reason I'm asking is that if you want it to be updated
very often anyway, would it be better to create an intelligent
and super-fast search engine instead of an index? A search, vs.
an index, would also be less limiting on the procedure of
adding and changing the source documents.
If so, I have lots of example code to start with, such as a
dictionary parser that loads up 4,500 keywords and their
definitions in no time flat.
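Not my actual code, but the shape of it in Python terms (the
tab-separated file layout is just an assumption):

def load_dictionary(path):
    entries = {}
    with open(path) as f:
        for line in f:
            # Assumed format: keyword <TAB> definition, one per line.
            keyword, _, definition = line.rstrip("\n").partition("\t")
            if keyword:
                entries[keyword.lower()] = definition
    return entries

# 4,500 lines load into the hash table almost instantly, and every
# lookup afterwards is a single dictionary probe.
# entries = load_dictionary("keywords.txt")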
If this is completely the wrong approach, as you said at the start,
then sorry! Just offering a second look.
Colin Schmidt
------------------
Colin Schmidt & James Duffy, Praxis Enterprises, Canada
[email protected]
Isys is a document indexing, search, and retrieval program widely
used by county governments to make meeting and hearing minutes
searchable and easy for the public to access.
It works well for us.
http://www.isysdev.com/products/desktop.shtml
for the desktop trial download.
Try it out, or if you want help with writing an indexer for
posting in the source code section, zap me an email.
[email protected]
------------------
Michael,
If AskSam is close to what you need, AskPingi, a free open-source
port to Linux written in C, might give you some ideas on how to start.
Ralph
------------------
Thanks for the response!
Here's part of my problem!
1. Standalone tools, usually used for back-of-the-book indexes,
allow indexers to work from programmed-page-numbered keys.
The indexing is completely separate from the client's data.
2. Embedding tools allow indexing codes to be embedded in the
file, and allow the index locators to be updated as the
text changes.
3. Tagging tools allow indexing codes to be embedded in the files
after the indexing is complete. The indexer inserts numbered
dummy tags in the files, and then builds the index separately.
4. Keywording is primarily hard-coded jumps, similar to HTML jumps,
or it can be inserted as embedded coding and compiled into a list
by your software.
5. Weighted-text search tools, similar to the intelligence in agents
or Microsoft's Office Assistant, usually involve building terminology
sets to help the intelligence work. (Beyond my means and costs.)
6. Automated indexing software builds a concordance, or word list,
from processed files (see the sketch after this list). Although you
can claim these programs build indexes, the actual results are a list
of words and phrases, sometimes useful in the beginning stages of
building an index. Usability tests of these programming tools have
shown that the word lists omit many key phrases, cannot fine-tune
terminology for easy retrieval, and cannot build the needed
hierarchies of ideas that you and I can build from scratch!
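To make item 6 concrete (the sketch mentioned above), a concordance
builder really is this small in Python, which is exactly why its
output is only a bare word list and not a real index:

import re
from collections import Counter

def concordance(texts):
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common()   # word list, most frequent first

print(concordance(["Indexing text files, indexing on the fly."])[:3])
# [('indexing', 2), ('text', 1), ('files', 1)]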
With all that being said.....
Many of these tools are developed in-house to fit the programmer's needs,
or whatever the task at hand is. However, creating such a tool or DLL would
seem foolish if it can only be used for one job!
Any other thoughts from you gurus?
Thanks
Mike
Hi Michael,
This looks promising - http://www.dtsearch.com/
Import them into SQL Server!
Honestly, I don't know, but I'm going to follow this thread because I'm curious.
My first thought, though, would be to get the data out of the "thousands of text files" - does it need to stay in that format?
cheers
------------------
Paul Dwyer
Network Engineer
Aussie in Tokyo
Indexing Text Files on the Fly!
Hi,
I have a client with 1000's of text-based files...
My Question:
How would one begin to create an index (or indexes) for keywords in the
text files, without treating common words (the, and, is, to) as keywords,
etc., and do it on the fly or update it periodically?
Using an index on tens of thousands of text files would be much
quicker than searching each file one by one.....
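Something along these lines is what I'm picturing (just a rough
Python sketch to make the idea concrete; the stop-word list is only
a sample):

import re

STOP_WORDS = {"the", "and", "is", "to", "a", "of", "in", "on"}
index = {}   # keyword -> set of file names

def index_file(name, text):
    for word in set(re.findall(r"[a-z]+", text.lower())):
        if word not in STOP_WORDS:            # skip the, and, is, to...
            index.setdefault(word, set()).add(name)

# "On the fly": call index_file() as files arrive, or rerun it
# periodically over just the files whose timestamps have changed.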
Any thoughts?
Thanks
MWM