Looking for lots of words in lots of files
Robert Bossy
Robert.Bossy at jouy.inra.fr
Wed Jun 18 11:39:01 EDT 2008
brad wrote:
> Just wondering if anyone has ever solved this efficiently... not
> looking for specific solutions tho... just ideas.
>
> I have one thousand words and one thousand files. I need to read the
> files to see if some of the words are in the files. I can stop reading
> a file once I find 10 of the words in it. It's easy for me to do this
> with a few dozen words, but a thousand words is too large for an RE
> and too inefficient to loop, etc. Any suggestions?
The quick answer would be:
grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
where WORDLIST is a file containing the thousand words, one per line.
The more interesting answers would be to use either a suffix tree or an
Aho-Corasick graph.
- The suffix tree is a representation of the target string (your files)
that allows to search quickly for a word. Your problem would then be
solved by 1) building a suffix tree for your files, and 2) search for
each word sequentially in the suffix tree.
- The Aho-Corasick graph is a representation of the query word list that
allows fast scanning of the words on a target string. Your problem would
then be solved by 1) building an Aho-Corasick graph for the list of
words, and 2) scan sequentially each file.
The preference for using either one or the other depends on some details
of your problems: the expected size of target files, the rate of
overlaps between words in your list (are there common prefixes), will
you repeat the operation with another word list or another set of files,
etc. Personally, I'd lean towards Aho-Corasick, it is a matter of taste;
the kind of applications that comes to my mind makes it more practical.
Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also you can
find modules for building both data structures in the python package index.
Cheers,
RB
More information about the Python-list
mailing list