Looking for lots of words in lots of files
Robert.Bossy at jouy.inra.fr
Wed Jun 18 17:42:50 CEST 2008
I forgot to mention another way: put one thousand monkeys to work on it. ;)
Robert Bossy wrote:
> brad wrote:
>> Just wondering if anyone has ever solved this efficiently... not
>> looking for specific solutions tho... just ideas.
>> I have one thousand words and one thousand files. I need to read the
>> files to see if some of the words are in the files. I can stop
>> reading a file once I find 10 of the words in it. It's easy for me to
>> do this with a few dozen words, but a thousand words is too large for
>> an RE and too inefficient to loop, etc. Any suggestions?
> The quick answer would be:
> grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
> where WORDLIST is a file containing the thousand words, one per line.
> The more interesting answers would be to use either a suffix tree or
> an Aho-Corasick graph.
> - The suffix tree is a representation of the target string (your
> files) that allows to search quickly for a word. Your problem would
> then be solved by 1) building a suffix tree for your files, and 2)
> search for each word sequentially in the suffix tree.
> - The Aho-Corasick graph is a representation of the query word list
> that allows fast scanning of the words on a target string. Your
> problem would then be solved by 1) building an Aho-Corasick graph for
> the list of words, and 2) scan sequentially each file.
> The preference for using either one or the other depends on some
> details of your problems: the expected size of target files, the rate
> of overlaps between words in your list (are there common prefixes),
> will you repeat the operation with another word list or another set of
> files, etc. Personally, I'd lean towards Aho-Corasick, it is a matter
> of taste; the kind of applications that comes to my mind makes it more
> Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also you can
> find modules for building both data structures in the python package
More information about the Python-list