Looking for lots of words in lots of files
Robert Bossy
Robert.Bossy at jouy.inra.fr
Wed Jun 18 11:42:50 EDT 2008
I forgot to mention another way: put one thousand monkeys to work on it. ;)
RB
Robert Bossy wrote:
> brad wrote:
>> Just wondering if anyone has ever solved this efficiently... not
>> looking for specific solutions tho... just ideas.
>>
>> I have one thousand words and one thousand files. I need to read the
>> files to see if some of the words are in the files. I can stop
>> reading a file once I find 10 of the words in it. It's easy for me to
>> do this with a few dozen words, but a thousand words is too large for
>> an RE and too inefficient to loop, etc. Any suggestions?
> The quick answer would be:
> grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
> where WORDLIST is a file containing the thousand words, one per line.
>
> The more interesting answers would be to use either a suffix tree or
> an Aho-Corasick graph.
>
> - The suffix tree is a representation of the target string (your
> files) that allows to search quickly for a word. Your problem would
> then be solved by 1) building a suffix tree for your files, and 2)
> search for each word sequentially in the suffix tree.
>
> - The Aho-Corasick graph is a representation of the query word list
> that allows fast scanning of the words on a target string. Your
> problem would then be solved by 1) building an Aho-Corasick graph for
> the list of words, and 2) scan sequentially each file.
>
> The preference for using either one or the other depends on some
> details of your problems: the expected size of target files, the rate
> of overlaps between words in your list (are there common prefixes),
> will you repeat the operation with another word list or another set of
> files, etc. Personally, I'd lean towards Aho-Corasick, it is a matter
> of taste; the kind of applications that comes to my mind makes it more
> practical.
>
> Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also you can
> find modules for building both data structures in the python package
> index.
>
> Cheers,
> RB
> --
> http://mail.python.org/mailman/listinfo/python-list
>
More information about the Python-list
mailing list