Looking for lots of words in lots of files

Robert Bossy Robert.Bossy at jouy.inra.fr
Wed Jun 18 17:42:50 CEST 2008


I forgot to mention another way: put one thousand monkeys to work on it. ;)

RB

Robert Bossy wrote:
> brad wrote:
>> Just wondering if anyone has ever solved this efficiently... not 
>> looking for specific solutions tho... just ideas.
>>
>> I have one thousand words and one thousand files. I need to read the 
>> files to see if some of the words are in the files. I can stop 
>> reading a file once I find 10 of the words in it. It's easy for me to 
>> do this with a few dozen words, but a thousand words is too large for 
>> an RE and too inefficient to loop, etc. Any suggestions?
> The quick answer would be:
>    grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
> where WORDLIST is a file containing the thousand words, one per line.
>
> The more interesting answers would be to use either a suffix tree or 
> an Aho-Corasick graph.
>
> - The suffix tree is a representation of the target string (your 
> files) that allows to search quickly for a word. Your problem would 
> then be solved by 1) building a suffix tree for your files, and 2) 
> search for each word sequentially in the suffix tree.
>
> - The Aho-Corasick graph is a representation of the query word list 
> that allows fast scanning of the words on a target string. Your 
> problem would then be solved by 1) building an Aho-Corasick graph for 
> the list of words, and 2) scan sequentially each file.
>
> The preference for using either one or the other depends on some 
> details of your problems: the expected size of target files, the rate 
> of overlaps between words in your list (are there common prefixes), 
> will you repeat the operation with another word list or another set of 
> files, etc. Personally, I'd lean towards Aho-Corasick, it is a matter 
> of taste; the kind of applications that comes to my mind makes it more 
> practical.
>
> Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also you can 
> find modules for building both data structures in the python package 
> index.
>
> Cheers,
> RB
> -- 
> http://mail.python.org/mailman/listinfo/python-list
>




More information about the Python-list mailing list