Efficient grep using Python?
Christos TZOTZIOY Georgiou
tzot at sil-tec.gr
Fri Dec 17 17:08:30 CET 2004
On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that P at draigBrady.com
might have written:
>> The point is that when you have 100,000s of records, this grep becomes
>> really slow?
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.
You should also use the -F flag that Pádraig suggests, since the B file
contains fixed strings, not regular expressions.
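Since the subject asks about Python: a minimal sketch of the same
filtering with a set, assuming whole-line matches are what you want.
That corresponds to `grep -Fxvf B A`; plain -F without -x also matches
substrings, which this does not do. The function name lines_not_in is
made up for illustration.

```python
def lines_not_in(a_lines, b_lines):
    """Return the lines of A that do not occur as whole lines in B.

    Order and duplicates of A are preserved, as grep would do.
    """
    # Set lookups are O(1) on average, so the whole pass is roughly
    # O(len(A) + len(B)).
    b_set = set(line.rstrip("\n") for line in b_lines)
    return [line for line in a_lines if line.rstrip("\n") not in b_set]
```

With real files you would pass the two open file objects, e.g.
lines_not_in(open("A"), open("B")).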
>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>$ wc -l A B
> 45427 A
> 45427 B
>$ export LANG=C
>$ time grep -Fvf B A
>$ time sort A B B | uniq -u
>$ rpm -q grep coreutils
sf, you had better run your own benchmarks on your machine (there is
quick sample code in other posts of mine and Pádraig's), since on my
test machine the numbers are the reverse of Pádraig's: grep takes half
the time of the sort | uniq pipeline.
package versions (on SuSE 9.1 64-bit):
$ rpm -q grep coreutils
$ echo $LANG
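If you prefer to drive the benchmark from Python, something like the
following could do. It is only a sketch: time_command is a hypothetical
helper, it measures wall-clock time (not the user/sys split that `time`
reports), and it inherits whatever LANG your shell has set.

```python
import subprocess
import time


def time_command(argv, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        # Output is discarded; check=False because grep exits with
        # status 1 when it selects no lines, which is not an error here.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=False)
        best = min(best, time.perf_counter() - start)
    return best
```

You would then compare time_command(["grep", "-Fvf", "B", "A"]) against
time_command(["sh", "-c", "sort A B B | uniq -u"]) on your own files.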
Caution: both solutions are interchangeable only as long as you don't
have duplicate lines in the A file. If you do, use the grep version,
since `uniq -u` drops every line that occurs more than once in A.
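To see the difference with duplicates, here is a small emulation of
both pipelines in Python. It is a sketch on made-up data, and the grep
side assumes whole-line matching (as with -x); the function names are
invented for the example.

```python
from collections import Counter


def sort_uniq_u(a_lines, b_lines):
    """Emulate `sort A B B | uniq -u`: keep lines seen exactly once."""
    counts = Counter(a_lines)
    counts.update(b_lines)  # B is fed in twice, so any line of B
    counts.update(b_lines)  # has a count of at least 2 and is dropped
    return sorted(line for line, n in counts.items() if n == 1)


def grep_fxv(a_lines, b_lines):
    """Emulate `grep -Fxvf B A`: keep lines of A that are not in B."""
    b_set = set(b_lines)
    return [line for line in a_lines if line not in b_set]


a = ["apple", "pear", "apple", "kiwi"]  # "apple" is duplicated in A
b = ["kiwi"]

print(sort_uniq_u(a, b))  # ['pear']
print(grep_fxv(a, b))     # ['apple', 'pear', 'apple']
```

The duplicated "apple" survives grep but vanishes from the sort | uniq
output, even though it is not in B at all.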
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...