Efficient grep using Python?

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Fri Dec 17 17:08:30 CET 2004


On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that P at draigBrady.com
might have written:

sf:

>sf wrote:
>> The point is that when you have 100,000s of records, this grep becomes
>> really slow?
>
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.

You also should use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.

>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>sort method)).
>
>$ wc -l A B
>   45427 A
>   45427 B
>
>$ export LANG=C
>
>$ time grep -Fvf B A
>real    0m0.437s
>
>$ time sort A B B | uniq -u
>real    0m0.262s
>
>$ rpm -q grep coreutils
>grep-2.5.1-16.1
>coreutils-4.5.3-19

sf, you better do your own benchmarks (there is quick, sample code in
other posts of mine and Pádraig's) on your machine, since on my test
machine the numbers are reversed re to these of Pádraig's (grep takes
half the time).

package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

language:
$ echo $LANG
en_US.UTF-8

Caution: both solutions are interexchangeable as long as you don't have
duplicate lines in the A file.  If you do, use the grep version.
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...



More information about the Python-list mailing list