Efficient grep using Python?
Christos TZOTZIOY Georgiou
tzot at sil-tec.gr
Thu Dec 16 17:31:53 CET 2004
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P at draigBrady.com
might have written:
>>>>Essentially, want to do efficient grep, i.e. from A remove those lines which
>>>>are also present in file B.
[p at draig]
>>>You could implement elegantly using the new sets feature
>>>For reference here is the unix way to do it:
>>>sort a b b | uniq -u
>> No, like I just wrote in another post, he wants
>> $ grep -vf B A
>> I think that
>> $ sort A B B | uniq -u
>> can be abbreviated to
>> $ sort -u A B B
>> which is the union rather than the intersection of the files
[P at draig]
>wrong. Notice the -u option to uniq.
I see your point. That's a new-to-me use of uniq: I started using
Unices long before the GNU versions of the tools, but then again, I might
simply have missed the -u option.
$ cat A
$ cat B
$ time sort A B B | uniq -u
$ time grep -vf B A
So I stand corrected that your solution does *not* give the union.
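
To make the semantics of `sort A B B | uniq -u` concrete, here is a minimal Python sketch of what the pipeline computes. The function name `sort_uniq_u` is hypothetical, and the sketch assumes whole-line comparison on in-memory lists. Because every line of B is fed in twice, no line of B can occur exactly once, so only lines unique to A survive. Note that a line repeated within A is dropped as well, which `grep -vf B A` would not do.

```python
from collections import Counter

def sort_uniq_u(a_lines, b_lines):
    """Emulate `sort A B B | uniq -u` on lists of lines (illustrative only)."""
    # Feed A once and B twice, as the shell command does.
    counts = Counter(a_lines + b_lines + b_lines)
    # uniq -u keeps only lines that occur exactly once in the sorted stream.
    # sorted() orders by code point; sort(1) may order differently per locale.
    return sorted(line for line, n in counts.items() if n == 1)
```

For example, with A holding the lines c, a, b and B holding b, d, the result is a, c: the line b is removed because it is in B, and d never appears because it is counted twice.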
>> wastes some time by considering B twice
>I challenge you to a benchmark :-)
Well, the numbers I provided above are almost meaningless with such a
small set (and they could easily be reversed; I just kept the
convenient-to-me first run :). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
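>>> import random
>>> # The line that defined x is missing here; the next line is an ASSUMED stand-in
>>> x = ["%d\n" % random.randrange(10**9) for i in range(100000)]
>>> open("/tmp/A", "w").writelines(x)
>>> open("/tmp/B", "w").writelines(x[:1000])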
$ time sort A B B | uniq -u >/dev/null
$ time grep -Fvf B A >/dev/null
(Yes, I cheated by adding the -F (fixed strings, no regular expressions) flag :)
>> and finally destroys original line
>> order (should it be important).
That's our final agreement :)
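
For the original question, a set-based filter in Python keeps the line order of A and reads each input only once. What follows is a minimal sketch, not a definitive implementation: the name `lines_not_in` is hypothetical, and it assumes whole-line, fixed-string matching, which is closer to `grep -Fvxf B A` than to the substring matching of `grep -Fvf B A`.

```python
def lines_not_in(a_lines, b_lines):
    """Yield the lines of a_lines that do not occur in b_lines, in order."""
    # One pass over B builds a set, giving constant-time average lookups.
    unwanted = set(b_lines)
    # One pass over A; the original order (and any duplicates) is preserved.
    for line in a_lines:
        # Lines are compared verbatim, including the trailing newline, so a
        # final line without a newline will not match its terminated twin.
        if line not in unwanted:
            yield line
```

Since file objects iterate over their lines, something like `sys.stdout.writelines(lines_not_in(open("A"), open("B")))` should work from a script, at the cost of holding all of B in memory.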
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...