newbie: grep -1f file1 file2 in Python

Alex Martelli aleaxit at yahoo.com
Tue Jan 16 08:26:05 EST 2001


<buster2642 at my-deja.com> wrote in message
news:9402sg$d2n$1 at nnrp1.deja.com...
> A common search I do with a shell (csh) script is
>
> % grep -1f file1 file2 > file3
>
> where file1 is a list, one item per line, of items
> to match in file2, displaying both the line before
> and the line after the matched line, writing all 3
> lines to file3.
>
> Pydiom suggestion?  Thanks!

One possibility, depending on what you need to do when
(e.g.) two consecutive lines match, etc:

import re, sys

if len(sys.argv)!=4:
    print "Usage: %s file1 file2 file3" % sys.argv[0]
    sys.exit(1)

res = re.compile('|'.join(
    [line[:-1] for line in open(sys.argv[1]).readlines()]))
out = open(sys.argv[3], "w")

lines = ['==Start\n'] + open(sys.argv[2]).readlines() + ['==Finis\n']
for i in range(1,len(lines)-1):
    if res.search(lines[i]):
        out.writelines(lines[i-1:i+2])
out.close()


Now this is pretty rough -- it assumes the regular expressions
listed, one per line, in file1, will have no unparenthesized '|'
(one could parenthesize to remedy), it will output artificial
'pseudolines' of '==Start' and '==Finis' for matches at first
and last lines as 'previous'/'following', it will output sets
of 3 lines including repetitions when two consecutive lines
match, no line numbers, no indication of which of the re's in
file1 caused the matching, etc, etc.

However, this is the kind of rough-and-ready sketch one tends
to start with, experimentally, before iteratively refining
this until the desired results are obtained.  Python makes it
very easy to enter such productive 'tinkering' mode -- rather
than hoping to nail down all of the specs BEFORE coding, one
successively refines various prototypes.  This is a must if
the specs must come from somebody else -- as they see various
possible input-to-output mappings, they'll actually _change_
their ideas about what exactly it is that they want -- but,
even when working alone, it's quite handy for most tasks.

For example, we might decide that we never want a line output
twice; we do want line numbers, but don't care about what re
matched; and we want to write to standard output, rather than
give an error, if only 2 files are named -- and read from
standard input, rather than give an error, if only 1 file
(the one containing the RE's) is named.

Mutating the previous sketch to one that handles command line
arguments better suggests a little refactoring to handle more
explicitly the various files that are in play -- this will
also have the good side effect of giving any errors such as
'file not found' as soon as possible:

import re, sys

argc = len(sys.argv)
if argc<2 or argc>4:
    print "Usage: %s file1 [file2 [file3]]" % sys.argv[0]
    sys.exit(1)

file_res = open(sys.argv[1])
if argc<3: file_inp = sys.stdin
else: file_inp = open(sys.argv[2])
if argc<4: file_out = sys.stdout
else: file_out = open(sys.argv[3], "w")

res = re.compile('|'.join(
    ['('+line[:-1]+')' for line in file_res.readlines()]))

lines = ['==Start\n'] + file_inp.readlines() + ['==Finis\n']
for i in range(1,len(lines)-1):
    if res.search(lines[i]):
        file_out.writelines(lines[i-1:i+2])
file_out.close()


This is a little bit better -- although more refinement
may also be needed around the re.compile, to diagnose
things properly if invalid re's are in file1 -- but we'll
focus on the 'no lines are to be output twice' requirement.
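
As an aside on diagnosing invalid re's: a minimal sketch, assuming we
compile each line of file1 on its own first so a bad pattern can be
reported with its line number.  The name compile_patterns and its return
shape are hypothetical, and skipping blank lines is my own assumption
(an empty alternative would match every line of input).

```python
import re

def compile_patterns(pattern_lines):
    """Return (regex, errors) for a list of pattern lines.

    regex is the parenthesized alternation of all valid patterns, or
    None if no valid pattern was found; errors is a list of
    (line_number, pattern, message) tuples for patterns re rejected.
    """
    good = []
    errors = []
    for number, line in enumerate(pattern_lines):
        pattern = line.rstrip('\n')
        if not pattern:
            # Assumption: ignore blank lines rather than match everything.
            continue
        try:
            re.compile(pattern)
        except re.error as exc:
            errors.append((number + 1, pattern, str(exc)))
        else:
            good.append('(' + pattern + ')')
    if not good:
        return None, errors
    return re.compile('|'.join(good)), errors
```

The caller can then print each entry of errors to standard error and
decide whether to carry on with the patterns that did compile.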

Thinking about doing it in a single sequential pass shows
up some complication -- when we find a match, we have to
worry about whether the previous line, and/or this one,
have already been output.  Why worry when we can avoid
it?  "Do the simplest thing that can possibly work" to the
rescue: let's do TWO sequential passes -- one just to find
out what will be printed, one to do the actual printing;
in the first one we just set flags (no problem if a flag
is set twice!-).  As 'flags' we can simply use another
'parallel' list of the same length, initially all 0's
(nothing that must be printed has yet been seen).  Changing
only the last part, from 'lines =' onwards, then:

lines = ['==Start\n'] + file_inp.readlines() + ['==Finis\n']
numlines = len(lines)
must_print = [0]*numlines
for i in range(1,numlines-1):
    if res.search(lines[i]):
        must_print[i-1:i+2] = [1]*3
for i in range(1,numlines-1):
    if must_print[i]:
        file_out.write("%d: %s" % (i,lines[i]))
file_out.close()

We've taken the opportunity to not-output the artificial
lines (we still keep them in the lines list so we need no
special tests at start or end, but we don't output them).
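
To try the two-pass idea without any files, here is a sketch of the same
logic wrapped in a function that works on plain lists.  The name
select_with_context is hypothetical, not part of the script above; the
padding with artificial first and last lines is kept exactly as in the
script.

```python
import re

def select_with_context(text_lines, regex):
    """Return (line_number, line) pairs for each matching line and its
    immediate neighbours, every line at most once, in input order."""
    # Artificial first and last lines: no special cases at the edges.
    lines = ['==Start\n'] + list(text_lines) + ['==Finis\n']
    numlines = len(lines)
    must_print = [0] * numlines
    # Pass 1: only set flags; setting a flag twice is harmless.
    for i in range(1, numlines - 1):
        if regex.search(lines[i]):
            must_print[i-1:i+2] = [1] * 3
    # Pass 2: collect flagged lines, leaving out the artificial ones.
    return [(i, lines[i]) for i in range(1, numlines - 1) if must_print[i]]
```

Because of the artificial first line, the index i is already the 1-based
line number of the input, which is why it can be used directly for the
numbering.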

We may be closer to the desired results now -- but maybe
we also want a blank line output to delimit blocks of
consecutive lines from each other.  That's easy, we need
change only the final for loop:

for i in range(1,numlines-1):
    if must_print[i]:
        file_out.write("%d: %s" % (i,lines[i]))
        if i+1<numlines-1 and not must_print[i+1]:
            file_out.write("\n")

And so on, and so forth.  For example, we may decide the
output is now OK but the test before writing the blank
line separator is too messy -- let's simplify it:

must_print[numlines-1] = 0
for i in range(1,numlines-1):
    if must_print[i]:
        file_out.write("%d: %s" % (i,lines[i]))
        if not must_print[i+1]:
            file_out.write("\n")
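
A sketch for checking what this last loop produces, assuming the padded
lines list and the parallel must_print flags already exist; it collects
the output in a list instead of writing to file_out.  The name
format_blocks is hypothetical.

```python
def format_blocks(lines, must_print):
    """Return the output strings for the flagged lines, with a blank
    line written after each block of consecutive flagged lines."""
    numlines = len(lines)
    flags = list(must_print)      # work on a copy of the caller's flags
    flags[numlines - 1] = 0       # the artificial last line never prints
    out = []
    for i in range(1, numlines - 1):
        if flags[i]:
            out.append("%d: %s" % (i, lines[i]))
            if not flags[i + 1]:
                out.append("\n")
    return out
```

One difference worth knowing about: with the flag for the artificial last
line forced to 0, a block that ends on the very last input line is also
followed by a blank line, which the longer test in the previous version
did not write.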

Better, I think; the simplicity of our code is quite as
important as the exact format of the output, etc, since
no doubt sooner or later we'll decide to go back to our
script for some tweak, and, if we can't understand what
it's doing after a while has passed, we'll be in trouble!

For the same purpose, inserting docstrings and comments
may be good, although commenting code that is too complex
is NOT as good as making it simpler.


Alex





