[Tutor] Finding a specific line in a body of text

Steven D'Aprano steve at pearwood.info
Mon Mar 12 04:28:07 CET 2012


On Mon, Mar 12, 2012 at 02:56:36AM +0100, Robert Sjoblom wrote:

> In the file I'm parsing, I'm looking for specific lines. I don't know
> the content of these lines but I do know the content that appears two
> lines before. As such I thought that maybe I'd flag for a found line
> and then flag the next two lines as well, like so:
> 
> if keyword in line:
>   flag = 1
>   continue
> if flag == 1 or flag == 2:
>   if flag == 1:
>     flag = 2
>     continue
>   if flag == 2:
>     list.append(line)


You haven't shown us the critical part: how are you getting the lines in 
the first place?

(Also, you shouldn't shadow built-ins like list as you do above, unless 
you know what you are doing. If you have to ask "what's shadowing?", you 
don't :)
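To see what shadowing costs you, here's a toy sketch (not Robert's code): 
once the name "list" is rebound, the built-in type is no longer reachable 
by that name.

```python
# A toy illustration of shadowing: after rebinding the name `list`,
# the built-in type is no longer reachable by that name.
list = []                 # `list` now names this instance, not the type
list.append("a line")     # method calls still work...
try:
    list("abc")           # ...but calling it as the type now fails
    msg = None
except TypeError as err:
    msg = str(err)
print(msg)                # 'list' object is not callable
```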


> This, however, turned out to be unacceptably slow; this file is 1.1M
> lines, and it takes roughly a minute to go through. I have 450 of
> these files; I don't have the luxury to let it run for 8 hours.

Really? And how many hours have you spent trying to speed this up? Two? 
Three? Seven? And if it takes people two or three hours to answer your 
question, and you another two or three hours to read it, it would have 
been faster to just run the code as given :)

I'm just saying.

Since you don't show the actual critical part of the code, I'm going to 
make some simple suggestions that you may or may not have already tried.

- don't read files off USB or CD or over the network, because it will 
likely be slow; if you can copy the files onto the local hard drive, 
performance may be better;

- but if you include the copying time, it might not make that much 
difference;

- can you use a dedicated tool for this, like Unix grep or even Perl, 
which are optimised for high-speed text manipulation?
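For example, awk can do the skip-two-lines dance in one pass (KEYWORD and 
data.txt are stand-ins for your actual key and file):

```shell
# On each line matching KEYWORD, getline twice advances past the next
# line and loads the second line after the match, which we then print.
awk '/KEYWORD/ { getline; getline; print }' data.txt
```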

- if you need to stick with Python, try this:

# untested
results = []
fp = open('filename')
for line in fp:
    if key in line:  
        # Found key, skip the next line and save the following.
        _ = next(fp, '')
        results.append(next(fp, ''))
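
The same idea, wrapped in a generator so you can test it on any iterable 
of lines (the function name is my own):

```python
def lines_after_key(lines, key):
    """Yield the second line after each line containing key."""
    it = iter(lines)
    for line in it:
        if key in line:
            next(it, '')        # skip the line right after the match
            yield next(it, '')  # keep the one after that
```

Then results = list(lines_after_key(open('filename'), key)) does the same 
job as the loop above.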

By the way, the above assumes you are running Python 2.6 or better. In 
Python 2.5, you can define this function:

def next(iterator, default):
    try:
        return iterator.next()
    except StopIteration:
        return default

but it will likely be a little slower.


Another approach may be to read the whole file into memory in one big 
chunk. 1.1 million lines, by (say) 50 characters per line comes to about 
53 MB per file, which should be small enough to read into memory and 
process it in one chunk. Something like this:

# again untested
text = open('filename').read()
results = []
i = 0
while i < len(text):
    i = text.find(key, i)
    if i == -1: break
    i += len(key)  # skip the rest of the key
    # skip to the start of the next line, twice
    i = text.find('\n', i) + 1
    i = text.find('\n', i) + 1
    # now find the following newline, and save everything up to that
    p = text.find('\n', i)
    if p == -1:  p = len(text)
    results.append(text[i:p])
    i = p  # skip ahead


This will likely break if the key is found without two more lines 
following it.
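For what it's worth, here is that scan as a self-contained function (my 
own wrapper, with the same caveat), so you can sanity-check the index 
bookkeeping on a small sample before unleashing it on 450 files:

```python
def second_line_after(text, key):
    """Return the second line after each occurrence of key in text."""
    results = []
    i = 0
    while True:
        i = text.find(key, i)
        if i == -1:
            break
        i = text.find('\n', i) + 1  # start of the line after the match
        i = text.find('\n', i) + 1  # start of the second line after it
        p = text.find('\n', i)      # end of that line
        if p == -1:
            p = len(text)
        results.append(text[i:p])
        i = p
    return results
```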



-- 
Steven

