[Tutor] Finding a specific line in a body of text

Steven D'Aprano steve at pearwood.info
Mon Mar 12 06:10:25 CET 2012


On Mon, Mar 12, 2012 at 05:46:39AM +0100, Robert Sjoblom wrote:
> > You haven't shown us the critical part: how are you getting the lines in
> > the first place?
> 
> Ah, yes --
> with open(address, "r", encoding="cp1252") as instream:
>     for line in instream:

Seems reasonable.


> > (Also, you shouldn't shadow built-ins like list as you do above, unless
> > you know what you are doing. If you have to ask "what's shadowing?", you
> > don't :)
> Maybe I should have said list_name.append() instead; sorry for that.

No problems :) Shadowing builtins is fine if you know what you're doing; 
it's the people who do it without realising it who end up causing 
themselves trouble.
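
For instance, the trouble typically shows up like this:

list = [1, 2, 3]    # rebinds the name, shadowing the built-in list type
list("abc")         # TypeError: 'list' object is not callable

Once the name is rebound, every later use of list() in that namespace 
finds your list, not the built-in.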


> >> This, however, turned out to be unacceptably slow; this file is 1.1M
> >> lines, and it takes roughly a minute to go through. I have 450 of
> >> these files; I don't have the luxury to let it run for 8 hours.
> >
> > Really? And how many hours have you spent trying to speed this up? Two?
> > Three? Seven? And if it takes people two or three hours to answer your
> > question, and you another two or three hours to read it, it would have
> > been faster to just run the code as given :)
> Yes, for one set of files. Since I don't know how many sets of ~450
> files I'll have to run this over, I think that asking for help was a
> rather acceptable loss of time. I work on other parts while waiting
> anyway, or try and find out on my own as well.

All very reasonable. So long as you have considered the alternatives.


> > - if you need to stick with Python, try this:
> >
> > # untested
> > results = []
> > fp = open('filename')
> > for line in fp:
> >     if key in line:
> >         # Found key, skip the next line and save the following.
> >         _ = next(fp, '')
> >         results.append(next(fp, ''))
> 
> Well that's certainly faster, but not fast enough.

You may have to consider that your bottleneck is not the speed of your 
Python code, but the speed of getting data off the disk into memory. In 
which case, you may be stuck.
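
For the record, here is that sketch fleshed out into a self-contained 
form (still untested, and KEY is a placeholder for whatever string you 
are actually searching for):

KEY = 'your search string here'   # placeholder
results = []
with open('filename', encoding='cp1252') as fp:
    for line in fp:
        if KEY in line:
            next(fp, '')                   # skip the line after the match
            results.append(next(fp, ''))   # keep the one after that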

I suggest you time how long it takes to process a file using the above, 
then compare it to how long just reading the file takes:

# time.clock() measured CPU time on Unix and is gone in modern Python;
# perf_counter() gives the wall-clock time an I/O benchmark needs.
from time import perf_counter

t = perf_counter()
with open('filename', encoding='cp1252') as fp:
    for line in fp:
        pass
print(perf_counter() - t)

Run both timings a couple of times and pick the smallest number, to 
minimise caching effects and other extraneous influences.
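
If you would rather let the standard library handle the repetition, 
timeit can do the run-it-a-few-times-and-take-the-minimum dance for you 
(a sketch, using the same placeholder filename):

import timeit

def read_file():
    with open('filename', encoding='cp1252') as fp:
        for line in fp:
            pass

# Three runs of a single pass each; min() keeps the fastest run, 
# discarding runs inflated by extraneous load.
print(min(timeit.repeat(read_file, number=1, repeat=3)))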

Then do the same using a system tool. You're using Windows, right? I 
can't tell you how to do it in Windows, but on Linux I'd say:

time cat 'filename' > /dev/null

which should give me a rough-and-ready estimate of the raw speed of 
reading data off the disk. If that speed is not *significantly* better 
than what you are getting in Python, then there simply isn't any 
feasible way to speed the code up appreciably. (Except maybe to get 
faster hard drives or smaller files.)
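
If you don't have a convenient system tool (on Windows, say), a rough 
stand-in is to time reading the file in large binary chunks from Python 
itself, which skips the line-splitting and decoding work:

from time import perf_counter

t = perf_counter()
with open('filename', 'rb') as fp:
    while fp.read(1 << 20):   # 1 MiB at a time, discarded as we go
        pass
print(perf_counter() - t)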

[...]
> How do you approach something like this, when someone tells you "we
> need you to parse these files. We can't tell you how they're
> structured so you'll have to figure that out yourself."? 

Bitch and moan quietly to myself, and then smile when I realise I'm 
being paid by the hour.

Reverse-engineering a file structure without any documentation is rarely 
simple or fast.



-- 
Steven

