[Tutor] Finding a specific line in a body of text

Mon Mar 12 06:07:19 CET 2012

Erik Rose gave a good talk today at PyCon about a parsing library he's
working on called Parsimonious. You could maybe look into what he's doing
there, and see if that helps you any... Follow him on Twitter at @erikrose
to see when his session's video is up. His session was named "Parsing
Horrible Things in Python".
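
In the meantime, here is a rough sketch of the "find the key, skip one
line, keep the next" loop quoted below, wrapped in a generator so it can
be run over one file after another. The function name lines_after_key
and the skip parameter are made up for illustration, and it assumes the
key never appears inside the lines being skipped. I have not timed it
against your data.

```python
from itertools import islice

def lines_after_key(lines, key, skip=1):
    """Yield the line found `skip` lines past each line containing `key`.

    `lines` is any iterable of strings, e.g. an open text file.
    Hypothetical helper: the name and the `skip` parameter are assumptions.
    """
    it = iter(lines)
    for line in it:
        if key in line:
            # Drop `skip` lines, then take the following one.
            # Falls back to '' if the input ends first.
            yield next(islice(it, skip, None), '')

# Possible usage with a file (path and key are placeholders):
# with open(path, "r", encoding="cp1252") as instream:
#     results = list(lines_after_key(instream, key))
```

Because it only needs an iterable, the same function can be tried on a
short list of strings before pointing it at a 1.1M-line file.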
On Mar 11, 2012 9:48 PM, "Robert Sjoblom" <robert.sjoblom at gmail.com> wrote:

> > You haven't shown us the critical part: how are you getting the lines in
> > the first place?
>
> Ah, yes --
> with open(address, "r", encoding="cp1252") as instream:
>    for line in instream:
>
> > (Also, you shouldn't shadow built-ins like list as you do above, unless
> > you know what you are doing. If you have to ask "what's shadowing?", you
> > don't :)
> Maybe I should have said list_name.append() instead; sorry for that.
>
> >> This, however, turned out to be unacceptably slow; this file is 1.1M
> >> lines, and it takes roughly a minute to go through. I have 450 of
> >> these files; I don't have the luxury to let it run for 8 hours.
> >
> > Really? And how many hours have you spent trying to speed this up? Two?
> > Three? Seven? And if it takes people two or three hours to answer your
> > question, and you another two or three hours to read it, it would have
> > been faster to just run the code as given :)
> Yes, for one set of files. Since I don't know how many sets of ~450
> files I'll have to run this over, I think that asking for help was a
> rather acceptable loss of time. I work on other parts while waiting
> anyway, or try and find out on my own as well.
>
> > - if you need to stick with Python, try this:
> >
> > # untested
> > results = []
> > with open('filename') as fp:  # closes the file when the block ends
> >    for line in fp:
> >        if key in line:
> >            # Found key, skip the next line and save the following.
> >            next(fp, '')
> >            results.append(next(fp, ''))
>
> Well that's certainly faster, but not fast enough.
> Oh well, I'll continue looking for a solution -- because even with the
> speedup it's unacceptable. I'm hoping against hope that I only have to
> run it against the last file of each batch of files, but if it turns
> out that I don't, I'm in for some exciting days of finding stuff out.
> Thanks for all the help though, it's much appreciated!
>
> How do you approach something like this, when someone tells you "we
> need you to parse these files. We can't tell you how they're
> structured so you'll have to figure that out yourself."? It's just so
> much text that it's hard to get a grasp on the structure, and
> there's so much information contained in there as well; this is just
> the first part of what I'm afraid will be many. I'll try not to bother
> this list too much though.
> --
> best regards,
> Robert S.