[Tutor] how to parse a multiple character words from plaintext

Kent Johnson kent37 at tds.net
Sun Feb 24 15:21:10 CET 2008


---- John Gunderman <meanburrito920 at yahoo.com> wrote: 
> I am parsing the output of mork.pl, which is a Mork (the Mozilla format) parser. I don't know Perl, so I decided to write a Python script to do what I wanted, which basically is to create a dictionary listing each site and its corresponding values instead of outputting plaintext. Unfortunately, the output of mork.pl is 5000+ lines, so reading the whole document wouldn't be that efficient.

OK, I looked briefly at mork.pl. You should be able to process its output line by line with something like this:

for line in history_file:
  if not line.strip():
    continue # skip blank lines; may not be needed
  time, count, url = line.split()
  # do something with time, count, url
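
To get the dictionary you describe, the loop body could store each line's fields under its URL. This is only a sketch: parse_history is a name I made up, and it assumes every line holds exactly three whitespace-separated fields (time, count, url) with the first two being integers. The sample lines are invented just to show the shape of the result. Iterating over an open file object reads one line at a time, so you can pass history_file in directly without calling readlines().

```python
def parse_history(lines):
    """Build {url: (time, count)} from any iterable of text lines."""
    sites = {}
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        time, count, url = fields
        # Assumption: time and count are plain integers.
        sites[url] = (int(time), int(count))
    return sites


# Invented sample data; an open file object would work the same way.
sample = [
    "1203800000\t3\thttp://www.python.org/\n",
    "\n",
    "1203800500\t12\thttp://example.com/\n",
]
history = parse_history(sample)
print(history["http://www.python.org/"])  # (1203800000, 3)
```

Note that split() with no arguments already breaks the line at tabs and spaces, so there is no need to collect characters one at a time before calling int().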

Kent

> Currently it uses:
>         for line in history_file.readlines():
> but I don't know if this has to read all the lines before it goes through them. If it does, would it be more efficient to use
>         while line != '\t':
>             line = history_file.readline()
> I was thinking of just appending each character to a string until it sees '\t', and then using int() on the string, but is there an easier way?
> 
> John
> 
> ----- Original Message ----
> From: Kent Johnson <kent37 at tds.net>
> To: John Gunderman <meanburrito920 at yahoo.com>
> Cc: tutor at python.org
> Sent: Saturday, February 23, 2008 3:43:44 AM
> Subject: Re: [Tutor] how to parse a multiple character words from plaintext
> 
> John Gunderman wrote:
> > I am looking to parse plaintext from a document. However, I am 
> > confused about the actual methodology of it. This is because some of the 
> > words will be multiple digits or characters. However, I don't know the 
> > length of the words before the parse. Is there a way to somehow have 
> > open() grab something until it sees a \t or ' '? I was thinking I could 
> > have it count ahead the number of spaces till the stopping point and 
> > then parse till that point using read(), but that seems sort of 
> > inefficient. Is there a better way to pull this off? Thanks in advance.
> 
> How big is the file? Can you just read the whole document and parse the 
> resulting string? Or read by lines?
> 
> Depending on how complex your parsing is, you might want to use 
> pyparsing or one of the other Python parser libraries.
> http://pyparsing.wikispaces.com/
> http://nedbatchelder.com/text/python-parsers.html
> 
> Kent


