[Tutor] how to parse a multiple character words from plaintext

Sun Feb 24 15:01:27 CET 2008

---- John Gunderman <meanburrito920 at yahoo.com> wrote: 
> I am parsing the output of mork.pl, which is a Mork (the Mozilla format) parser. I don't know Perl, so I decided to write a Python script to do what I wanted, which basically is to create a dictionary listing each site and its corresponding values instead of outputting plain text. Unfortunately, the output of mork.pl is 5000+ lines, so reading the whole document wouldn't be that efficient.

If you have enough memory for it to fit, reading the whole file at once is fine.

> Currently it uses:
>         for line in history_file.readlines():
> but I don't know if this has to read all the lines before it goes through them.

Yes, readlines() reads the entire file into a list before the loop starts.
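
If that ever matters, a file object can be looped over directly and it hands back one line at a time. Here is a minimal sketch; the helper name count_nonblank_lines and the throwaway demo file are made up for illustration and only stand in for the real mork.pl output.

```python
import os
import tempfile

def count_nonblank_lines(path):
    """Walk the file one line at a time (hypothetical helper)."""
    count = 0
    with open(path) as history_file:
        for line in history_file:   # no readlines(): lines are fetched as needed
            if line.strip():
                count += 1
    return count

# Demo with a small temporary file standing in for the mork.pl output.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\n\nsecond\nthird\n")
demo_path = tmp.name
demo_count = count_nonblank_lines(demo_path)
os.remove(demo_path)
```

For a 5000-line file either form should be fast enough, so this is mostly a matter of habit.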

> if it does, then would it be more efficient to use
>         while line != '/t':
>             line = history_file.readline()    

Probably not. But why so much emphasis on efficiency? Get the program working first. Only if it is too slow should you worry about efficiency. Processing a 5000-line file should not be a problem in Python.

> I was thinking of just appending each character to the string until it sees '/t', and then using int() on the string, but is there an easier way?

It would really help to see a sample of the data and the results you want from it. There are many ways to parse data in Python, from simple string operations to regular expressions to full-blown parsers. Without knowing what you want to do it is impossible to suggest an appropriate method.
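
Purely as an illustration of the simple-string-operations end of that range, here is a sketch that guesses at the data. It assumes the '/t' in the question means the tab character '\t' and that each line holds tab-separated fields; the sample lines, the field order and the helper name parse_line are all invented, since the real mork.pl output has not been shown.

```python
# Hypothetical sample lines: the real mork.pl output is not shown in this
# thread, so the field layout (count, timestamp, URL) is an assumption.
sample_lines = [
    "3\t1203861600\thttp://example.com/\n",
    "12\t1203865200\thttp://example.org/page\n",
]

def parse_line(line):
    """Split one tab-separated line into (url, values); hypothetical helper."""
    fields = line.rstrip("\n").split("\t")
    visits = int(fields[0])      # int() on a whole field, no per-character loop
    timestamp = int(fields[1])
    url = fields[2]
    return url, {"visits": visits, "timestamp": timestamp}

# Build the dictionary of sites keyed by URL.
sites = {}
for line in sample_lines:
    url, values = parse_line(line)
    sites[url] = values
```

If the real data looks different, the split() call and the field indexes are the parts that would have to change.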

Kent