[Tutor] how to parse a multiple character words from plaintext

Sat Feb 23 13:13:40 CET 2008

"John Gunderman" <meanburrito920 at yahoo.com> wrote 

>I am looking to parse a plaintext from a document. 

When you say "a document" what kind of document do you 
mean? Is the document also in plain text, like HTML, or is 
it a binary format like MS Word?

> some of the words will be multiple digits or characters. 
> However, I don't know the length of the words before the parse. 

Look at the regular expression module, re.
Regular expressions allow you to define patterns and 
then search for those patterns within a string.
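
A minimal sketch of that idea, assuming the words are separated by 
tabs or spaces. The sample string and the pattern are made up for 
illustration, they are not taken from your data:

```python
import re

# Hypothetical sample line: fields separated by a tab or by spaces.
line = "alpha\t42 beta  7"

# A "word" here is one or more characters that are neither tab nor space,
# so its length does not need to be known in advance.
word_pattern = re.compile(r"[^\t ]+")

words = word_pattern.findall(line)
print(words)
```

findall() returns every non-overlapping match as a list of strings, 
so each word comes back whole however long it is.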

> Is there a way to somehow have open() grab something 
> until it sees a /t or ' '? 

open() doesn't grab anything, it simply makes the file 
available for reading. You can then use read() to either 
read the whole file or a fixed number of characters.
You can also use readline() to read a single line or 
readlines() to read the entire file into a list of lines.
Which you use will depend a lot on the format of 
your data.
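
To make the difference concrete, here is a small sketch of the three 
calls. It writes its own throwaway file first, so the file name and 
its contents are assumptions made purely for the example:

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as f:
    whole = f.read()             # the entire file as one string

with open(path) as f:
    chunk = f.read(5)            # at most 5 characters
    rest_of_line = f.readline()  # the remainder of the current line

with open(path) as f:
    lines = f.readlines()        # list of lines, newlines included

os.remove(path)                  # tidy up the throwaway file
```

Note that read(n) and readline() both continue from wherever the 
previous read stopped, which is why rest_of_line only holds the tail 
of the first line.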

> I was thinking I could have it count ahead the 
> number of spaces till the stopping point and then 
> parse till that point using read(), but that seems sort 
> of inefficient. 

It may or may not be efficient but it's certainly complex 
since it requires you to know in advance what the next 
bit of data looks like. If it follows a set pattern that may 
be OK. If possible you would probably be better off reading 
the data line by line and parsing each line. However, if 
the data spills across lines that will probably not be viable.
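
A rough sketch of the line-by-line approach, assuming each line is 
complete in itself. io.StringIO stands in for a real file here and 
the data is invented:

```python
import io

# io.StringIO behaves like a file opened with open(); hypothetical data.
data = io.StringIO("alpha\t42 beta\ngamma 7\tdelta\n")

parsed = []
for line in data:
    # split() with no argument splits on any run of whitespace,
    # so tabs, spaces and the trailing newline are all handled.
    parsed.append(line.split())

print(parsed)
```

For plain whitespace-separated words the string split() method is 
often enough and no counting ahead is needed.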

If the file is not too big (a few MB, say) then simply 
reading the entire file as a single string and using 
regular expressions may be the easiest way.
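
One possible shape for that, as a sketch only. The function name 
words_in_file is hypothetical, and the pattern assumes a word is any 
run of non-whitespace characters:

```python
import re

def words_in_file(path):
    """Return every whitespace-delimited word in the file at path."""
    with open(path) as f:
        text = f.read()  # whole file in memory, fine for a few MB
    # \S+ matches a run of non-whitespace characters; newlines count as
    # whitespace, so line breaks act as separators just like tabs and spaces.
    return re.findall(r"\S+", text)
```

Because the whole text is one string, the pattern is free to match 
across line boundaries if you later need something more elaborate 
than single words.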

HTH,

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld