[Tutor] how to parse a multiple character words from plaintext
Alan Gauld
alan.gauld at btinternet.com
Sat Feb 23 13:13:40 CET 2008
"John Gunderman" <meanburrito920 at yahoo.com> wrote
>I am looking to parse a plaintext from a document.
When you say "a document" what kind of document do you
mean? Is the document also in plain text, like HTML, or is
it a binary format like MS Word?
> some of the words will be multiple digits or characters.
> However, I don't know the length of the words before the parse.
Look at the regular expression module, re.
Regular expressions allow you to define patterns and
then search for those patterns within a string.
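As a rough sketch of the idea (the sample text and the two
patterns below are made up for illustration, they are not
taken from your data), something along these lines picks out
words without knowing their length beforehand:

```python
import re

# Hypothetical sample text; words are separated by spaces or a tab.
text = "alpha 42\tbeta   12345 x"

# \S+ matches one or more non-whitespace characters, so the length
# of each word does not need to be known in advance.
words = re.findall(r"\S+", text)

# \d+ matches only the runs of digits.
numbers = re.findall(r"\d+", text)

print(words)
print(numbers)
```

The patterns you actually need will depend on what counts as
a "word" in your document.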
> Is there a way to somehow have open() grab something
> until it sees a /t or ' '?
open() doesn't grab anything, it simply makes the file
available for reading. You can then use read() to either
read the whole file or a fixed number of characters.
You can also use readline() to read a single line or
readlines() to read the entire file into a list of lines.
Which you use will depend a lot on the format of
your data.
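To make the difference between those calls concrete, here is
a small sketch. It writes its own throwaway file first (the
file contents and variable names are invented for the
example), so it should run as it stands:

```python
import os
import tempfile

# Write a small hypothetical sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

with open(path) as f:
    chunk = f.read(5)            # a fixed number of characters
    rest_of_line = f.readline()  # the remainder of the current line
    remaining = f.readlines()    # every remaining line, as a list

with open(path) as f:
    whole = f.read()             # the whole file as one string

os.remove(path)                  # clean up the throwaway file
```

Note that each call carries on from wherever the previous one
stopped reading.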
> I was thinking I could have it count ahead the
> number of spaces till the stopping point and then
> parse till that point using read(), but that seems sort
> of inefficient.
It may or may not be efficient, but it's certainly complex
since it requires you to know in advance what the next
bit of data looks like. If it follows a set pattern that may
be OK. If possible, you would probably be better off reading
the data line by line and parsing each line. However, if
the data spills across lines, that will probably not be viable.
If the file is not too big (a few MB, say) then simply
reading the entire file as a single string and using
regular expressions may be the easiest way.
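A minimal sketch of the line-by-line approach, assuming the
words are separated by spaces or tabs. The io.StringIO object
is only a stand-in for a file returned by open(), and the
data in it is invented:

```python
import io

# io.StringIO stands in for a file opened with open(); the data is hypothetical.
sample = io.StringIO("name\t42 abc\n7   longerword\n")

parsed = []
for line in sample:
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so word lengths need not be known.
    parsed.append(line.split())

print(parsed)
```

Each entry in parsed is then the list of words found on one
line.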
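For data that does spill across lines, a sketch of the
whole-file approach might look like this. The "name = value"
layout and the pattern are assumptions made up for the
example, and io.StringIO again stands in for a real file:

```python
import io
import re

# Hypothetical data in which one value has spilled onto the next line.
source = io.StringIO("width =\n 10\nheight = 20\n")

# Read everything into a single string.
content = source.read()

# \s* also matches newlines, so the pattern can follow a value
# that continues on the following line.
pairs = re.findall(r"(\w+)\s*=\s*(\w+)", content)

print(pairs)
```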
HTH,
--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld