[Tutor] Parsing text file
Dave Kuhlman
dkuhlman at rexx.com
Mon May 14 20:16:19 CEST 2007
On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
> some text
> some more text
> someline1
> someline2
> someline3
> more text
> more text
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into
> #lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     pass  # do things with each line
> Can anyone point me at a better way to do this?
Possibly overkill, but ...
How much fun are you interested in having? Others have given you
the "low fun" easy way. Now ask yourself whether this task is
likely to become more complex (the interesting parts more hidden in
a more complex grammar) and perhaps you also can't wait to have
some fun. If so, consider this suggestion:
1. Write grammar rules that describe your input text. In your
case, those rules might look something like the following:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= InterestingChunk*
2. For each rule, write a Python function that tries to recognize
what the rule describes. To do its job, each function might
call other functions that implement other grammar rules and
might call a tokenizer function (see below) when it needs
another token from the input stream. Example:
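    def InterestingChunk_reco(self):
        # A chunk must open with a begin token ...
        if self.token_type == Tok_Begin:
            # ... contain an interesting sequence ...
            if self.InterestingSeq_reco():
                # ... and close with an end token.
                if self.token_type == Tok_End:
                    return True
            # Begin token seen, but the rest did not match.
            self.Error('bad interesting sequence')
        return False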
3. Write a tokenizer function. Each time this function is called,
it returns the next "token" (probably a word) from the input
stream and a code that indicates the token type. Recognizer
functions call this tokenizer function each time another token
is needed. In your case there might be 3 token types: (1) plain
word, (2) BeginTok, and (3) EndTok.
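
A minimal sketch of such a tokenizer, written as a generator. The
token type names (TOK_WORD and friends), the extra end-of-input
token, and the function name tokenize are my own assumptions for
illustration; only the two delimiter spellings come from the
original question.

```python
import re

# Token type codes. These names and values are illustrative assumptions.
TOK_WORD, TOK_BEGIN, TOK_END, TOK_EOF = 'WORD', 'BEGIN', 'END', 'EOF'

# Delimiter spellings taken from the original question.
DELIMITERS = {
    'BEGIN_INTERESTING_BIT': TOK_BEGIN,
    'END_INTERESTING_BIT': TOK_END,
}


def tokenize(text):
    """Yield (token_type, word) pairs, followed by one EOF token."""
    # A "token" here is any run of non-whitespace characters.
    for match in re.finditer(r'\S+', text):
        word = match.group()
        # Anything that is not a delimiter is a plain word.
        yield DELIMITERS.get(word, TOK_WORD), word
    yield TOK_EOF, ''
```

A recognizer would call next() on the generator each time it needs
another token. Because the input is split on whitespace, it does not
matter whether a delimiter sits on its own line or in the middle of
one.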
If you do the above, you have just written your first recursive
descent parser.
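
To make the three steps concrete, here is a small self-contained
sketch. It is not the attached file, just one way the pieces might
fit together. The class and method names are hypothetical, the
tokenizer is reduced to str.split(), and I have assumed that an
interesting sequence may contain plain words as well as nested
chunks, which is a slight extension of the grammar rules above.

```python
BEGIN = 'BEGIN_INTERESTING_BIT'
END = 'END_INTERESTING_BIT'


class ParseError(Exception):
    pass


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, text):
        self.tokens = text.split()   # whitespace-separated words
        self.pos = 0

    def peek(self):
        """Return the current token, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        self.pos += 1

    def seq(self):
        """Seq ::= {InterestingChunk | UninterestingChunk}*"""
        chunks = []
        while self.peek() is not None:
            if self.peek() == BEGIN:
                chunks.append(self.interesting_chunk())
            elif self.peek() == END:
                raise ParseError('END without BEGIN at token %d' % self.pos)
            else:
                self.advance()       # uninteresting word: skip it
        return chunks

    def interesting_chunk(self):
        """InterestingChunk ::= BeginToken InterestingSeq EndToken"""
        self.advance()               # consume the begin token
        items = self.interesting_seq()
        if self.peek() != END:
            raise ParseError('bad interesting sequence: missing END')
        self.advance()               # consume the end token
        return items

    def interesting_seq(self):
        """InterestingSeq ::= {Word | InterestingChunk}*  (assumed)"""
        items = []
        while self.peek() not in (None, END):
            if self.peek() == BEGIN:
                items.append(self.interesting_chunk())   # nested chunk
            else:
                items.append(self.peek())
                self.advance()
        return items
```

Calling Parser(text).seq() returns one list per top-level interesting
chunk, with nested chunks appearing as nested lists. An unmatched
delimiter raises ParseError instead of being silently ignored.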
Then, the next time you are at a party, beer bar, or wedding, any
time the conversation comes even remotely close to the subject of
parsing text, you say, "Well, for that kind of problem I usually
write a recursive descent parser. It's the most powerful way and
the only way to go. ..." Now, that's how to impress your friends
and relations.
But, seriously, recursive descent parsers are quite easy and are a
useful technique to have in your tool bag. And, like I said above:
It's fun.
Besides, if your problem becomes more complex, and, for example,
the input is not quite so line oriented, you will need a more
powerful approach.
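
To see where the line-oriented approach runs out, here is a sketch of
a typical flag-based scanner. The function name and its defaults are
my own assumptions; it is meant only as a stand-in for the kind of
"easy way" solution mentioned above.

```python
def interesting_lines(lines,
                      begin='BEGIN_INTERESTING_BIT',
                      end='END_INTERESTING_BIT'):
    """Yield the lines between a begin line and the next end line.

    Assumes each delimiter stands alone on its own line.
    """
    inside = False
    for line in lines:
        stripped = line.strip()
        if stripped == begin:
            inside = True
        elif stripped == end:
            inside = False
        elif inside:
            yield line.rstrip('\n')


# Delimiters on their own lines: works as intended.
clean = ['some text', 'BEGIN_INTERESTING_BIT', 'someline1',
         'someline2', 'END_INTERESTING_BIT', 'more text']
print(list(interesting_lines(clean)))

# A delimiter in the middle of a line is never recognized, so the
# scanner keeps yielding lines past the intended end of the chunk.
messy = ['BEGIN_INTERESTING_BIT', 'ddd eee',
         'hhh iii END_INTERESTING_BIT jjj kkk', 'ppp qqq rrr']
print(list(interesting_lines(messy)))
```

The second call returns every line after the begin marker, including
the one that contains the end marker, because the comparison is made
against whole lines. A token-based parser does not have that
limitation.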
Wikipedia has a better explanation than mine plus an example and
links: http://en.wikipedia.org/wiki/Recursive_descent_parser
I've attached a sample solution and sample input.
Also, be aware that there are parse generators for Python.
Dave Kuhlman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: recursive_descent_parser.py
Type: text/x-python
Size: 3385 bytes
Desc: not available
Url : http://mail.python.org/pipermail/tutor/attachments/20070514/ea5a84b3/attachment.py
-------------- next part --------------
aaa bbb ccc
ddd eee
hhh iii END_INTERESTING_BIT jjj kkk
ppp qqq rrr
sss ttt
lll mmm