[Tutor] Parsing text file

Dave Kuhlman dkuhlman at rexx.com
Mon May 14 20:16:19 CEST 2007

On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that 
> are bordered by BEGIN/END delimiting phrases, like this:
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
> What I have been doing is clumsy, involving converting to a string and 
> slicing out the required section using split('DELIMITER'): 
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> # Slice out the interesting section with split, then split again
> # into lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
> Can anyone point me at a better way to do this?
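[One simple fix (the "low fun" way others on the list have suggested) is a
state-flag scan over the lines, with no joining or slicing.  A minimal
sketch, assuming the delimiters sit on lines of their own:]

```python
import sys

def interesting_lines(path):
    """Yield the lines between the BEGIN/END delimiter lines."""
    inside = False
    with open(path) as infile:
        for line in infile:
            if 'BEGIN_INTERESTING_BIT' in line:
                inside = True
            elif 'END_INTERESTING_BIT' in line:
                inside = False
            elif inside:
                yield line.rstrip('\n')

if __name__ == '__main__' and len(sys.argv) > 1:
    for line in interesting_lines(sys.argv[1]):
        print(line)        # "do things" with each line
```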

Possibly over-kill, but ...

How much fun are you interested in having?  Others have given you
the "low fun" easy way.  Now ask yourself whether this task is
likely to become more complex (the interesting parts more hidden in
a more complex grammar) and perhaps you also can't wait to have
some fun.  If so, consider this suggestion:

1. Write grammar rules that describe your input text.  In your
   case, those rules might look something like the following:

       Seq ::= {InterestingChunk | UninterestingChunk}*
       InterestingChunk ::= BeginToken InterestingSeq EndToken
       InterestingSeq ::= Word*

2. For each rule, write a Python function that tries to recognize
   what the rule describes.  To do its job, each function might
   call other functions that implement other grammar rules and
   might call a tokenizer function (see below) when it needs
   another token from the input stream.  Example:

       def InterestingChunk_reco(self):
           if self.token_type == Tok_Begin:
               if self.InterestingSeq_reco():
                   if self.token_type == Tok_End:
                       return True
                   self.Error('bad interesting sequence')
           return False

3. Write a tokenizer function.  Each time this function is called,
   it returns the next "token" (probably a word) from the input
   stream and a code that indicates the token type.  Recognizer
   functions call this tokenizer function each time another token
   is needed.  In your case there might be 3 token types: (1) plain
   word, (2) BeginTok, and (3) EndTok.
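[A tokenizer along those lines might look like this -- a sketch only; the
token-type names and the generator shape are illustrative, not taken from
the attached solution:]

```python
# Illustrative token-type codes (names are assumptions, not from the
# attachment).
Tok_Word, Tok_Begin, Tok_End, Tok_EOF = range(4)

def tokenize(text):
    """Yield (token_type, token) pairs, one per whitespace-separated word."""
    for word in text.split():
        if word == 'BEGIN_INTERESTING_BIT':
            yield Tok_Begin, word
        elif word == 'END_INTERESTING_BIT':
            yield Tok_End, word
        else:
            yield Tok_Word, word
    yield Tok_EOF, ''      # sentinel so recognizers can detect end of input
```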

If you do the above, you have just written your first recursive
descent parser.
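[Putting the three steps together, here is one self-contained sketch of
such a parser.  The class and method names are my own, not necessarily
those in the attached file:]

```python
class Parser:
    """Minimal recursive descent parser for the grammar above."""

    def __init__(self, text):
        self.tokens = iter(text.split() + ['<EOF>'])
        self.interesting = []          # words collected between BEGIN and END
        self.advance()

    def advance(self):
        """Tokenizer step: fetch the next word and classify it."""
        self.token = next(self.tokens)
        if self.token == 'BEGIN_INTERESTING_BIT':
            self.token_type = 'Begin'
        elif self.token == 'END_INTERESTING_BIT':
            self.token_type = 'End'
        elif self.token == '<EOF>':
            self.token_type = 'EOF'
        else:
            self.token_type = 'Word'

    def error(self, msg):
        raise SyntaxError(msg)

    # Seq ::= {InterestingChunk | UninterestingChunk}*
    def seq_reco(self):
        while self.token_type != 'EOF':
            if self.token_type == 'Begin':
                self.interesting_chunk_reco()
            else:
                self.advance()         # UninterestingChunk: skip a plain word
        return True

    # InterestingChunk ::= BeginToken InterestingSeq EndToken
    def interesting_chunk_reco(self):
        self.advance()                 # consume BeginToken
        self.interesting_seq_reco()
        if self.token_type != 'End':
            self.error('missing END_INTERESTING_BIT')
        self.advance()                 # consume EndToken

    # InterestingSeq ::= Word*
    def interesting_seq_reco(self):
        while self.token_type == 'Word':
            self.interesting.append(self.token)
            self.advance()

p = Parser('aaa BEGIN_INTERESTING_BIT bbb ccc END_INTERESTING_BIT ddd')
p.seq_reco()
print(p.interesting)                   # the words inside the delimiters
```

[Each `*_reco` method mirrors one grammar rule, and `advance` plays the
role of the tokenizer, which is all a recursive descent parser is.]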

Then, the next time you are at a party, beer bar, or wedding, any
time the conversation comes even remotely close to the subject of
parsing text, you say, "Well, for that kind of problem I usually
write a recursive descent parser.  It's the most powerful way and
the only way to go.  ..." Now, that's how to impress your friends
and relations.

But, seriously, recursive descent parsers are quite easy to
write and a useful technique to have in your tool bag.  And,
like I said above: it's fun.

Besides, if your problem becomes more complex (for example, if
the input is not quite so line-oriented), you will need a more
powerful approach.

Wikipedia has a better explanation than mine plus an example and
links: http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parser generators for Python.


Dave Kuhlman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: recursive_descent_parser.py
Type: text/x-python
Size: 3385 bytes
Desc: not available
Url : http://mail.python.org/pipermail/tutor/attachments/20070514/ea5a84b3/attachment.py 
-------------- next part --------------

aaa bbb ccc

ddd eee
hhh iii END_INTERESTING_BIT jjj kkk
ppp qqq rrr
sss ttt
lll mmm
