Parsing text

Tim Chase python.list at tim.thechases.com
Wed May 6 14:59:19 EDT 2009


> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
> 
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

The first thing that occurs to me is that this should likely be a 
raw string to get those backslashes into the regexp.  Compare:

   print "^\s\[0-9, i{1,4}, v]"
   print r"^\s\[0-9, i{1,4}, v]"
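
As it happens, Python leaves unrecognized escapes such as \s and
\[ alone, so those two lines print the same thing for this
particular pattern.  The raw-string habit starts to matter as soon
as the pattern uses an escape that Python does recognize, such as
\b.  A small sketch of the difference (the sample line is made up
for illustration, and it is written so that it also runs on
Python 3):

```python
import re

# "\b" in a normal string is ONE character (a backspace);
# in a raw string it is TWO characters (backslash + b),
# which the re module reads as a word boundary.
plain = "\bScene"
raw = r"\bScene"

print(len(plain))   # 6
print(len(raw))     # 7

text = "Act I Scene 2"   # hypothetical sample line

# The non-raw pattern looks for a literal backspace before "Scene".
print(re.search(plain, text))          # None

# The raw pattern finds "Scene" at a word boundary.
print(re.search(raw, text).group(0))   # Scene
```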

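Without an excerpt of the actual text (or at least the lead-in 
for each scene), it's hard to tell whether this regex finds what 
you expect.  It doesn't look like your regexp finds what you may 
think it does (it looks like you're using commas to separate 
alternatives, but a regexp matches them as literal commas, and 
the "\[" matches a literal bracket rather than opening a 
character class).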
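
To see what the pattern from the question really accepts, it can
be tried against a few strings.  The sketch below also shows one
guess at what was intended; the "guess" pattern and the sample
chunks are assumptions, since the real layout of the play text
isn't known.  The check for None is there because match()
returns None on failure, which is a likely source of a NoneType
error.

```python
import re

# The pattern from the question, written as a raw string.
original = re.compile(r"^\s\[0-9, i{1,4}, v]", re.I)

# "\[" is a literal bracket, so no character class is ever opened;
# the "0-9", the commas and the spaces are matched literally too.
print(original.match(" [0-9, ii, v]") is not None)   # True
print(original.match(" 3") is not None)              # False
print(original.match(" iv") is not None)             # False

# One guess at the intent: optional whitespace, then either digits
# or a run of roman-numeral letters (assumed layout).
guess = re.compile(r"^\s*(\d+|[ivx]+)", re.I)

results = []
for chunk in (" 3\nsome text", " iv\nmore text", "no number here"):
    m = guess.match(chunk)
    # Guard against None before calling .group().
    results.append(m.group(1) if m else None)

print(results)   # ['3', 'iv', None]
```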

Just so you're aware, your split is a bit fragile too, in case 
any lines contain "Scene".  However, with a proper regexp, you 
can even use it to split the scenes *and* tag the scene-number. 
Something like

   >>> import re
   >>> s = """Scene [42]
   ... this is stuff in the 42nd scene
   ... Scene [IIV]
   ... stuff in the other scene
   ... """
   >>> r = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
   >>> r.split(s)[1:]
   ['42', '\nthis is stuff in the 42nd scene\n', 'IIV', '\nstuff in the other scene\n']
   >>> def grouper(iterable, groupby):
   ...     iterable = iter(iterable)
   ...     while True:
   ...             yield [iterable.next() for _ in range(groupby)]
   ...

   >>> for scene, content in grouper(r.split(s)[1:], 2):
   ...     print "<div class='scene'><h1>%s</h1><p>%s</p></div>" % (scene, content)
   ...
   <div class='scene'><h1>42</h1><p>
   this is stuff in the 42nd scene
   </p></div>
   <div class='scene'><h1>IIV</h1><p>
   stuff in the other scene
   </p></div>
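
Since the split list alternates number, content, number,
content, the pairs can also be formed by zipping two slices of
it, and xml.sax.saxutils can escape any &, < or > in the play
text before it goes into markup.  A sketch along those lines,
assuming the same "Scene [n]" heading layout as above; the names
scene_pairs and to_xml and the <scene> element are made up here
and are only one possible output shape.  It is written to run on
Python 3 as well, where iterable.next() is spelled
next(iterable).

```python
import re
from xml.sax.saxutils import escape, quoteattr

SCENE_RE = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)

def scene_pairs(text):
    """Return a list of (number, content) tuples, one per scene heading."""
    # Drop whatever precedes the first heading, then pair up
    # even-indexed items (numbers) with odd-indexed items (contents).
    parts = SCENE_RE.split(text)[1:]
    return list(zip(parts[0::2], parts[1::2]))

def to_xml(text):
    """Build one <scene> element per scene (hypothetical format)."""
    return "\n".join(
        "<scene number=%s>%s</scene>" % (quoteattr(num), escape(body.strip()))
        for num, body in scene_pairs(text)
    )

sample = """Scene [42]
this is stuff in the 42nd scene
Scene [IIV]
stuff in the other scene & more
"""

print(to_xml(sample))
```

The slicing approach trades the generator for two list copies,
which should be fine for something the size of a play.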

Play accordingly.

-tkc