[Tutor] module to parse XMLish text?

Fri Jan 14 10:24:34 CET 2011

Terry Carroll, 14.01.2011 03:55:
> Does anyone know of a module that can parse out text with XML-like tags as
> in the example below? I emphasize the "-like" in "XML-like". I don't think
> I can parse this as XML (can I?).
>
> Sample text between the dashed lines::
>
> ---------------------------------
> Blah, blah, blah
> <AAA>
> <BING ZEBRA>
> <BANG ROOSTER>
> <BOOM GARBONZO BEAN>
> <BLIP>SOMETHING ELSE</BLIP>
> <BASH>SOMETHING DIFFERENT</BASH>
> </AAA>
> ---------------------------------

You can't parse this as XML because it's not XML. The three initial child 
tags are not properly closed.
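
To see this for yourself, here is a minimal sketch that feeds the tagged part of the sample to the standard library's ElementTree parser. The helper name is_well_formed is made up for illustration. Note that the parser stops at the first problem it meets, which need not be the unclosed tags: bare words such as ZEBRA inside a tag are not valid XML attributes either.

```python
import xml.etree.ElementTree as ET

# The tagged part of the sample from the question (without the leading free text).
SAMPLE = """<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>"""


def is_well_formed(text):
    """Return True if text parses as an XML document, False otherwise.

    Hypothetical helper, only here to demonstrate the parser's reaction.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(SAMPLE))
print(is_well_formed("<AAA><BLIP>SOMETHING ELSE</BLIP></AAA>"))
```

The first call reports that the sample is rejected; the second shows that the same structure is accepted once every tag is closed.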

If the format is really as you describe, i.e. one line per tag, regular
expressions will work nicely. Something like this (untested):

   import re
   parse_tag_and_text = re.compile(
         # accept a tag name and then either space+text+'>', '>'+text+'</...',
         # or a bare '>' (container tags such as <AAA>); '/' is excluded from
         # the name so that closing tags like </AAA> do not match
         r'^<([^>/ ]+)(?: ([^>]+)>\s*|>([^<]+)</.*|>\s*)$').match

   special_tags = set(['AAA'])

   result = {}
   # the_file: any iterable of lines, e.g. an open file object
   for line in the_file:
       match = parse_tag_and_text(line)
       if match:
           if match.group(1) in special_tags:
               pass # do something special?
           else:
               # don't care which format, take whatever text group matched
               result[match.group(1)] = match.group(2) or match.group(3)
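
For anyone who wants to try the idea on the sample from the question, here is a self-contained sketch of the same line-by-line approach. The names TAG_LINE, SAMPLE and parse_lines are made up for illustration, and the pattern also accepts a bare tag such as <AAA>, which is an assumption about how container tags should be handled.

```python
import re

# Tag name, then one of: " text>", ">text</...", or a bare ">".
# The bare-tag branch is an assumption so that a line like <AAA> matches at all;
# "/" is excluded from the name so that closing tags like </AAA> are skipped.
TAG_LINE = re.compile(r'^<([^>/ ]+)(?: ([^>]+)>\s*|>([^<]+)</.*|>\s*)$')

SAMPLE = """\
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def parse_lines(lines, special_tags=("AAA",)):
    """Return {tag: text} for every matching line, skipping special tags."""
    result = {}
    for line in lines:
        match = TAG_LINE.match(line)
        if match is None:
            continue  # free text or a closing tag such as </AAA>
        tag = match.group(1)
        if tag in special_tags:
            continue  # container tag, carries no text of its own
        # Only one of the two text groups can have matched.
        result[tag] = match.group(2) or match.group(3)
    return result


# splitlines(True) keeps the line endings, as iterating over a file would.
parsed = parse_lines(SAMPLE.splitlines(True))
for tag, text in sorted(parsed.items()):
    print(tag, "=>", text)
```

Because the result is a plain dict, a tag name that occurs on more than one line keeps only its last value; collecting the values in lists would be the obvious change if repeated tags matter.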

Stefan