[Tutor] module to parse XMLish text?
Stefan Behnel
stefan_ml at behnel.de
Fri Jan 14 10:24:34 CET 2011
Terry Carroll, 14.01.2011 03:55:
> Does anyone know of a module that can parse out text with XML-like tags as
> in the example below? I emphasize the "-like" in "XML-like". I don't think
> I can parse this as XML (can I?).
>
> Sample text between the dashed lines::
>
> ---------------------------------
> Blah, blah, blah
> <AAA>
> <BING ZEBRA>
> <BANG ROOSTER>
> <BOOM GARBONZO BEAN>
> <BLIP>SOMETHING ELSE</BLIP>
> <BASH>SOMETHING DIFFERENT</BASH>
> </AAA>
> ---------------------------------
You can't parse this as XML because it's not XML. The three initial child
tags are not properly closed.
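
To see this for yourself, here is a small sketch using the standard library's xml.etree.ElementTree. The helper name is_well_formed is made up for this illustration, and SAMPLE is just the tag lines from your example. The parser gives up at the first line it cannot accept.

```python
import xml.etree.ElementTree as ET

# The tag lines from the sample (without the leading "Blah, blah, blah").
SAMPLE = """<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(SAMPLE))                                    # False
print(is_well_formed("<AAA><BLIP>SOMETHING ELSE</BLIP></AAA>"))  # True
```
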
If the format is really as you describe, i.e. one tag per line, regular
expressions will work nicely. Something like this (untested):

    import re

    parse_tag_and_text = re.compile(
        # accept a tag name and then either space+text or '>'+text+'</...'
        r'^<([^> ]+)(?: ([^>]+)>\s*|>([^<]+)</.*)$').match

    special_tags = set(['AAA'])

    result = {}
    # the_file: an open file object (or any other iterable of lines)
    for line in the_file:
        match = parse_tag_and_text(line)
        if match:
            if match.group(1) in special_tags:
                pass  # do something special?
            else:
                # don't care which format, take whatever text group matched
                result[match.group(1)] = match.group(2) or match.group(3)

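
As written, the pattern above does not match a bare tag line such as <AAA>, so the special_tags branch would never be reached for it. The following is a self-contained sketch of one possible adjustment, run against the sample text. The third alternative in the pattern, the exclusion of '/' from the tag name, and the names parse_line, parse_tags and SAMPLE are all assumptions added for this illustration, not part of the original suggestion.

```python
import re

# Same idea as the pattern above, plus a third alternative ('>' alone) so
# that bare opening tags such as <AAA> also match. '/' is excluded from
# the tag name so that closing tags like </AAA> are skipped.
parse_line = re.compile(
    r'^<([^> /]+)(?: ([^>]+)>\s*|>([^<]+)</.*|>\s*)$').match

SAMPLE = """Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def parse_tags(lines, special_tags=frozenset(['AAA'])):
    """Map each tag name to its text, skipping container tags."""
    result = {}
    for line in lines:
        match = parse_line(line)
        if not match:
            continue  # plain text or a closing tag
        name = match.group(1)
        if name in special_tags:
            continue  # container tag; handle it separately if needed
        # take whichever text group matched
        result[name] = match.group(2) or match.group(3)
    return result


parsed = parse_tags(SAMPLE.splitlines(True))
for name in sorted(parsed):
    print('%s = %s' % (name, parsed[name]))
```

With the sample above, this collects BING, BANG, BOOM, BLIP and BASH and ignores the free text line and the AAA container. Note that a tag name occurring twice would overwrite the earlier entry, since the result is a plain dict.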
Stefan