[Tutor] module to parse XMLish text?

Terry Carroll carroll at tjc.com
Fri Jan 14 23:42:55 CET 2011


On Fri, 14 Jan 2011, Stefan Behnel wrote:

> Terry Carroll, 14.01.2011 03:55:
>> Does anyone know of a module that can parse out text with XML-like tags as
>> in the example below? I emphasize the "-like" in "XML-like". I don't think
>> I can parse this as XML (can I?).
>> 
>> Sample text between the dashed lines::
>> 
>> ---------------------------------
>> Blah, blah, blah
>> <AAA>
>> <BING ZEBRA>
>> <BANG ROOSTER>
>> <BOOM GARBONZO BEAN>
>> <BLIP>SOMETHING ELSE</BLIP>
>> <BASH>SOMETHING DIFFERENT</BASH>
>> </AAA>
>> ---------------------------------
>
> You can't parse this as XML because it's not XML. The three initial child 
> tags are not properly closed.

Yeah, that's what I figured.

> If the format is really as you describe, i.e. one line per tag, regular 
> expressions will work nicely.

Now there's an idea!  I hadn't thought of using regexes, probably because 
I'm terrible at all but the simplest ones.

As it happens, I'm only interested in four of the tags' contents, so I
could probably manage to write a series of regexes that even I could 
maintain, one for each piece of data I want to extract. If I try to write 
a grand unified regex, I'm bound to shoot myself in the foot.
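
A minimal sketch of that one-regex-per-field idea, assuming the format
really is one tag per line as in the sample. Which four tags matter isn't
stated anywhere in this thread, so the names chosen in PATTERNS (BING,
BOOM, BLIP, BASH) and the extract() helper are purely illustrative:

```python
import re

# The sample text from the original question.
SAMPLE = """\
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""

# One small pattern per field of interest (the choice of tags is a guess).
# Two shapes appear in the sample:
#   <NAME value>          value lives inside the tag
#   <NAME>value</NAME>    value lives between open and close tags
PATTERNS = {
    "BING": re.compile(r"^<BING\s+(.*?)>[ \t]*$", re.MULTILINE),
    "BOOM": re.compile(r"^<BOOM\s+(.*?)>[ \t]*$", re.MULTILINE),
    "BLIP": re.compile(r"^<BLIP>(.*?)</BLIP>[ \t]*$", re.MULTILINE),
    "BASH": re.compile(r"^<BASH>(.*?)</BASH>[ \t]*$", re.MULTILINE),
}


def extract(text):
    """Return {field: value}, with None for any field not found."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


fields = extract(SAMPLE)
for name, value in fields.items():
    print(name, "=", value)
```

Each pattern can be read and tested on its own, and a field that is
missing from the input simply comes back as None rather than breaking
the other matches.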

Thanks very much.

On Fri, 14 Jan 2011, Karim wrote:

> from xml.etree.ElementTree import ElementTree

I don't think straight XML parsing will work on this, as it's not valid 
XML; it just looks XML-like enough to cause confusion.
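
A quick way to confirm this, sketched with the standard library's
xml.etree.ElementTree on the tagged part of the sample (the FRAGMENT and
parsed names are just for illustration):

```python
import xml.etree.ElementTree as ET

# The tagged portion of the sample text from the original question.
FRAGMENT = """\
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""

try:
    ET.fromstring(FRAGMENT)
    parsed = True
except ET.ParseError as err:
    # Tags such as <BING ZEBRA> are neither closed nor valid attribute
    # syntax, so the parser rejects the input.
    parsed = False
    print("Not well-formed XML:", err)
```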

