xml.sax and reserved XML characters in the data stream

Thu Feb 6 14:18:15 EST 2003

"Eric I. Arnoth" <earnoth at comcast.net> wrote:
> Here's the referenced chunk of the file:
>
============================================================================
====
> jefferson6:18pm[144]> head -128 furball.short.xml | tail -1
>                         <setting name="Misc information on News
> server[entry]:From address :" value="email <listme at foobar.org>"/>
>
============================================================================
====
>
> I have no choice but to deal with this, as the XML file is produced by a
> third-party application and I have no control over the format or the
> content.

The third-party application generating the file with the unescaped "<" in an
attribute value is buggy, as you seem to realize. Some would say the file in
question is technically not XML at all.

An XML parser is required to abort processing if the entity it is parsing is
not well-formed. There's no reasonable way around it. You'll have to
preprocess it to make sure it's well-formed before feeding it to the parser.
Here's the meat of what you need:

import re
attrPat = re.compile(r' (\w+)="([^"]+)"')
f = open('furball.short.xml')
badxml = f.read()
f.close()
betterxml = attrPat.sub(lambda m: m.expand(r'
\1="\2"').replace('&','&').replace('<','<'), badxml)

You'll have to get a bit trickier than that if you have unescaped double
quotes in your attribute values.