[Expat-discuss] & symbol workaround

Brad Causey bradcausey at gmail.com
Wed Feb 4 21:40:27 CET 2009


Nick,

I completely agree. Unfortunately, I don't have control over the code that
generates these XML files.
If there isn't a better alternative, I'll have to create a duplicate of
EVERY file and parse each one at a text level to replace non-standard
characters with an escaped version. (Doing this for < is nearly impossible.)
This is something I am trying to avoid for obvious reasons. I don't like
non-standard XML any more than the next guy. (I've been through 3 different
Python XML parsers trying to resolve this.) But I'm running out of options.
Any ideas?
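
For reference, the text-level replacement could be done in memory right
before the text is handed to expat, so no duplicate files are needed. This
is only a rough sketch. It assumes the files contain no CDATA sections or
custom entities that the regex would misread, and the helper names
escape_bare_ampersands and parse_text are made up for illustration:

```python
import re
import xml.parsers.expat

# Matches an "&" that does NOT begin an entity reference (&name;) or a
# character reference (&#38; or &#x26;). Assumption: anything else is a
# stray ampersand that should have been written as &amp;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Return text with stray ampersands escaped; the input is not modified."""
    return BARE_AMP.sub("&amp;", text)


def parse_text(text):
    """Parse the cleaned-up text with expat and return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver character data in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(escape_bare_ampersands(text), True)
    return "".join(chunks)


if __name__ == "__main__":
    sample = "<a>Tom & Jerry</a>"
    print(parse_text(sample))
```

The files on disk are never touched; only the string passed to the parser
changes. This handles stray & characters only. A bare < is not covered, and
an & inside a CDATA section or comment would be rewritten as well.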



-Brad


On Wed, Feb 4, 2009 at 2:30 PM, Nick <nickmacd at xxx.com> wrote:

> & is NOT valid as a standalone character in XML and needs to be
> escaped as &amp; otherwise you are not parsing standard (and thus
> valid) XML files, but in fact parsing some other hybrid thing.
>
> Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
>
> The ampersand character (&) and the left angle bracket (<) MUST NOT
> appear in their literal form, except when used as markup delimiters,
> or within a comment, a processing instruction, or a CDATA section. If
> they are needed elsewhere, they MUST be escaped using either numeric
> character references or the strings "&amp;" and "&lt;"
> respectively. The right angle bracket (>) may be represented using the
> string "&gt;", and MUST, for compatibility, be escaped using either
> "&gt;" or a character reference when it appears in the string "]]>"
> in content, when that string is not marking the end of a CDATA
> section.
>
> So I would argue that you NEED to change the source files, in order to
> bring them into line with the standard.
>
> Nick
>
>
> On Wed, Feb 4, 2009 at 2:56 PM, Brad Causey <bradcausey at gmail.com>
> wrote:
> > I am working on a Python script that parses around 6800 small xml files.
> > My code isn't pretty, as I am just testing a PoC at this point, but I have
> > run into a problem. When the script hits the Ampersand symbol, it quits with
> > "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
> > column 41"
> >
> > I am trying to figure out a way to work around this without modifying the
> > XML files themselves as these need to be preserved in the original format.

