[XML-SIG] Parsing XML

Sat Dec 13 10:11:32 EST 2003

> My XML files have to use encoding 'iso-8859-1', which is different
> from the default encoding 'utf-8'.

Technically, there is no default, but conforming parsers assume utf-16 until
they see there's no byte-order mark (BOM) at the beginning, and then assume
utf-8 until they see something else declared in the prolog.
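
The detection order described above can be sketched roughly as follows. This
is a simplified heuristic, assuming Python 3; the function name
sniff_xml_encoding is made up for illustration, and it does not handle UTF-32
or a UTF-16 document that lacks a BOM.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte string.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(raw):
    """Guess the encoding of XML bytes: BOM first, then the declaration."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(raw)
    if match:
        # Encoding names are ASCII and case-insensitive; normalize to lower.
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: fall back to utf-8.
    return "utf-8"
```

Note that this only reports what the document claims about itself. It says
nothing about whether the bytes really are in that encoding.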

> When I was using the package from 4DOM (pyxml.souceforge.net)
> to parse my XML files, errors occurred.

What errors, specifically? 

Are you sure your XML files are actually iso-8859-1 encoded? 

Note: it is the XML author's responsibility to ensure that the encoding
declaration in the prolog accurately reflects the actual encoding of the
document. If you had a gb2312 file and just changed the declaration to say
iso-8859-1, you didn't change the actual encoding of the document; you just
made the declaration wrong, which an XML parser is required to treat as a
fatal error.
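
One way to see the effect of a wrong declaration is to feed Expat the same
gb2312 bytes under different labels. This is only a sketch, assuming Python 3
and its standard xml.parsers.expat module; parse_error and the sample
documents are made up for illustration.

```python
import xml.parsers.expat

def parse_error(raw):
    """Return Expat's error message for raw XML bytes, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None

# Two Chinese characters, written as escapes so this source stays ASCII.
body = "<doc>\u4e2d\u6587</doc>"

# Declaration and bytes agree: both utf-8.
honest = ('<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")

# Declaration says utf-8, but the bytes are really gb2312.
mislabeled_utf8 = (
    '<?xml version="1.0" encoding="utf-8"?>' + body
).encode("gb2312")

# Declaration says iso-8859-1, but the bytes are really gb2312.
mislabeled_latin1 = (
    '<?xml version="1.0" encoding="iso-8859-1"?>' + body
).encode("gb2312")
```

Here parse_error(honest) returns None, while parse_error(mislabeled_utf8)
returns an error message because the gb2312 bytes are not valid utf-8. The
iso-8859-1 case is the treacherous one: every byte value is a legal
iso-8859-1 character, so the parser has no way to notice the mismatch and
quietly hands back the wrong characters instead of raising an error.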

> The package for parsing xml
> only supports encoding 'utf-8', right?

No, the parser that 4DOM uses (Expat) supports other encodings, as I mentioned
in my other message today. iso-8859-1 should work just fine.

If you are still trying to parse gb2312-encoded XML, you need to do more than
just replace 'gb2312' with 'iso-8859-1' in the encoding declaration. Use
Python's codecs module to wrap your gb2312 stream, decoding from gb2312 to
Unicode, at which point you can safely rewrite the declaration in the prolog
if necessary, and then wrap again, encoding from Unicode to utf-8 (or utf-16).
This is what I meant by 'transcode'. You won't need to rewrite the declaration
if you can figure out how to make Expat accept the external encoding
declaration from Python. I was hoping a PyExpat expert would suggest the
answer.
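
The transcoding steps above might look something like this. It is a minimal
sketch, assuming Python 3, and it uses bytes.decode() and str.encode() on the
whole document rather than codecs stream wrappers, purely for brevity. The
names transcode_to_utf8 and parse_text are hypothetical. The
override_encoding path relies on the documented behaviour of
xml.parsers.expat.ParserCreate(), whose encoding argument overrides the
encoding declared in the document.

```python
import re
import xml.parsers.expat

# Captures everything up to the encoding value, plus the quote character used.
_DECL_ENCODING = re.compile(
    r"""(<\?xml[^>]*?encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)

def transcode_to_utf8(raw, source_encoding="gb2312"):
    """Decode raw bytes to Unicode, fix the declaration, re-encode as utf-8."""
    text = raw.decode(source_encoding)
    # Rewrite only the first declaration so it matches the new encoding.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")

def parse_text(raw, override_encoding=None):
    """Parse XML bytes with Expat and return the concatenated character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate(override_encoding)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(raw, True)
    return "".join(chunks)

# A small gb2312 document, with the characters written as escapes.
original = (
    '<?xml version="1.0" encoding="gb2312"?><doc>\u4e2d\u6587</doc>'
).encode("gb2312")

# Option 1: transcode and rewrite the declaration.
rewritten = transcode_to_utf8(original)

# Option 2: transcode only, and tell Expat the encoding from the outside.
untouched_decl = original.decode("gb2312").encode("utf-8")
```

With option 1, parse_text(rewritten) should give back the two original
characters. With option 2, parse_text(untouched_decl, "utf-8") should do the
same even though the prolog still says gb2312, since the externally supplied
encoding takes precedence over the declaration.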

-Mike