[XML-SIG] unicode, latin-1 and DOM...

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 29 Jun 2001 01:20:01 +0200


> I would have expected a parse error when the latin-1 characters were
> encountered, and not a silent failure to create the Text node.

Using plain SAX, I get

>>> xml.sax.parseString('<d>été</d>')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: parseString() takes at least 2 arguments (1 given)
>>> xml.sax.parseString('<d>été</d>',xml.sax.ContentHandler())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/__init__.py", line 47, in parseString
    parser.parse(inpsrc)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 96, in feed
    self._parser.Parse(data, isFinal)
UnicodeError: UTF-8 decoding error: invalid data

What appears to happen is that Expat accepts the string and passes it
to the character handler (I am not sure why it does that); the pyexpat
character handler then assumes that the data is UTF-8, and raises a
UnicodeError because it cannot decode it to Unicode. Most likely, the
DOM reader masks the UnicodeError.

xml.dom.minidom.parseString produces the same error.
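
To make the underlying problem concrete, here is a minimal sketch. It
assumes a current Python 3 interpreter with the standard-library xml.sax,
not the Python 2.2 / PyXML installation from the traceback above, so the
exception type differs: a recent Expat rejects the bytes itself instead of
handing them to the character handler. TextCollector and parse_text are
hypothetical helper names used only for this illustration. The sketch
shows that Latin-1 encoded text is not valid UTF-8, and that the same
bytes parse once the document declares its encoding.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records all character data it receives."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces; keep them all.
        self.chunks.append(content)


def parse_text(data):
    """Parse a bytes document and return its concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


# The document from the session above, encoded as Latin-1 bytes.
latin1_body = "<d>été</d>".encode("iso-8859-1")

# Without an encoding declaration, an XML parser assumes UTF-8,
# and these bytes are not a valid UTF-8 sequence.
try:
    latin1_body.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Parsing the undeclared document is expected to fail.
try:
    undeclared_result = parse_text(latin1_body)
except (xml.sax.SAXParseException, UnicodeError):
    undeclared_result = None

# With an explicit encoding declaration the same bytes are accepted.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_body
declared_result = parse_text(declared)

print(decodes_as_utf8, undeclared_result, declared_result)
```

In other words, the document in the original report needs either an
encoding declaration or UTF-8 encoded input; the parser has no other way
to know that the bytes are Latin-1.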

Regards,
Martin