[XML-SIG] unicode, latin-1 and DOM...
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Fri, 29 Jun 2001 01:20:01 +0200
> I would have expected a parse error when the latin-1 characters were
> encountered, and not a silent failure to create the Text node.
Using plain SAX, I get
>>> xml.sax.parseString('<d>été</d>')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: parseString() takes at least 2 arguments (1 given)
>>> xml.sax.parseString('<d>été</d>',xml.sax.ContentHandler())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/__init__.py", line 47, in parseString
    parser.parse(inpsrc)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 96, in feed
    self._parser.Parse(data, isFinal)
UnicodeError: UTF-8 decoding error: invalid data
What appears to happen is that Expat accepts the string and passes it
to the character handler (not sure why it does that); the pyexpat
character handler then assumes that it is UTF-8, and raises a
UnicodeError because it cannot decode it to Unicode. Most likely, the
DOM reader masks the UnicodeError.
xml.dom.minidom.parseString produces the same error.
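
For comparison, here is a minimal sketch of the same experiment written
for a current Python 3 interpreter rather than the Python 2.2 / PyXML
setup shown above. TextCollector and parse_text are hypothetical helper
names introduced only for this illustration. It assumes that the Expat
bundled with recent Python versions rejects the undeclared latin-1 bytes
itself, so the failure surfaces as a SAXParseException instead of a
UnicodeError, and that declaring the encoding in the XML declaration
lets the same bytes parse.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver the text in several pieces.
        self.chunks.append(content)


def parse_text(data):
    """Parse a bytes document and return its concatenated text."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


# The document from the session above, as raw latin-1 bytes.
latin1_body = b"<d>\xe9t\xe9</d>"

# With an explicit encoding declaration the bytes can be decoded.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_body
print(repr(parse_text(declared)))

# Without a declaration the parser assumes UTF-8, and these bytes
# are not valid UTF-8.
try:
    parse_text(latin1_body)
except xml.sax.SAXParseException as exc:
    print("parse error:", exc)
```

The point of the sketch is that without an encoding declaration the
input is treated as UTF-8, so latin-1 documents need to state their
encoding (or be converted to UTF-8) before they are handed to the parser.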
Regards,
Martin