Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))
uche.ogbuji@fourthought.com
Tue, 23 Jan 2001 10:28:18 -0700
> > This has nothing to do with Python. UTF-8 marks the codes
> > from 128-191 as illegal prefix.
> [...]
> > Perhaps the parser should catch the UnicodeError and
> > instead return a not-wellformed exception ?!
>
> Right on both accounts. If no encoding is specified, and if the
> document appears not to be UTF-16 in any endianness, an XML processor
> shall assume it is UTF-8. As Marc-Andre explains, your document is not
> proper UTF-8, hence the error.
>
> The confusing thing is that expat itself does not care about it not
> being UTF-8; that is only detected when the callback is invoked in
> pyexpat, and therefore conversion to a Unicode object is attempted.
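
To make the quoted point concrete, here is a minimal sketch of the failure, assuming a Python 3 interpreter. The helper name decodes_as_utf8 and the sample documents are made up for illustration and are not taken from test_sax.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical document with no encoding declaration and a stray
# Latin-1 byte (0xA9) in the character data.
LATIN1_DOC = b"<doc>\xa9</doc>"

# The same character, properly encoded as UTF-8 (two bytes: 0xC2 0xA9).
UTF8_DOC = "<doc>\u00a9</doc>".encode("utf-8")

# Bytes 128-191 are continuation bytes, so none of them may start a sequence.
NO_VALID_LEAD = all(not decodes_as_utf8(bytes([b])) for b in range(0x80, 0xC0))

print(decodes_as_utf8(LATIN1_DOC), decodes_as_utf8(UTF8_DOC), NO_VALID_LEAD)
```

This only exercises the decoder, not the parser, so it says nothing about where in the pipeline the error ought to be raised.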
Pyexpat violates the XML spec here. XML parsers are not allowed to "recover"
from well-formedness errors. And I would classify blithely reporting the
character data as "recovery".
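
One rough way to check how a given installation behaves is to feed the parser such a document directly. This is a sketch under assumptions: it uses the standard xml.parsers.expat module on Python 3, the helper name well_formedness_error is hypothetical, and the result depends on the expat version the interpreter was built against.

```python
from xml.parsers import expat


def well_formedness_error(data: bytes):
    """Return the parser's error message for data, or None if it parses cleanly."""
    parser = expat.ParserCreate()  # no encoding override; the document declares none
    try:
        parser.Parse(data, True)   # True marks this as the final chunk
    except expat.ExpatError as exc:
        return str(exc)
    return None


# Hypothetical inputs: a stray Latin-1 byte versus the same character in UTF-8.
BAD = b"<doc>\xa9</doc>"
GOOD = "<doc>\u00a9</doc>".encode("utf-8")

print(well_formedness_error(BAD))
print(well_formedness_error(GOOD))
```

A conforming processor should report an error for the first document instead of handing the character data to the application, and should accept the second.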
However, I'm amazed that this wouldn't have come up before, considering the
pedigree of expat.
I'll poke around, and raise a bug on the expat site if need be.
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python