Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))

uche.ogbuji@fourthought.com
Tue, 23 Jan 2001 10:28:18 -0700


> > This has nothing to do with Python. UTF-8 marks the codes 
> > from 128-191 as illegal prefix. 
> [...]
> > Perhaps the parser should catch the UnicodeError and
> > instead return a not-wellformed exception ?!
> 
> Right on both accounts. If no encoding is specified, and if the
> document appears not to be UTF-16 in any endianness, an XML processor
> shall assume it is UTF-8. As Marc-Andre explains, your document is not
> proper UTF-8, hence the error.
> 
> The confusing thing is that expat itself does not care about it not
> being UTF-8; that is only detected when the callback is invoked in
> pyexpat, and therefore conversion to a Unicode object is attempted.

Pyexpat violates the XML spec here.  XML parsers are not allowed to "recover" 
from well-formedness errors.  And I would classify blithely reporting the 
character data as "recovery".
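
To make Marc-Andre's suggestion concrete, here is a rough sketch of turning the
UnicodeError into a well-formedness error.  It assumes the character data
arrives as raw bytes that are supposed to be UTF-8.  NotWellFormedError and
decode_character_data are names made up for illustration; they are not part of
pyexpat or expat.

```python
class NotWellFormedError(Exception):
    """Hypothetical exception; not part of pyexpat or any real API."""


def decode_character_data(raw: bytes) -> str:
    """Decode bytes handed over by the parser, assuming they are UTF-8.

    A UnicodeError is re-raised as a well-formedness error instead of
    surfacing from the callback as an encoding problem.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeError as exc:
        raise NotWellFormedError(
            "character data is not valid UTF-8: %s" % exc
        ) from exc


if __name__ == "__main__":
    # 0xA9 lies in the range 128-191, so it cannot start a UTF-8 sequence.
    try:
        decode_character_data(b"\xa9 2001")
    except NotWellFormedError as err:
        print("not well-formed:", err)
```

Whether the check belongs in pyexpat's callback layer or in expat itself is
exactly the open question; the sketch only shows the error translation.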

However, I'm amazed that this wouldn't have come up before, considering the 
pedigree of expat.

I'll poke around, and raise a bug on the expat site if need be.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python