Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))

Martin von Loewis loewis@informatik.hu-berlin.de
Mon, 22 Jan 2001 15:46:39 +0100 (MET)


> This has nothing to do with Python. UTF-8 marks the codes 
> from 128-191 as illegal prefix. 
[...]
> Perhaps the parser should catch the UnicodeError and
> instead return a not-wellformed exception ?!

Right on both counts. If no encoding is specified, and if the
document appears not to be UTF-16 in any endianness, an XML processor
shall assume it is UTF-8. As Marc-Andre explains, your document is not
proper UTF-8, hence the error.
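
To illustrate the point about illegal prefixes, here is a minimal sketch
in present-day Python 3 (an assumption; it is not the code under
discussion in this thread). The helper name is_valid_utf8 is made up for
the example. Bytes in the range 128-191 (0x80-0xBF) are continuation
bytes in UTF-8, so one of them appearing where a new character should
start makes the data invalid.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "a-umlaut" encoded as UTF-8 is the two-byte sequence C3 A4.
utf8_doc = b"<doc>\xc3\xa4</doc>"

# A lone byte 0xA0 (a no-break space in Latin-1) falls in 128-191 and
# cannot begin a UTF-8 sequence.
latin1_doc = b"<doc>\xa0</doc>"

print(is_valid_utf8(utf8_doc))
print(is_valid_utf8(latin1_doc))
```

A document like the second one would need an explicit encoding
declaration (e.g. ISO-8859-1) to be acceptable to an XML processor.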

The confusing thing is that expat itself does not care whether the
input is valid UTF-8; the problem is only detected when pyexpat invokes
the callback and attempts to convert the data to a Unicode object.

The right solution probably would be to change expat so that it
determines correctness of the encoding for each string it gets as part
of the wellformedness analysis, and produces illformedness exceptions
when an encoding error occurs. Patches are welcome, although they
probably should go to sourceforge.net/projects/expat.
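
Until something like that exists in expat, the idea of turning the
encoding failure into a wellformedness error can be sketched at the
Python level. This is only an illustration in present-day Python 3, not
a patch for expat or pyexpat: NotWellFormedError and
decode_or_not_wellformed are hypothetical names invented here, and a
real parser would report the error through its own exception type.

```python
class NotWellFormedError(Exception):
    """Hypothetical stand-in for a parser's not-well-formed error."""


def decode_or_not_wellformed(data: bytes, encoding: str = "utf-8") -> str:
    """Decode data, reporting encoding errors as wellformedness errors.

    Hypothetical helper: it only shows the exception translation and
    does no XML parsing of its own.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        raise NotWellFormedError(
            f"invalid {encoding} byte sequence at offset {exc.start}"
        ) from exc


try:
    decode_or_not_wellformed(b"<a>\xa0</a>")
except NotWellFormedError as err:
    print("not well-formed:", err)
```

The caller then sees a single kind of error for a bad document,
regardless of whether the problem is in the markup or in the encoding.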

Regards,
Martin