Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))
Christian Tismer
tismer@tismer.com
Tue, 23 Jan 2001 18:35:08 +0100
uche.ogbuji@fourthought.com wrote:
>
> > > This has nothing to do with Python. UTF-8 marks the codes
> > > from 128-191 as illegal prefixes.
> > [...]
> > > Perhaps the parser should catch the UnicodeError and
> > > instead return a not-wellformed exception ?!
> >
> > Right on both accounts. If no encoding is specified, and if the
> > document appears not to be UTF-16 in any endianness, an XML processor
> > shall assume it is UTF-8. As Marc-Andre explains, your document is not
> > proper UTF-8, hence the error.
> >
> > The confusing thing is that expat itself does not care about it not
> > being UTF-8; that is only detected when the callback is invoked in
> > pyexpat, and therefore conversion to a Unicode object is attempted.
>
> Pyexpat violates the XML spec here. XML parsers are not allowed to "recover"
> from well-formedness errors. And I would classify blithely reporting the
> character data as "recovery".
>
> However, I'm amazed that this wouldn't have come up before, considering the
> pedigree of expat.
Well, I had to write a preprocessor which turns some "xml-like"
but not well-formed stuff into something usable. This was
about 100 MB of data, partially hand-written, partially
machine-generated, but not really well-formed. Some
special characters appeared very late in the data set, raising
an error in Python 2.0, but not in 1.5.2, so at first I took
it for an error in the parser, not in the data. :-)
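
To make the point about bytes 128-191 concrete, here is a minimal sketch
of how such data could be checked before it reaches the parser. It is
written for present-day Python 3, not the 2.0 discussed in this thread,
and the names check_utf8 and NotWellFormedError are hypothetical, made up
for illustration; they are not part of pyexpat or expat.

```python
class NotWellFormedError(ValueError):
    """Hypothetical exception; not part of pyexpat or any XML library."""


def check_utf8(data: bytes) -> str:
    """Decode data as UTF-8, or report where it stops being valid."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        raise NotWellFormedError(
            "invalid UTF-8 byte 0x%02x at offset %d" % (data[exc.start], exc.start)
        ) from exc


# Properly encoded: U+00E9 becomes the two bytes 0xC3 0xA9.
good = "<doc>caf\u00e9</doc>".encode("utf-8")

# 0xA0 (160) lies in 128-191, so it is only legal as a continuation byte
# and may not start a character.
bad = b"<doc>caf\xa0</doc>"

print(check_utf8(good))
try:
    check_utf8(bad)
except NotWellFormedError as err:
    print("rejected:", err)
```

Reporting the byte offset is the useful part for a large, partly
hand-written data set: it shows where the first offending character sits
instead of only saying that the input is broken.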
ciao - chris
--
Christian Tismer :^) <mailto:tismer@tismer.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com