Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))

Christian Tismer tismer@tismer.com
Tue, 23 Jan 2001 18:35:08 +0100


uche.ogbuji@fourthought.com wrote:
> 
> > > This has nothing to do with Python. UTF-8 marks the codes
> > > from 128-191 as illegal prefixes.
> > [...]
> > > Perhaps the parser should catch the UnicodeError and
> > > instead return a not-wellformed exception ?!
> >
> > Right on both accounts. If no encoding is specified, and if the
> > document appears not to be UTF-16 in any endianness, an XML processor
> > shall assume it is UTF-8. As Marc-Andre explains, your document is not
> > proper UTF-8, hence the error.
> >
> > The confusing thing is that expat itself does not care about it not
> > being UTF-8; that is only detected when the callback is invoked in
> > pyexpat, and therefore conversion to a Unicode object is attempted.
> 
> Pyexpat violates the XML spec here.  XML parsers are not allowed to "recover"
> from well-formedness errors.  And I would classify blithely reporting the
> character data as "recovery".
> 
> However, I'm amazed that this wouldn't have come up before, considering the
> pedigree of expat.

Well, I had to write a preprocessor which turns some "xml-like"
but not well-formed stuff into something usable. This was
about 100 MB of data in bulk, partially hand-written, partially
machine-generated, but not really well-formed. Some
special characters appeared very late in the data set, raising
an error in Python 2.0, but not in 1.5.2, so at first I took
it for an error in the parser, not in the data. :-)
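
To make the UTF-8 point from the quoted discussion concrete, here is a
minimal sketch of the kind of check a preprocessor could run before the
data reaches the parser. It is written for a current Python 3, not for
2.0 or 1.5.2, and the names is_valid_utf8, bad_lead_bytes and sample are
made up for illustration; none of this is the preprocessor I actually
wrote.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bytes 128..191 are continuation bytes in UTF-8, so none of them
# may start a sequence. Collect those that fail when standing alone.
bad_lead_bytes = [b for b in range(128, 192) if not is_valid_utf8(bytes([b]))]

# Hypothetical sample: Latin-1 encoded text in otherwise XML-like data,
# the sort of "special character" that can hide deep in a large data set.
sample = "<a>caf\u00e9</a>".encode("latin-1")

if not is_valid_utf8(sample):
    print("sample is not proper UTF-8")
```

With no encoding declaration, such data would have to be taken as UTF-8,
so a check like this would flag it before the parser ever sees it.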

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com