Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))
M.-A. Lemburg
mal@lemburg.com
Mon, 22 Jan 2001 15:27:38 +0100
Christian Tismer wrote:
>
> Christian Tismer wrote:
> >
> > Maybe I can help.
>
> ...
>
> ...
> > I will now try to create a minimized script and XML data that
> > produces the above again.
> >
> > back in an hour - chris
>
> Here we go.
> The following session produces the mentioned UTF8 error:
>
> >>> txt = "<master desc='blah\325weird' />"
> >>> def startelt(name, dic):
> ...     print name, dic
> ...
> >>> p=expat.ParserCreate()
> >>> p.StartElementHandler = startelt
> >>> p.Parse(txt)
> Traceback (innermost last):
>   File "<interactive input>", line 1, in ?
> UnicodeError: UTF-8 decoding error: invalid data
>
> Behavior depends on the ASCII code.
> From code 128 (0200) to 191 (0277) the parser gives a
> not well-formed exception, as it should.
>
> The codes from 192 to 236, 238-243 produce
> "UTF-8 decoding error: invalid data",
> the rest gives "not well-formed".
>
> I would like to know if this happens with your (Tim) modified
> version as well. I'm using plain vanilla BeOpen Python 2.0 .
This has nothing to do with Python. UTF-8 marks the codes
from 128-191 as illegal prefix bytes. See Objects/unicodeobject.c:
static
char utf8_code_length[256] = {
    /* Map UTF-8 encoded prefix byte to sequence length.  zero means
       illegal prefix.  see RFC 2279 for details */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
};
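For readers who find the table hard to scan, here is a minimal Python sketch of the same mapping, written as range checks. The function name utf8_sequence_length is made up for illustration and is not part of Python or expat. It follows the RFC 2279-era table quoted above, not the stricter rules of later UTF-8 revisions.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length implied by a UTF-8 lead byte.

    Mirrors the utf8_code_length table quoted above (RFC 2279 era).
    A result of 0 means the byte is an illegal prefix.
    The function name is hypothetical, chosen for this sketch only.
    """
    if not 0 <= lead_byte <= 255:
        raise ValueError("expected a byte value in range(256)")
    if lead_byte < 0x80:   # 0-127: plain ASCII, one byte
        return 1
    if lead_byte < 0xC0:   # 128-191: continuation bytes, illegal as prefix
        return 0
    if lead_byte < 0xE0:   # 192-223: two-byte sequence
        return 2
    if lead_byte < 0xF0:   # 224-239: three-byte sequence
        return 3
    if lead_byte < 0xF8:   # 240-247: four-byte sequence
        return 4
    if lead_byte < 0xFC:   # 248-251: five-byte sequence
        return 5
    if lead_byte < 0xFE:   # 252-253: six-byte sequence
        return 6
    return 0               # 254-255: illegal prefix
```

With this mapping, the \325 byte (decimal 213) from the session above counts as the start of a two-byte sequence, so the following "w" is not a valid continuation byte and the decoder reports invalid data.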
Perhaps the parser should catch the UnicodeError and
raise a not well-formed exception instead?
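A rough sketch of that idea, done on the Python side rather than inside the parser: NotWellFormedError, parse_reporting_bad_utf8 and fake_parse are hypothetical names invented for this example, and fake_parse only stands in for a call such as p.Parse by decoding the input as UTF-8. It is not how pyexpat is implemented.

```python
class NotWellFormedError(Exception):
    """Hypothetical stand-in for the parser's 'not well-formed' error."""


def parse_reporting_bad_utf8(parse, data):
    """Call parse(data), converting any UnicodeError it raises.

    'parse' is any callable taking the document bytes; in the session
    above it would be something like p.Parse (an assumption here).
    """
    try:
        return parse(data)
    except UnicodeError as exc:
        # Report invalid UTF-8 the same way as other malformed input.
        raise NotWellFormedError(
            "not well-formed (invalid UTF-8): %s" % exc
        ) from exc


def fake_parse(data):
    # Stand-in for the real parser: the decode step is where a
    # UnicodeError would surface for bytes that are not valid UTF-8.
    return data.decode("utf-8")


if __name__ == "__main__":
    txt = b"<master desc='blah\xd5weird' />"
    try:
        parse_reporting_bad_utf8(fake_parse, txt)
    except NotWellFormedError as err:
        print(err)
```

The caller then only has to handle one kind of error for malformed documents, whether the problem is in the markup or in the byte encoding.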
--
Marc-Andre Lemburg
______________________________________________________________________
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/