Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))

M.-A. Lemburg mal@lemburg.com
Mon, 22 Jan 2001 15:27:38 +0100


Christian Tismer wrote:
> 
> Christian Tismer wrote:
> >
> > Maybe I can help.
> 
> ...
> 
> ...
> > I will now try to create a minimized script and XML data that
> > produces the above again.
> >
> > back in an hour - chris
> 
> Here we go.
> The following session produces the mentioned UTF8 error:
> 
> >>> txt = "<master desc='blah\325weird' />"
> >>> def startelt(name, dic):
> ...     print name, dic
> ...
> >>> p=expat.ParserCreate()
> >>> p.StartElementHandler = startelt
> >>> p.Parse(txt)
> Traceback (innermost last):
>   File "<interactive input>", line 1, in ?
> UnicodeError: UTF-8 decoding error: invalid data
> 
> Behavior depends on the character code.
> From code 128 (0200) to 191 (0277) the parser gives a
> not well-formed exception, as it should.
> 
> The codes from 192 to 236, 238-243 produce
> "UTF-8 decoding error: invalid data",
> the rest gives "not well-formed".
> 
> I would like to know if this happens with your (Tim) modified
> version as well. I'm using plain vanilla BeOpen Python 2.0 .

This has nothing to do with Python. UTF-8 marks the codes
128-191 as illegal prefix bytes (they are only valid as
continuation bytes). See Objects/unicodeobject.c:

static 
char utf8_code_length[256] = {
    /* Map UTF-8 encoded prefix byte to sequence length.  zero means
       illegal prefix.  see RFC 2279 for details */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
};
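
To make the table easier to read, here is a rough sketch of the same
mapping written as ranges. It uses Python 3 syntax, and the helper name
utf8_prefix_length is made up for illustration; it is not part of
unicodeobject.c. It also shows why \325 from the example fails: the
table treats it as the start of a two-byte sequence, but the following
"w" is not a continuation byte.

```python
def utf8_prefix_length(byte):
    """Return the sequence length the table assigns to a prefix byte.

    0 means "illegal prefix", mirroring utf8_code_length[] above.
    """
    if byte < 0x80:
        return 1      # plain 7-bit character
    if byte < 0xC0:
        return 0      # 128-191: continuation byte, illegal as a prefix
    if byte < 0xE0:
        return 2
    if byte < 0xF0:
        return 3
    if byte < 0xF8:
        return 4
    if byte < 0xFC:
        return 5
    if byte < 0xFE:
        return 6
    return 0          # 254 and 255 are never valid


# The byte from the failing session: octal 325 == decimal 213.
lead = 0o325
length = utf8_prefix_length(lead)

# A continuation byte must have the bit pattern 10xxxxxx.
next_byte = ord("w")
is_continuation = (next_byte & 0xC0) == 0x80

print(lead, length, is_continuation)
```

So the prefix itself is accepted, and the error only shows up when the
byte after it is checked.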

Perhaps the parser should catch the UnicodeError and
raise a "not well-formed" exception instead?
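
Roughly what I have in mind, as a Python-level sketch only. The real
change would have to go into the C code of the expat wrapper; the names
NotWellFormedError and decode_attribute are hypothetical, and the
syntax is Python 3.

```python
class NotWellFormedError(Exception):
    """Hypothetical stand-in for the parser's "not well-formed" error."""


def decode_attribute(raw):
    """Decode raw UTF-8 bytes, reporting bad data as a parse error."""
    try:
        return raw.decode("utf-8")
    except UnicodeError as exc:
        # Translate the codec failure into the error an XML user expects.
        raise NotWellFormedError("not well-formed (invalid UTF-8): %s" % exc)


try:
    decode_attribute(b"blah\325weird")
except NotWellFormedError as err:
    print("parser error:", err)
```

That way the caller sees the same kind of error for every illegal byte,
regardless of which layer noticed it first.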
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/