[Python-bugs-list] [ python-Bugs-817156 ] invalid \U escape gives 0-length unistr

Mon Oct 6 01:08:56 EDT 2003

Bugs item #817156, was opened at 2003-10-03 13:30
Message generated for change (Comment added) made by jhylton
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=817156&group_id=5470

Category: Unicode
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Submitted By: Jeff Epler (jepler)
Assigned to: M.-A. Lemburg (lemburg)
Summary: invalid \U escape gives 0-length unistr

Initial Comment:
>>> u'\Ufffffffe' # CORRECT
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
>>> u'\Uffffffff'  # WRONG
u''
>>> len(_)
0

Observed on 2.2.2 (Red Hat wide-unicode build, sys.maxunicode == 1114111) and 2.3.1 (custom build, sys.maxunicode == 65535).
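
The two sys.maxunicode values quoted above correspond to wide and narrow Unicode builds. A minimal sketch for telling them apart follows; the helper name build_kind is hypothetical, and a current Python 3 interpreter is assumed (there, sys.maxunicode is always 1114111, so the narrow case is shown by passing the value explicitly).

```python
import sys


def build_kind(maxunicode: int = sys.maxunicode) -> str:
    """Classify a build by the largest code point it reports."""
    # Anything above 0xFFFF means code points beyond the BMP fit in one unit.
    return "wide" if maxunicode > 0xFFFF else "narrow"


# The running interpreter, then the two values quoted in the report.
print(sys.maxunicode, build_kind())
print(build_kind(65535), build_kind(1114111))
```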

I think the problem is due to this logic in
unicodeobject.c:PyUnicode_DecodeUnicodeEscape():

            if (chr == 0xffffffff)
                /* _decoding_error will have already written into the
                   target buffer. */
                break;

Perhaps it should be (chr == 0xffffffff && PyErr_Occurred()).
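
The suspected collision can be modelled outside the interpreter. The sketch below is a hypothetical Python model, not the CPython source: it assumes 0xffffffff is meant as an "error already handled" marker (as the quoted comment suggests), and the names ERROR_MARKER, accumulate_hex, breaks_original and breaks_proposed are invented for illustration. The error_pending flag stands in for PyErr_Occurred().

```python
# Hypothetical model of the quoted check; not the CPython implementation.
ERROR_MARKER = 0xFFFFFFFF  # the value the quoted C code compares chr against


def accumulate_hex(digits: str) -> int:
    """Fold hex digits into a 32-bit value, as an 8-digit escape parser would."""
    value = 0
    for d in digits:
        value = ((value << 4) | int(d, 16)) & 0xFFFFFFFF
    return value


def breaks_original(chr_value: int) -> bool:
    # Original condition: chr == 0xffffffff
    return chr_value == ERROR_MARKER


def breaks_proposed(chr_value: int, error_pending: bool) -> bool:
    # Proposed condition: chr == 0xffffffff && PyErr_Occurred()
    return chr_value == ERROR_MARKER and error_pending


for digits in ("fffffffe", "ffffffff"):
    value = accumulate_hex(digits)
    # With no error pending, only the original check treats ffffffff as an error exit.
    print(digits, breaks_original(value), breaks_proposed(value, False))
```

Under these assumptions, the digits ffffffff parse to exactly the marker value, so the original condition leaves the loop without any error having been raised, while the proposed condition does not.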

I tried this change locally, and it fixes the problem:

>>> u'\Uffffffff'
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
>>> u'\Ufffffffe'
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

The change does not alter the outcome of the test suite.

Patch against 2.3.1 attached.
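
The sessions above could be turned into a small regression check. This is a minimal sketch, assuming a current Python 3 interpreter, where the escapes are fed to the unicode_escape codec as bytes rather than written as u'...' literals; the helper names is_rejected and check_invalid_escapes are hypothetical and are not part of the attached patch.

```python
# Sketch for Python 3; the original report used u'...' literals on 2.2/2.3.
INVALID_ESCAPES = (b"\\Ufffffffe", b"\\Uffffffff")


def is_rejected(raw: bytes) -> bool:
    """Return True if the unicode_escape codec refuses to decode raw."""
    try:
        raw.decode("unicode_escape")
    except UnicodeDecodeError:
        return True
    return False


def check_invalid_escapes() -> dict:
    """Map each out-of-range escape to whether the codec rejected it."""
    return {raw: is_rejected(raw) for raw in INVALID_ESCAPES}


if __name__ == "__main__":
    for raw, rejected in check_invalid_escapes().items():
        print(raw, "rejected" if rejected else "accepted (bug)")
```

Both escapes should be reported as rejected; an "accepted" result for \Uffffffff would correspond to the empty-string behaviour described in this report.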

----------------------------------------------------------------------

>Comment By: Jeremy Hylton (jhylton)
Date: 2003-10-06 05:08

Message:
Fixed in rev. 2.199 of unicodeobject.c.

----------------------------------------------------------------------
