[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 05:35:59 CEST 2009

Glenn Linderman wrote:
> On approximately 4/27/2009 12:42 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>>> It's a private use area. It will never carry an official character
>>>> assignment.
>>>
>>> I know that U+F0000 - U+FFFFF is a private use area.  I don't find a
>>> definition of U+F01xx to know what the notation means.  Are you picking
>>> a particular character within the private use area, or a particular
>>> range, or what?
>>
>> It's a range. The lower-case 'x' denotes a variable half-byte, ranging
>> from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
>> points.
> 
> 
> So you only need 128 code points, so there is something else unclear.

(Please understand that this is history now, since the PEP has stopped
using PUA characters.)

No. You seem to assume that all bytes < 128 always decode successfully.
I believe this assumption is wrong in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence

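All bytes are below 128, yet it fails to decode: the escape sequence
\x1b$B switches the codec to a two-byte character set, and the two
bytes that follow (positions 3-4) do not form a valid character in it.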
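
For readers on Python 3, here is a minimal sketch of the same check. It
assumes the iso-2022-jp codec behaves as in the 2.x session above. The
variable names (data, all_below_128, decode_failed) are illustrative
only and do not come from the PEP.

```python
# Same byte sequence as in the 2.x session, written as a bytes literal.
data = b"\x1b$B' \x1b(B"

# Every byte in the input is below 128 ...
all_below_128 = all(byte < 128 for byte in data)

# ... yet the codec is expected to reject it, because the escape
# sequence ESC $ B selects a two-byte character set.
try:
    data.decode("iso-2022-jp")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

print(all_below_128, decode_failed)
```

So a scheme that reserves escape code points only for bytes >= 128
would not cover input like this.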

Regards,
Martin