[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman
v+python at g.nevcal.com
Tue Apr 28 07:25:15 CEST 2009
On approximately 4/27/2009 8:35 PM, came the following characters from
the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/27/2009 12:42 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>>> It's a private use area. It will never carry an official character
>>>>> assignment.
>>>> I know that U+F0000 - U+FFFFF is a private use area. I don't find a
>>>> definition of U+F01xx to know what the notation means. Are you picking
>>>> a particular character within the private use area, or a particular
>>>> range, or what?
>>> It's a range. The lower-case 'x' denotes a variable half-byte, ranging
>>> from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
>>> points.
>>
>> So you only need 128 code points, so there is something else unclear.
>
> (please understand that this is history now, since the PEP has stopped
> using PUA characters).
Yes. Having finally found the latest PEP (at least I hope the one at
python.org is the latest; it has quit using PUA characters anyway), I can
confirm that this is history. But the same issue applies to the range of
half-surrogates.
> No. You seem to assume that all bytes < 128 decode successfully always.
> I believe this assumption is wrong, in general:
>
> py> "\x1b$B' \x1b(B".decode("iso-2022-jp")  # 2.x syntax
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence
>
> All bytes are below 128, yet it fails to decode.
Indeed, that was the missing piece. I'd forgotten about the encodings
that use escape sequences, as opposed to UTF-8 and the DBCS encodings. I
don't think those encodings are permitted by POSIX file systems, but I
suppose they could sneak in via environment variable values and the like.
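For anyone trying this on 3.x, the quoted session can be restated roughly as
follows. This is only a sketch: the variable names are mine, and the bytes
literal replaces the 2.x str literal.

```python
# Python 3 restatement of the quoted 2.x example (names are illustrative).
data = b"\x1b$B' \x1b(B"

# Every byte in the sequence is below 128 ...
all_below_128 = all(byte < 128 for byte in data)

# ... yet a strict decode with the escape-sequence-based codec still fails.
try:
    data.decode("iso-2022-jp")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(all_below_128, decode_failed)
```

So "all bytes < 128" does not imply "decodable" for a stateful encoding.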
The switch from PUA to half-surrogates does not resolve the issues with
the encoding not being a 1-to-1 mapping, though. The very fact that you
think you can get away with use of lone surrogates means that other
people might, accidentally or intentionally, also use lone surrogates
for some other purpose. Even in file names.
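To make the concern concrete, here is a small sketch. It assumes the
"surrogateescape" error handler as it exists in released Python 3 versions,
which is not necessarily what the draft under discussion specifies; the
sample bytes and names are made up for illustration.

```python
# Assumption: Python 3's "surrogateescape" handler, not the draft PEP text.
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
round_trip = name.encode("utf-8", "surrogateescape")
print(repr(name))         # 'caf\udce9'
print(round_trip == raw)  # True

# A string that carries U+DCE9 for some unrelated reason looks exactly like
# an escaped byte, so encoding it silently produces the raw byte 0xE9.
foreign = "caf\udce9"
foreign_bytes = foreign.encode("utf-8", "surrogateescape")
print(foreign_bytes)

# A lone surrogate outside U+DC80..U+DCFF is not an escape and cannot be
# encoded at all.
try:
    "caf\ud800".encode("utf-8", "surrogateescape")
    other_failed = False
except UnicodeEncodeError:
    other_failed = True
print(other_failed)
```

The round trip from bytes works, but a string cannot tell you whether a lone
surrogate in it came from an escaped byte or from somewhere else.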
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking