[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson
cs at zip.com.au
Wed Apr 29 05:27:40 CEST 2009
On 28Apr2009 13:37, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.
Is this a Windows example, or (now I think on it) an equivalent POSIX example
of using the PEP where the locale encoding is UTF-16?
In either case, I would say one could make an argument for being stricter
in reading in OS-native sequences. Granted, NTFS doesn't prevent
half-surrogates in filenames, and likewise POSIX won't because to
the OS they're just bytes. So on decoding, require well-formed data. When
you hit ill-formed data, treat the nasty half-surrogate as a PAIR of
bytes to be escaped in the resulting decode.
Ambiguity avoided.
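To make the UTF-8 flavour of this concrete, here is a minimal sketch, assuming
a UTF-8 locale and using the "surrogateescape" error handler that Python 3
provides for this purpose. The two byte strings are made-up file names, not
taken from any real directory. It only illustrates Martin's case B: a name
that already holds a UTF-8-style encoded surrogate is rejected by the strict
decoder and escaped byte-for-byte, so it cannot collide with a name holding
the lone undecodable byte.

```python
# Two distinct on-disk names, as raw bytes (hypothetical examples).
lone_byte = b"\x80"                # not valid UTF-8 on its own
utf8_surrogate = b"\xed\xb2\x80"   # U+DC80 spelled out as ill-formed UTF-8

# Decode the way a UTF-8b style decoder would: undecodable bytes become
# lone surrogates in the range U+DC80..U+DCFF.
name_a = lone_byte.decode("utf-8", "surrogateescape")
name_b = utf8_surrogate.decode("utf-8", "surrogateescape")

print(ascii(name_a))   # '\udc80'
print(ascii(name_b))   # '\udced\udcb2\udc80'

# Different on disk, different in memory.
assert name_a != name_b

# Encoding with the same handler restores the original bytes.
assert name_a.encode("utf-8", "surrogateescape") == lone_byte
assert name_b.encode("utf-8", "surrogateescape") == utf8_surrogate
```

The UTF-16 case discussed above is not covered by this sketch.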
I'm more concerned with your (yours? someone else's?) mention of shift
characters. I'm unfamiliar with these encodings: to translate such a
thing into a Latin example, is it the case that there are schemes with
valid encodings that look like:
[SHIFT] a b c
which would produce "ABC" in Unicode, which is ambiguous with:
A B C
which would also produce "ABC"?
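One stateful encoding where this kind of thing can be observed is ISO-2022-JP,
which switches character sets with escape sequences. The sketch below is only
an illustration using Python's "iso2022_jp" codec, and it assumes that codec
accepts a redundant "switch to ASCII" escape (ESC ( B) at the start of the
data. It is not the [SHIFT] scheme above, just the nearest real analogue I can
show: two different byte strings that decode to the same text.

```python
# Plain ASCII bytes, and the same bytes preceded by a redundant
# "designate ASCII" escape sequence (ESC ( B).
plain = b"ABC"
with_escape = b"\x1b(BABC"

text_plain = plain.decode("iso2022_jp")
text_escaped = with_escape.decode("iso2022_jp")

print(repr(text_plain), repr(text_escaped))

# Two distinct byte strings, one decoded string.
assert plain != with_escape
assert text_plain == text_escaped == "ABC"

# Re-encoding yields only one of them, so the other form is lost.
assert text_escaped.encode("iso2022_jp") == plain
```

If file names were stored in such an encoding, a plain decode/encode round trip
could not tell the two names apart.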
Cheers,
--
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
Helicopters are considerably more expensive [than fixed wing aircraft],
which is only right because they don't actually fly, but just beat
the air into submission. - Paul Tomblin