[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Mon Apr 27 08:39:41 CEST 2009

On approximately 4/25/2009 5:35 AM, came the following characters from 
the keyboard of Martin v. Löwis:
>> Because the encoding is not reliably reversible.
> 
> Why do you say that? The encoding is completely reversible
> (unless we disagree on what "reversible" means).
> 
>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
>> reversible encoding.
> 
> Then please provide an example for a setup where it is not reversible.
> 
> Regards,
> Martin

It is reversible if you know that the str was produced by decoding, and 
you apply the encoding to it.  But if you don't know whether a str was 
produced by decoding, then applying the reverse transform can convert a 
str that was never decoded, but happens to match a decoded str, into a 
byte form that it could have taken, but never did.

The problem is that there is no guarantee that the str interface 
provides only strictly conforming Unicode.  Decoding bytes to 
non-strictly conforming Unicode can therefore result in a data pun: a 
non-strictly conforming str coming from the str interface is 
indistinguishable from one produced by decoding bytes coming from the 
bytes interface.

Any particular program that consistently uses one or the other (bytes 
vs str) API under the covers might never be affected by such a data 
pun, but programs that use both types of interface could potentially 
see one.
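
To make the pun concrete, here is a minimal sketch.  It assumes an error 
handler that behaves like the "surrogateescape" handler in Python 3, 
which may differ in detail from the draft under discussion; the file 
names are made up for illustration.

```python
# Assumption: the PEP's escaping behaves like Python 3's "surrogateescape".
HANDLER = "surrogateescape"

# Case 1: an undecodable byte arrives through a bytes interface.
raw = b"caf\xff"
decoded = raw.decode("utf-8", HANDLER)   # 0xFF becomes lone surrogate U+DCFF
assert decoded.encode("utf-8", HANDLER) == raw   # reversible, as claimed

# Case 2: a str containing the same lone surrogate arrives through a
# str interface (hypothetical value; it was never decoded from bytes).
native = "caf\udcff"

# The two strings cannot be told apart ...
assert native == decoded

# ... so the reverse transform turns the native str into bytes that it
# never came from.
punned = native.encode("utf-8", HANDLER)
assert punned == b"caf\xff"
```

A program that only ever sees case 1 is fine; the pun appears only when 
values from both interfaces meet in the same code path.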

If your PEP depends on consistent use of one or the other type of 
interface, you should say so, and if the platform only provides that 
type of interface, maybe all is well.  Both types of interface are 
available on Windows; perhaps POSIX only provides native bytes 
interfaces, and if the PEP is the only way to provide str interfaces, 
then perhaps consistent use is required.

There are still issues regarding how Windows and POSIX programs that are 
sharing cross-mounted file systems might communicate file names between 
each other; how that would work is not at all clear from the PEP.  If 
this is an insoluble or un-addressed issue, it should be stated.  (It is 
probably insoluble, because there are multiple ways that cross-mounted 
file systems might translate names; but if so, can we learn something 
from the rules the mounting systems use, so as to be compatible with 
(one of) them?)

Your change to avoid using PUA characters, together with the rule 
suggested by MRAB in another branch of this thread of treating 
half-surrogates as invalid byte sequences, may avoid the data puns I'm 
concerned about.
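
A rough sketch of that rule on the bytes side, assuming Python 3's 
"surrogateescape" handler and a UTF-8 decoder that rejects encoded 
surrogates; fs_decode and fs_encode are hypothetical helper names, not 
anything the PEP defines.

```python
# Assumption: escaping behaves like Python 3's "surrogateescape" handler.
HANDLER = "surrogateescape"

def fs_decode(raw: bytes) -> str:
    """Hypothetical helper: bytes file name -> str."""
    return raw.decode("utf-8", HANDLER)

def fs_encode(name: str) -> bytes:
    """Hypothetical helper: str file name -> bytes."""
    return name.encode("utf-8", HANDLER)

# Bytes that look like a UTF-8 encoding of the lone surrogate U+DCFF.
surrogate_bytes = "\udcff".encode("utf-8", "surrogatepass")
assert surrogate_bytes == b"\xed\xb3\xbf"

# The decoder treats them as invalid, so each byte is escaped on its own
# instead of producing U+DCFF directly.
escaped = fs_decode(surrogate_bytes)
assert escaped == "\udced\udcb3\udcbf"
assert fs_encode(escaped) == surrogate_bytes

# Hence the single byte 0xFF and the three-byte sequence stay distinct.
assert fs_decode(b"\xff") == "\udcff"
assert fs_decode(b"\xff") != escaped
```

This only shows that two different byte strings do not decode to the 
same str; a lone surrogate handed over by a str interface is a separate 
case.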

It is not clear how half-surrogate characters would be displayed when 
the user prints or displays such a file name string.  It would seem that 
programs that display file names to users might still have issues with 
such strings; an escaping mechanism that uses displayable characters 
would have an advantage there.
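
As one possible illustration of the display issue, and not something 
the PEP specifies, a program could re-escape such names with 
displayable characters before showing them.  display_name is a 
hypothetical helper, and the sketch assumes Python 3's 
"surrogateescape" and "backslashreplace" handlers.

```python
def display_name(name: str) -> str:
    """Hypothetical helper: make a file name safe to show to a user.

    Lone surrogates cannot be encoded as UTF-8, so they are replaced
    by a visible backslash escape instead.
    """
    return name.encode("utf-8", "backslashreplace").decode("utf-8")

# A file name with one undecodable byte, escaped as a half-surrogate.
name = b"caf\xff".decode("utf-8", "surrogateescape")

# Strict encoding (what a UTF-8 terminal needs) fails on such a name.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

shown = display_name(name)
print(shown)
```

The displayed form is lossy in the other direction: a file whose name 
really contains a backslash followed by "udcff" would look the same.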

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking