[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman
v+python at g.nevcal.com
Mon Apr 27 08:39:41 CEST 2009
On approximately 4/25/2009 5:35 AM, came the following characters from
the keyboard of Martin v. Löwis:
>> Because the encoding is not reliably reversible.
>
> Why do you say that? The encoding is completely reversible
> (unless we disagree on what "reversible" means).
>
>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
>> reversible encoding.
>
> Then please provide an example for a setup where it is not reversible.
>
> Regards,
> Martin
It is reversible if you know that it is decoded, and apply the encoding.
But if you don't know whether a str was produced by that decoding, then
applying the reverse transform can convert an undecoded str that happens
to match a decoded str into a form that it could have taken, but never did.
The problem is that there is no guarantee that the str interface
provides only strictly conforming Unicode, so decoding bytes to
non-strictly conforming Unicode can result in a data pun between
non-strictly conforming Unicode coming from the str interface and bytes
being decoded to non-strictly conforming Unicode coming from the bytes
interface.
Any particular program that consistently uses one or the other
(bytes vs str) APIs under the covers might never be affected by such a
data pun, but programs that may use both types of interface could
potentially see a data pun.
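
A minimal sketch of the pun described above. It assumes the
`surrogateescape` error handler found in later Python 3 releases, which
may differ in detail from the draft under discussion; the file name is
made up for illustration.

```python
# Bytes path: a file name that is not valid UTF-8 (hypothetical example).
raw = b"caf\xe9"
decoded = raw.decode("utf-8", "surrogateescape")   # 0xE9 becomes U+DCE9
round_trip = decoded.encode("utf-8", "surrogateescape")

# Str path: a str that arrived through a str interface and already
# contains a lone surrogate. It was never produced by decoding bytes.
native = "caf\udce9"

# The two strs are indistinguishable, so the reverse transform maps the
# native str to bytes it never had.
punned = native.encode("utf-8", "surrogateescape")

print(decoded == native, punned == raw)
```

Nothing in the str itself records which path it came from, which is the
sense in which the transform is reversible only when the origin is known.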
If your PEP depends on consistent use of one or the other type of
interface, you should say so, and if the platform only provides that
type of interface, maybe all is well. Both types of interfaces are
available on Windows; perhaps POSIX only provides native bytes
interfaces, and if the PEP is the only way to provide str interfaces,
then perhaps consistent use is required.
There are still issues regarding how Windows and POSIX programs that are
sharing cross-mounted file systems might communicate file names between
each other, which is not at all clear from the PEP. If this is an
insoluble or unaddressed issue, it should be stated. (It is probably
insoluble, because there are multiple ways that cross-mounted file
systems might translate names; but if so, can we learn something
from the rules the mounting systems use, so as to be compatible with
at least one of them?)
Your change to avoid using PUA characters, together with the rule
suggested by MRAB in another branch of this thread of treating
half-surrogates as invalid byte sequences, may avoid the data puns I'm
concerned about.
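
A sketch of how that rule plays out on the bytes side, again assuming
the `surrogateescape` handler of later Python 3 releases rather than the
draft text: bytes that spell out an encoded half-surrogate are rejected
as UTF-8 and escaped byte by byte, so they cannot collide with the
escape of a single undecodable byte.

```python
# ED B3 A9 is what U+DCE9 would look like if surrogates were encodable
# in UTF-8; strict UTF-8 treats the sequence as invalid.
surrogate_bytes = b"\xed\xb3\xa9"
escaped = surrogate_bytes.decode("utf-8", "surrogateescape")

# Each offending byte gets its own escape character in U+DC80..U+DCFF.
all_escapes = all(0xDC80 <= ord(ch) <= 0xDCFF for ch in escaped)

# The result differs from the escape of the single byte 0xE9 ...
collides = escaped == "\udce9"

# ... and still round-trips to the original bytes.
restored = escaped.encode("utf-8", "surrogateescape")

print(len(escaped), all_escapes, collides, restored == surrogate_bytes)
```

This covers the bytes-to-str direction only; a str arriving from a str
interface with a lone surrogate already in it is a separate case.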
It is not clear how half-surrogate characters would be displayed, when
the user prints or displays such a file name string. It would seem that
programs that display file names to users might still have issues with
such; an escaping mechanism that uses displayable characters would have
an advantage there.
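
To illustrate the display concern, a small sketch assuming a UTF-8
output encoding and the standard `backslashreplace` error handler; the
file name is hypothetical. A lone surrogate cannot be encoded strictly,
while a displayable escape can always be shown.

```python
name = "caf\udce9"  # str holding an escaped undecodable byte

# Strict encoding, as a UTF-8 terminal or log file would need, fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A displayable form that uses only printable ASCII for the escape.
shown = name.encode("utf-8", "backslashreplace").decode("utf-8")

print(strict_ok, shown)
```

The displayable form is for presentation only; it is not meant to be
passed back to the file system.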
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking