[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Fri Apr 24 21:41:25 CEST 2009

On approximately 4/24/2009 11:40 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Antoine Pitrou writes:
>  > Stephen J. Turnbull <stephen <at> xemacs.org> writes:
>  > > 
>  > > Well, the problem is that both parts are false.  If you didn't start
>  > > with a valid string in a known encoding, you shouldn't treat it as
>  > > characters because it's not.  Hand it to a careful API, and you'll get
>  > > an Exception raised in your face.
>  > 
>  > Which "careful API" are you talking about?
>  >
>  > > OTOH, at least some of those who feel lucky and use it
>  > > naively are going to turn out to be wrong.
>  > 
>  > Why will they turn out to be wrong?

Because the encoding is not reliably reversible.  That is why I proposed 
one that is.

> To quote the PEP:
> """
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also. Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces.
> """

And so my encoding (1) doesn't alter the data stream for any valid 
Windows file name, which is where the naivest users reside; (2) doesn't 
alter the data stream for any Posix file name that is encoded as UTF-8 
and contains no ? characters [I perceive the use of ? in file names to 
be rare on Posix, both from experience and because of the other problems 
such use causes]; and (3) doesn't introduce data puns within 
applications that are correctly coded to know the encoding occurs.  The 
encoding technique in the PEP not only can produce data puns, and so is 
not reversible, it also provides no reliable mechanism for detecting 
that one has occurred.
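
A minimal sketch of the limitation described in the PEP text quoted 
above, assuming the handler behaves like the one CPython 3 ships as 
"surrogateescape" (the draft calls it python-escape).  The file name is 
made up.  The last part shows one possible reading of "data pun": two 
different strings that encode to the same bytes.  It is an illustration 
under those assumptions, not a statement of what the PEP specifies.

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name, not valid UTF-8

# Undecodable bytes are carried inside the str as lone surrogates.
name = raw.decode("utf-8", "surrogateescape")

# bytes -> str -> bytes is lossless only if the same handler is used.
round_trip = name.encode("utf-8", "surrogateescape")

# The default strict handler rejects the escaped character.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Two different strings that encode to the same bytes, so
# str -> bytes -> str does not give back what you started with.
escaped = "\udcc3\udca9"
plain = "\u00e9"
same_bytes = (escaped.encode("utf-8", "surrogateescape")
              == plain.encode("utf-8"))
```

Nothing in the resulting bytes records which of the two strings they 
came from.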

> But you can't know that.  These are now "just strings", which could
> end up in pickles and other persistent objects, be passed across
> network interfaces (remote copy, for example), etc, etc, and there is
> no way to guarantee that the recipient will understand the rules,
> unless the application encapsulates them in some kind of
> representation that says "I look like a Unicode but I'm really just
> encoded bytes."  

This could happen.  Well-formed programs need to apply the encoding at 
the boundaries.  Python could encapsulate its interfaces to the file 
system, but cannot encapsulate other interfaces.  Fortunately, something 
that is pickled would probably be unpickled by Python, and therefore all 
would be well.  But any interface that expects a file name, and is not 
encapsulated by Python, must be encapsulated by the application.
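
A rough sketch of what "encoding at the boundaries" could look like in 
an application, assuming a POSIX system and the os.fsdecode/os.fsencode 
functions found in current Python 3.  The helper name to_external and 
the file name are hypothetical, chosen only for illustration.

```python
import os
import pickle


def to_external(name: str) -> bytes:
    """Hypothetical boundary helper: str file name -> original bytes."""
    return os.fsencode(name)


raw = b"caf\xe9.txt"     # made-up file name, not valid UTF-8
name = os.fsdecode(raw)  # the str form the application works with

# Pickled and unpickled by Python, the string survives unchanged.
restored = pickle.loads(pickle.dumps(name))

# At a non-Python boundary, hand over bytes, not the escaped str.
wire = to_external(restored)
```

The application, not Python, has to remember to call such a helper at 
every interface Python does not wrap.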

> But the whole point is to turn them into plain old
> strings so people *don't have to bother* keeping track.

And if that is the point, it isn't worth doing.  If the point is that it 
can minimize the amount of existing file name manipulation code, written 
with string operations, that must be reworked to stay functional during 
a 2to3 migration, then it can be worth doing.  But I think it should be 
done with an encoding that doesn't introduce undetectable data puns, 
whether mine or some other encoding with that characteristic, and not 
the one presently in the PEP, which does introduce them.

> As I already said, this is no worse than the current situation, but it
> gives the impression that Python has a standard "solution".  (Yes, I
> know Martin doesn't claim it's a solution to any of those problems.
> The point is user perception.)
> I have to wonder whether having a standard way of not solving any
> problems is better than having no standard way of not solving any
> problems.  It may be, and it probably can't hurt, which is why I'm +0.

Interesting phraseology there, Stephen!

I'm +1 on the concept, -1 on the PEP, due solely to the lack of a 
reversible encoding.

Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
