[Python-Dev] PEP 460: allowing %d and %f and mojibake

Tue Jan 14 05:58:56 CET 2014

Glenn Linderman writes:
 > On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:
 >> Glenn Linderman writes:

 >>> "smuggled binary" (great term borrowed from a different
 >>> subthread) muddies the waters of what you are dealing with.

 >> Not really. The "mud" is one or more of the serious deficiencies.
 >> It can be removed, I believe (and Nick apparently does, too).
 >> "asciistr" is one way to try that.

 > Yes really. Use of smuggled binary means the str containing it can
 > no longer be treated completely as a str. That is "muddier" than
 > having a str that is only a str.

You don't seem to understand what *asciistr* is: it's a *different
type* that is simultaneously compatible in operation with bytes and
str, by automatically converting to whichever it is used with.  If we
used asciistr, str would no longer be muddy (except in cases where we
would have used surrogateescape anyway).
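
To illustrate the idea only, here is a minimal sketch of such a type.
It is not the actual asciistr prototype: the class name AsciiStr and
the choice to support only concatenation are assumptions made for this
example.

```python
class AsciiStr(str):
    """Hypothetical sketch: ASCII-only text that joins either str or bytes."""

    def __new__(cls, value):
        # Refuse anything that is not pure ASCII (raises UnicodeEncodeError).
        value.encode("ascii")
        return super().__new__(cls, value)

    def __add__(self, other):
        if isinstance(other, bytes):
            # Used with bytes: behave as bytes.
            return self.encode("ascii") + other
        if isinstance(other, str):
            # Used with str: behave as a plain str.
            return str(self) + other
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, bytes):
            return other + self.encode("ascii")
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented


keyword = AsciiStr("Content-Length: ")
as_bytes = keyword + b"42"   # result is bytes
as_text = keyword + "42"     # result is str
```

The result type follows the other operand, so neither the bytes nor
the str it is combined with picks up anything foreign.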

You also don't seem to understand that bytes are conceptually pure
mud.  Anything that is pushed to bytes because you don't know what
type it is (or because at the time the program is written, the type
can't be known) is no longer subject to duck-typing.

So the question is "how is mud best handled?"  Obviously,
incorporating it in str with .decode('latin1') is inappropriate.
However, if you use .decode('ascii') you have your choice of error
handlers.  If you use errors='strict' then no mud can get in.  Use of
any other error handler is obviously a "consenting adults" behavior;
it should only be done when you expect that you can keep the muddy str
from leaking into places where it might be passed to an I/O function.
(Note that the internal processing of an application that never
outputs such a str is completely conformant to the Unicode Standard.
That's not a goal of Python, since surrogateescape is designed to be
used on output too.  But if the developer applies that standard to
each *program component*, he's going to be in pretty good shape.)
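
A small sketch of the difference between the handlers, using only the
built-in codecs.  The sample header bytes are made up for the example.

```python
raw = b"Subject: caf\xe9"   # one non-ASCII byte, encoding unknown

# errors='strict' (the default): no mud can get in.
try:
    raw.decode("ascii", errors="strict")
    strict_accepted = True
except UnicodeDecodeError:
    strict_accepted = False

# errors='surrogateescape': the undecodable byte is smuggled into the
# str as a lone surrogate code point.
muddy = raw.decode("ascii", errors="surrogateescape")

# The same handler on output restores the original bytes exactly.
round_trip = muddy.encode("ascii", errors="surrogateescape")

# A strict encoder at an I/O boundary refuses the muddy str.
try:
    muddy.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

Here muddy is a str, but only the component that created it knows that
it must go back out through surrogateescape.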

If you use asciistr, then you're pretty much in complete control.
The exception is operations that munge individual characters (e.g.,
case conversion).  If you have a protocol with ASCII keywords whose
case is specified, you'll need to define another type to remove the
case-munging methods if you want that level of safety.

If, as in your proposal, bytes are tagged with descriptions, you are
effectively creating types on the fly.  But if the program doesn't
anticipate that, they're mud.  If the program doesn't anticipate all
of them, those descriptions that are unhandled become mud, too.  ISTM
that the "syntax descriptor" feature is already present in Python, and
it's called "class".  So, IMHO, simply converting to an appropriate
Python type on input is what should be done, but in any case, I don't
see how adding a "syntax descriptor" attribute to bytes is going to
improve the situation significantly.

Note that such a class can postpone parsing, for efficiency or because
the information needed is not yet available, and store the object as
bytes until needed.  But
this is not the same as passing around naked bytes, because the class
can ensure that bytes can't get out, only parsed objects.
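
A rough sketch of what I mean, assuming a made-up header value type
(the name ContentLength and its value property are hypothetical, chosen
only for this example):

```python
class ContentLength:
    """Hypothetical type: stores raw bytes, hands out only a parsed int."""

    def __init__(self, raw):
        self._raw = raw        # kept private; never returned to callers
        self._value = None     # filled in on first access

    @property
    def value(self):
        if self._value is None:
            # Parsing is postponed until someone actually needs the value.
            # Strict decoding means undecodable bytes raise here instead
            # of leaking out as mud.
            text = self._raw.decode("ascii", errors="strict")
            self._value = int(text.strip())
        return self._value


header = ContentLength(b" 42 ")
length = header.value
```

Constructing the object costs nothing beyond storing the bytes, and bad
input fails at the point of use with an ordinary ValueError.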