[Python-Dev] PEP 460: allowing %d and %f and mojibake

Glenn Linderman v+python at g.nevcal.com
Tue Jan 14 07:01:35 CET 2014

On 1/13/2014 8:58 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>   > On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:
>   >> Glenn Linderman writes:
>   >>> "smuggled binary" (great term borrowed from a different
>   >>> subthread) muddies the waters of what you are dealing with.
>   >> Not really. The "mud" is one or more of the serious deficiencies.
>   >> It can be removed, I believe (and Nick apparently does, too).
>   >> "asciistr" is one way to try that.
>   > Yes really. Use of smuggled binary means the str containing it can
>   > no longer be treated completely as a str. That is "muddier" than
>   > having a str that is only a str.
> You don't seem to understand what *asciistr* is: it's a *different
> type* that is simultaneously compatible in operation with bytes and
> str, by automatically converting to whichever it is used with.  If we
> used asciistr, str would no longer be muddy (except in cases where we
> would have used surrogateescape anyway).

No, I haven't fully understood what asciistr is, only Nick's several 
descriptions of it.

I do understand it is a different type, and can interact with both bytes 
and str.

If it automatically converts, then it sounds terribly inefficient with 
long data. I didn't hear Nick say that, but maybe I missed it.

You mentioned asciistr in the snippet above, but most of what you have 
been writing about smuggled binary was using str... I hadn't grokked 
that you were now a full-fledged proponent of asciistr, and were now 
proposing to put your smuggled binary into asciistr.

> You also don't seem to understand that bytes are conceptually pure
> mud.  Anything that is pushed to bytes because you don't know what
> type it is (or because at the time the program is written, the type
> can't be known) is no longer subject to duck-typing.

If you are talking str, then bytes are mud. If you are talking bytes, 
then str is mud.

I wouldn't think of "pushing something to bytes" (whatever that means) 
because I don't know what it is... I may manipulate bytes because I know 
what they are, and that is the most appropriate form for that piece of 
data for the present manipulations; if something is text, I want to 
transform the bytes to str if I need to manipulate it, parse it, or 
present it. If I don't know what something is, it is because it didn't 
meet my expectations of what it should be, and I want to present an 
error, which may include some representation (probably hex) of some of 
the bytes that cannot be understood.
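
A minimal sketch of that error path, assuming the expected encoding is 
UTF-8; decode_text_field is a hypothetical helper name, not an existing 
API:

```python
def decode_text_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a field that is expected to be text, or fail with a hex dump."""
    try:
        # errors='strict' is the default, so bad bytes raise instead of leaking.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the codec could not decode.
        bad = raw[exc.start:exc.end]
        raise ValueError(
            f"expected {encoding} text, but bytes at offset {exc.start} "
            f"are not valid: {bad.hex()}"
        ) from exc
```

Well-formed input comes back as str; anything else produces an error 
that shows the offending bytes in hex rather than passing them along.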

But if I'm "pushing to bytes", which I would interpret as creating a 
byte stream, then I know what I have, and I need to convert it to bytes 
either to store it in a file, or communicate it to another process. 
That's far from not knowing what it is.

> So the question is "how is mud best handled?"  Obviously,
> incorporating it in str with .decode('latin1') is inappropriate.

Glad to hear you say that; I thought that was what you were promoting, 
when you said, in an earlier message:

On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>   > the proposals to embed binary in Unicode by abusing Latin-1
>   > encoding.
> Those aren't "proposals", they are currently feasible techniques in
> Python 3 for *some* use cases.

Back to this one, though.

> However, if you use .decode('ascii') you have your choice of error
> handlers.  If you use errors='strict' then no mud can get in.  Use of
> any other error handler is obviously a "consenting adults" behavior;
> it should only be done when you expect that you can keep the muddy str
> from leaking into places where it might be passed to an I/O function.
> (Note that the internal processing of an application that never
> outputs such a str is completely conformant to the Unicode Standard.
> That's not a goal of Python, since surrogateescape is designed to be
> used on output too.  But if the developer applies that standard to
> each *program component*, he's going to be in pretty good shape.)
> If you use asciistr, then you're pretty much in complete control.
> The exception is operations that munge individual characters (case
> conversion).  If you have a protocol with ASCII keywords but their
> case is specified, you'll need to define another type to remove the
> case-munging methods if you want that level of safety.
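
For reference, the error-handler behavior described in the quoted 
paragraph can be sketched with the standard codecs. The sample data is 
made up for illustration; the calls are plain bytes.decode and 
str.encode:

```python
data = b"Subject: caf\xe9"  # one non-ASCII byte, actual encoding unknown

# errors='strict' (the default): undecodable bytes never enter the str.
try:
    data.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='surrogateescape': byte 0xE9 is smuggled in as lone surrogate U+DCE9.
smuggled = data.decode("ascii", errors="surrogateescape")

# The same handler on output restores the original bytes exactly.
restored = smuggled.encode("ascii", errors="surrogateescape")

# Without the handler, the muddy str cannot be encoded for output.
try:
    smuggled.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False

# latin-1 maps every byte to a code point, so it never fails and the
# result cannot be told apart from genuine text.
as_latin1 = data.decode("latin-1")
```

With surrogateescape the smuggled byte stays detectable (a strict 
encode refuses it), whereas the latin-1 result looks like ordinary 
text.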

The above doesn't sound like a use case I care about, much.  If I get a 
garbled file without an accurate definition of what it contains, then I 
probably want to stick it in the trash. The only "processing" that can 
be done is to pass on the garbage to someone else, and stink up their 
system, and that can be done purely as bytes.

> If, as in your proposal, bytes are tagged with descriptions, you are
> effectively creating types on the fly.  But if the program doesn't
> anticipate that, they're mud.

Interpreting a file format or wire protocol requires parsing and 
manipulating an incoming byte stream, and converting it to useful types 
in the program... if it can't be converted to useful types, then why 
bother parsing it? So the rest of my discussion was not talking about 
creating types on the fly, but about a systematic way of converting a 
well-specified byte stream (file format, or wire protocol) to a 
collection of useful types, in an organized manner, that might be 
verifiable, rather than with ad-hoc coding. And similarly in reverse... 
after manipulating the objects to perform useful transformations, 
possibly based on user input (that's what a program does), then to write 
them back out to a byte stream in modified form, in an organized manner, 
that might be verifiable, rather than with ad-hoc coding.
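
As a small illustration of that round trip, here is a sketch using the 
struct module. The wire format (magic, version, length-prefixed UTF-8 
name) and the names Record, parse_record and dump_record are all 
invented for the example:

```python
import struct
from dataclasses import dataclass

# Hypothetical wire format, assumed for illustration only:
#   2-byte magic b"GL", 1-byte version, 2-byte big-endian name length,
#   followed by the name encoded as UTF-8.
HEADER = struct.Struct(">2sBH")

@dataclass
class Record:
    version: int
    name: str

def parse_record(buf: bytes) -> Record:
    """Convert a byte stream to a typed object, rejecting anything malformed."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, name_len = HEADER.unpack_from(buf)
    if magic != b"GL":
        raise ValueError(f"bad magic: {magic.hex()}")
    body = buf[HEADER.size:HEADER.size + name_len]
    if len(body) != name_len:
        raise ValueError("truncated name")
    return Record(version, body.decode("utf-8"))

def dump_record(rec: Record) -> bytes:
    """Convert the typed object back to the same byte layout."""
    name = rec.name.encode("utf-8")
    return HEADER.pack(b"GL", rec.version, len(name)) + name
```

Between parse_record and dump_record the program only sees an int and 
a str; bytes exist only at the boundary.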

> If the program doesn't anticipate all
> of them those descriptions that are unhandled become mud, too.  ITSM
> that the "syntax descriptor" feature is already present in Python, and
> it's called "class".  So, IMHO, simply converting to an appropriate
> Python type on input is what should be done, but in any case, I don't
> see how adding a "syntax descriptor" attribute to bytes is going to
> improve the situation significantly.

Syntax descriptors would be a description of the substructures of a file 
format (think TIFF files) or wire protocol, and might allow parsing of 
binary files similarly to the way computer languages are parsed, 
producing errors when encountering mud.  What you dismiss as "converting 
to an appropriate Python type on input" can be quite complex for 
complex file formats, but it is the process of converting to such a 
hierarchy of Python objects that was to be described by the syntax 
descriptors.

> Note that such a class can postpone parsing for efficiency or lack of
> information reasons, and store the object as bytes until needed.  But
> this is not the same as passing around naked bytes, because the class
> can ensure that bytes can't get out, only parsed objects.
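
A rough sketch of the kind of class described in the quoted paragraph, 
assuming the payload is UTF-8 text; LazyText is a hypothetical name, 
and the "bytes can't get out" guarantee is only by convention (a 
leading underscore), since Python does not enforce privacy:

```python
class LazyText:
    """Keeps the raw bytes internally and hands out only a decoded str."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self._raw = raw            # not part of the public interface
        self._encoding = encoding
        self._text = None          # filled in on first access

    @property
    def text(self) -> str:
        if self._text is None:
            # Parsing is postponed until here; strict decoding raises
            # UnicodeDecodeError rather than letting bad bytes through.
            self._text = self._raw.decode(self._encoding)
        return self._text
```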

Sure, it could.  My proposal is suggesting that the distribution of 
bytes to objects in a hierarchy might be automated in the sense of 
parsing the binary format, so that instead of writing "a class" for the 
whole, that class would be pre-written, based on the syntax description 
of the file, and matching that with the syntax descriptions of the 
component types.  It is really a topic for python-ideas, to flesh it out 
further, but it seemed related, as a use case, a class that would live 
on the bytes processing boundary, producing other objects, some of which 
may be text strings, in an organized, probably hierarchical, collection 
of objects.
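
To give a flavor of the idea, here is a toy sketch of a generic parser 
driven by a syntax description. The description notation (a list of 
name/spec pairs, where a spec is a struct format string, a ("text", 
length) tuple, or a nested list) is invented purely for this example 
and is not a worked-out proposal:

```python
import struct

def parse(desc, buf, offset=0):
    """Walk a description and return (dict of parsed fields, new offset)."""
    out = {}
    for name, spec in desc:
        if isinstance(spec, list):
            # Nested substructure: recurse, producing a nested dict.
            out[name], offset = parse(spec, buf, offset)
        elif isinstance(spec, tuple):
            # ("text", length): fixed-length UTF-8 text field.
            _, length = spec
            chunk = buf[offset:offset + length]
            if len(chunk) != length:
                raise ValueError(f"{name}: truncated at offset {offset}")
            out[name] = chunk.decode("utf-8")
            offset += length
        else:
            # Leaf scalar described by a struct format string.
            try:
                (out[name],) = struct.unpack_from(spec, buf, offset)
            except struct.error as exc:
                raise ValueError(f"{name}: {exc}") from None
            offset += struct.calcsize(spec)
    return out, offset

# Hypothetical description of a tiny image-like record.
IMAGE = [
    ("header", [("width", ">H"), ("height", ">H")]),
    ("label", ("text", 3)),
]

sample = struct.pack(">HH", 640, 480) + b"cat"
tree, end = parse(IMAGE, sample)
```

Here tree is a small hierarchy of ordinary Python objects (ints and a 
str), and input that does not match the description raises an error 
instead of being passed along as bytes.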
