[Python-Dev] PEP 460 reboot
greg.ewing at canterbury.ac.nz
Tue Jan 14 00:22:44 CET 2014
Nick Coghlan wrote:
> so the latter would be less of
> an attractive nuisance when writing code that needs to handle arbitrary
> binary formats and can't assume ASCII compatibility.
Hang on a moment. What do you mean by code that
"handles arbitrary binary formats"?
As far as I can see, the proposed features are for
code that handles *particular* binary formats. Ones
with well-defined fields that are specified to contain
ASCII-encoded text. It's the programmer's responsibility
to make sure that the fields he's treating as ASCII
really do contain ASCII, just as it's his responsibility
to make sure he reads and writes a text file using
the correct encoding.
Now, it's possible that if you were working from an
incomplete spec and some examples, you might be
led to believe that a particular field was ASCII
when in fact it was some ASCII superset such as
latin1 or utf8. In that case, if you parsed it
assuming ASCII, you would get into trouble of
some sort with bytes greater than 127.
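That failure mode is easy to demonstrate. A minimal sketch (the field contents and the choice of Latin-1 as the "true" encoding are illustrative):

```python
# A field that the spec led you to believe is ASCII, but which is
# actually Latin-1, an ASCII superset.
field = "café".encode("latin-1")   # b'caf\xe9'

# Bytes below 128 decode fine under the ASCII assumption...
assert field[:3].decode("ascii") == "caf"

# ...but the byte 0xE9 (> 127) blows up the moment it appears.
try:
    field.decode("ascii")
except UnicodeDecodeError:
    print("ASCII assumption violated by byte > 127")
```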
However, the proposed formatting operations are
concerned only with *generating* binary data, not
parsing it. Under Guido's proposed semantics, all
of the ASCII formatting operations are guaranteed
to produce valid ASCII, regardless of what types
or values are thrown at them. So as long as the
field's true encoding is something ASCII-compatible,
you will always generate valid data.
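The bytes %-formatting that eventually shipped in Python 3.5 (PEP 461) behaves this way for the numeric and ascii() conversions; a quick sketch, with made-up field values:

```python
# Numeric and %a conversions emit pure ASCII regardless of the
# values thrown at them -- even a non-ASCII string comes out as
# its escaped ascii() form.
out = b"id=%d ratio=%f repr=%a" % (42, 0.5, "café")

# Every byte in the generated data is below 128, i.e. valid ASCII.
assert all(byte < 128 for byte in out)
print(out)
```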
> Because I *want to use* the PEP 460 binary interpolation API, but
> wouldn't be able to use Guido's more lenient proposal, as it is a bug
> magnet in the presence of arbitrary binary data.
Where exactly is this "arbitrary binary data" that you
keep talking about? The only place that arbitrary
bytes come into the picture is through b"%s" % b"...",
and that's defined to just pass the bytes straight
through. I don't see how that could attract any
bugs that weren't already present in the data being
interpolated.
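For example, with the bytes interpolation as later implemented in Python 3.5, a bytes argument is copied into the result verbatim (the payload here is invented):

```python
# Arbitrary bytes, including NULs and values > 127.
payload = b"\x00\xff\x80binary"

# b"%s" with a bytes operand passes the bytes straight through,
# byte for byte, with no encoding step involved.
framed = b"DATA %s END" % payload
assert framed == b"DATA \x00\xff\x80binary END"
```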
> The LHS may or may not be tainted with assumptions about ASCII
> compatibility, which means it effectively *is* tainted with such
> assumptions, which means code that needs to handle arbitrary binary data
> can't use it and is left without a binary interpolation feature.
If I understand correctly, what concerns you here
is that you can't tell by looking at b"%s" % x
whether it encodes anything as ASCII without knowing
the type of x.
I'm not sure how serious a problem that would be.
Most of the time I think it will be fairly obvious
from the purpose of the code what the type of x
is *intended* to be. If it's not actually that type,
then clearly there's a bug somewhere.
Of all such possible bugs, the one most likely to
arise due to a confusion in the programmer's mind
between text and bytes would be for x to be a string
when it was meant to be bytes or vice versa.
Due to the still-very-strong separation between text
and bytes in Py3, this is unlikely to happen without
something else blowing up first.
Even if it does happen, it won't result in a data-
dependent failure. If b"%s" % 'hello' were defined to
interpolate 'hello'.encode('ascii'), then there *would*
be cause for concern. But this is not what Guido
proposes -- instead he proposes interpolating
ascii('hello') == "'hello'". This is almost certainly
*never* what the file spec calls for, so you'll find
out about it very soon one way or another.
Effectively this means that b"%s" % x where x is a
string is useless, so I'd much prefer it to just
raise an exception in that case to make the failure
immediately obvious. But either way, you're not
going to end up with a latent failure waiting for
some non-ASCII data to come along before you notice it.
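For what it's worth, the semantics that eventually shipped as PEP 461 in Python 3.5 took exactly this route: b"%s" with a str operand raises immediately, and the ascii() behaviour is only available through the explicit %a code:

```python
# %s rejects str outright -- the failure is immediate and
# independent of the data, not a latent ASCII-dependent bug.
try:
    b"%s" % "hello"
except TypeError:
    print("str rejected by b'%s'")

# The ascii()-style interpolation must be asked for explicitly.
assert b"%a" % "hello" == b"'hello'"
```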
To summarise, I think the idea of binary format strings
being too "tainted" for a program that does not want
to use ASCII formatting to rely on is mostly FUD.