[Python-Dev] PEP 460 reboot

Nick Coghlan ncoghlan at gmail.com
Mon Jan 13 23:43:05 CET 2014


On 14 Jan 2014 04:58, "Guido van Rossum" <guido at python.org> wrote:
>
> Let me try rebooting the reboot.
>
> My interpretation of Nick's argument is that he is asking for a bytes
> formatting language that doesn't have an implicit ASCII assumption.
>
> To me this feels absurd. The formatting codes (%s, %c) themselves are
> expressed as ASCII characters. If you include anything else in the
> format string besides formatting codes (e.g. b'<%s>'), you are giving
> it as ASCII characters. I don't know what characters the EBCDIC codes
> 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
> it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.

Except we allow string escapes and programmatic creation of format strings,
so while ASCII snippets in formatting code are certainly easier to type,
they are by no means a mandatory feature of using interpolation operations.

Can you roll your own binary interpolation support with join() and simple
concatenation? Yes, but Antoine's proposal provides a clean and reliable
approach to flexible binary templating that isn't offered by the more
lenient version.

My problem is with telling Python users that if they're working with ASCII
compatible data, they get access to a clean interpolation mini-language for
templating purposes, but if they aren't, they don't.

That's the part I see as potentially breaking the text model: now you have
a convenient API on a core type encouraging you to treat your data as ASCII
compatible with implicit serialisation of semantic data as ASCII text, even
if that may not be appropriate.

If pure binary interpolation is added at the same time (regardless of the
exact spelling, so long as it's as easy to access as the ASCII templating),
that objection goes away.

That said, the fact that the interpolation mini-languages themselves assume
ASCII is the most compelling rationale I have heard so far for treating
interpolation as an operation that inherently assumes ASCII compatibility -
you can't use arbitrary bytes in your formatting strings without escaping
the formatting characters appropriately. While I don't see that as
substantially different to needing to escape them in order to retain them
in the output of text or ASCII formatting, it's at least a teachable
rationale for the absence of a pure binary equivalent.
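
To illustrate the escaping point, here is a minimal sketch. It assumes a Python version where bytes objects support the % operator (3.5 and later), and escape_template is a hypothetical helper name, not an existing API. Any literal 0x25 byte in arbitrary binary data has to be doubled before the data can be used as part of a format string:

```python
def escape_template(raw):
    """Hypothetical helper: make arbitrary bytes safe to embed in a % template."""
    # A lone b"%" would be parsed as the start of a formatting code,
    # so every literal percent byte is doubled.
    return raw.replace(b"%", b"%%")


# Binary prefix that happens to contain the byte 0x25 ('%' in ASCII).
prefix = b"\x25\xff\x00"
template = escape_template(prefix) + b"%s"
result = template % (b"payload",)
```

After formatting, the doubled byte collapses back to a single one, so result starts with the original prefix unchanged.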

> If I had some byte strings in an unknown encoding (but the same
> encoding for all) that I needed to concatenate I would never think of
> '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)
>
> If I see some code using *any* formatting operation (regardless of
> whether it's %d, %r, %s or %c) I am going to assume that there is some
> ASCII-ness, and if there isn't, the code's author has obscured their
> goal to me.

Right, that's a rationale I can explain to people. It also occurred to me
that it's easier to build pure binary interpolation on top of ASCII
interpolation than I previously thought: I can just check all the input
values are compatible with memoryview. At that point, attempting to pass in
anything that would trigger implicit encoding at the formatting stage will
fail.
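
A minimal sketch of that idea, assuming a Python version where bytes supports the % operator (3.5 and later). The name binary_interpolate is hypothetical and only illustrates the memoryview check described above:

```python
def binary_interpolate(fmt, *values):
    """Hypothetical helper: interpolate only buffer-exporting values into fmt."""
    checked = []
    for value in values:
        # memoryview() raises TypeError for objects that do not export a
        # buffer (str, int, float, ...), so nothing that would need implicit
        # encoding can reach the formatting step.
        with memoryview(value) as view:
            checked.append(view.tobytes())
    return fmt % tuple(checked)


framed = binary_interpolate(b"<%s|%s>", b"\xff\x00", bytearray(b"ab"))
```

Passing an int or a str as one of the values fails with TypeError before any formatting happens.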

(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the
bytes(int) trap)
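
For readers unfamiliar with that trap, a short sketch (strict_bytes is a hypothetical name used only for illustration):

```python
# bytes(int) does not convert the number to bytes; it allocates that many
# zero bytes instead.
zeros = bytes(3)


def strict_bytes(obj):
    """Hypothetical helper: copy a buffer exporter to bytes, rejecting ints."""
    # memoryview() only accepts objects that export a buffer, so an int
    # raises TypeError here instead of silently producing zero bytes.
    return bytes(memoryview(obj))


copied = strict_bytes(bytearray(b"abc"))
```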

> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

Plus there genuinely are formats where different parts have different
encodings and you rely on metadata or format definitions to know what they
are.

I would actually suggest something like Brett's approach for %s, but with
memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate
it directly, otherwise invoke normal string formatting and then do strict
ASCII encoding at the end.

That way people don't have to learn new formatting mini-languages and only
have two new behaviours to learn: buffer exporters are interpolated
directly, anything else is formatted normally and then implicitly encoded
as strict ASCII.
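
The rule for a single %s value could be modelled roughly as follows. This is only a sketch of the suggestion, not an implementation of any accepted behaviour, and format_s is a hypothetical name:

```python
def format_s(obj):
    """Hypothetical model of the suggested %s rule for one value."""
    try:
        view = memoryview(obj)
    except TypeError:
        # Not a buffer exporter: format as text, then encode as strict
        # ASCII, so non-ASCII output raises UnicodeEncodeError.
        return str(obj).encode("ascii", "strict")
    # Buffer exporter: interpolate the raw bytes directly.
    with view:
        return view.tobytes()


raw = format_s(b"\xff\x00")
number = format_s(42)
```

Under this model a str containing non-ASCII characters is rejected at formatting time rather than being silently encoded.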

>
> In my head I make the following classification of situations where you
> work with bytes and/or text.
>
> (A) Pure binary formats (e.g. most IP-level packet formats, media
> files, .pyc files, tar/zip files, compressed data, etc.). These are
> handled using the struct module (e.g. tar/zip) and/or custom C
> extensions (e.g. gzip).
>
> (B) Encoded text. Here you should just decode everything into str
> objects and parse your text at that level. If you really want to
> manipulate the data as bytes (e.g. because you have a lot of data to
> process and very light processing) you may be able to do it, but
> unless it's a verbatim copy, you are probably going to make
> assumptions about the encoding. You are also probably going to mess up
> for some encodings (e.g. leave BOM turds in the middle of a file).
>
> (C) Loosely text-based protocols and formats that have an ASCII
> assumption in the spec. Most classic Internet protocols (FTP, SMTP,
> HTTP, IRC, etc.) fall in this category; I expect there are also plenty
> of file formats using similar conventions (e.g. mailbox files). These
> protocols and formats often require text-ish manipulations, e.g. for
> case-insensitive headers or commands, or to split things at
> whitespace. This is where I find uses for the current ASCII-assuming
> bytes operations (e.g. b.lower(), b.split(), but also int(b)) and
> where the lack of number formatting (especially %d and %x) is most
> painful. I see no benefit in forcing the programmer writing such
> protocol handling code to use more cumbersome ways of converting
> between numbers and bytes, nor in forcing them to insert an
> encoding/decoding layer -- these protocols often switch between text
> and binary data at line boundaries, so the most basic part of parsing
> (splitting the input into lines) must still happen in the realm of
> bytes.
>
> IMO PEP 460 and the mindset that goes with it don't apply to any of
> these three cases.
>
> Also, IMO requiring a new type to handle (C) seems to add too
> much complexity, and adds to porting efforts. I may have felt
> differently in the past, but ATM I feel that if newer versions of
> Python 3 make porting of Python 2 code easier, through minor
> compromises, that's a *good* thing. (Example: adding u"..." literals
> to 3.3.)

You've persuaded me well enough that I think my last proposal above goes
*further* than your original one in allowing text formatting when
interpolating to ASCII compatible formats :)

Cheers,
Nick.

>
> --
> --Guido van Rossum (python.org/~guido)