<p dir="ltr"><br>

On 14 Jan 2014 04:58, "Guido van Rossum" <<a href="mailto:guido@python.org">guido@python.org</a>> wrote:<br>

><br>

> Let me try rebooting the reboot.<br>

><br>

> My interpretation of Nick's argument is that he is asking for a bytes<br>

> formatting language that doesn't have an implicit ASCII assumption.<br>

><br>

> To me this feels absurd. The formatting codes (%s, %c) themselves are<br>

> expressed as ASCII characters. If you include anything else in the<br>

> format string besides formatting codes (e.g. b'<%s>'), you are giving<br>

> it as ASCII characters. I don't know what characters the EBCDIC codes<br>

> 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but<br>

> it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.</p>
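<p dir="ltr">To make the EBCDIC point concrete, here is a small sketch. It assumes Python's cp037 codec (EBCDIC US/Canada) as a stand-in for "EBCDIC"; other EBCDIC code pages may map these byte values differently.</p>

```python
# Assumption: cp037 stands in for "EBCDIC" in this sketch.
ascii_codes = bytes([37, 99, 115])   # b"%cs" when read as ASCII

# Decode the same byte values as cp037 to see what they mean there.
as_ebcdic = ascii_codes.decode("cp037")
for code, char in zip(ascii_codes, as_ebcdic):
    print(code, repr(char))

# Conversely, an EBCDIC-encoded "%s" uses entirely different byte values,
# so a byte-level formatter scanning for 37 would not find a format code.
ebcdic_template = "%s".encode("cp037")
print(ebcdic_template)
```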

<p dir="ltr">Except we allow string escapes and programmatic creation of format strings, so while ASCII snippets in formatting code are certainly easier to type, they are by no means a mandatory feature of using interpolation operations. I agree </p>


<p dir="ltr">Can you roll your own binary interpolation support with join() and simple concatenation? Yes, but Antoine's proposal provides a clean and reliable approach to flexible binary templating that isn't offered by the more lenient version.</p>


<p dir="ltr">My problem is with telling Python users that if they're working with ASCII compatible data, they get access to a clean interpolation mini-language for templating purposes, but if they aren't, they don't.</p>


<p dir="ltr">That's the part I see as potentially breaking the text model: now you have a convenient API on a core type encouraging you to treat your data as ASCII compatible with implicit serialisation of semantic data as ASCII text, even if that may not be appropriate.</p>


<p dir="ltr">If pure binary interpolation is added at the same time (regardless of the exact spelling, so long as it's as easy to access as the ASCII templating), that objection goes away.</p>

<p dir="ltr">That said, the fact that the interpolation mini-languages themselves assume ASCII is the most compelling rationale I have heard so far for treating interpolation as an operation that inherently assumes ASCII compatibility - you can't use arbitrary bytes in your formatting strings without escaping the formatting characters appropriately. While I don't see that as substantially different to needing to escape them in order to retain them in the output of text or ASCII formatting, it's at least a teachable rationale for the absence of a pure binary equivalent.</p>


<p dir="ltr">> If I had some byte strings in an unknown encoding (but the same<br>

> encoding for all) that I needed to concatenate I would never think of<br>

> '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)<br>

><br>

> If I see some code using *any* formatting operation (regardless of<br>

> whether it's %d, %r, %s or %c) I am going to assume that there is some<br>

> ASCII-ness, and if there isn't, the code's author has obscured their<br>

> goal to me.</p>

<p dir="ltr">Right, that's a rationale I can explain to people. It also occurred to me that it's easier to build pure binary interpolation on top of ASCII interpolation than I previously thought: I can just check all the input values are compatible with memoryview. At that point, attempting to pass in anything that would trigger implicit encoding at the formatting stage will fail.</p>


<p dir="ltr">(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the bytes(int) trap)</p>
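<p dir="ltr">A minimal sketch of that layering follows. It assumes a Python version where bytes objects support % formatting (3.5 and later), and interpolate_binary is a hypothetical helper name used only for illustration.</p>

```python
def interpolate_binary(template, *values):
    """Hypothetical helper: %-interpolate only buffer-exporting values."""
    # memoryview() raises TypeError for str, int, float and other objects
    # that do not export a buffer, so nothing is implicitly encoded.
    checked = tuple(bytes(memoryview(v)) for v in values)
    return template % checked


print(interpolate_binary(b"%s\x00%s", b"\xff\xfe", bytearray(b"\x01\x02")))

try:
    interpolate_binary(b"%s", 3)
except TypeError as exc:
    print("rejected:", exc)

# The trap from the aside: bytes(int) means "this many zero bytes".
print(bytes(3))
```

<p dir="ltr">Going through memoryview first means an int argument is rejected instead of silently becoming a run of NUL bytes.</p>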

<p dir="ltr">> I hear the objections against b'%s' % 'x' returning b"'x'" loud and<br>

> clear, and if the noise about that sub-issue is preventing folks from<br>

> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.<br>

> use %b which would require its argument to be bytes. Those bytes<br>

> should still probably be ASCII-ish, but there's no way to test that.<br>

> That's fine with me and should be fine to Nick as well -- PEP 460<br>

> doesn't check that your encodings match (how could it? :-), nor does<br>

> plain string concatenation using +.</p>

<p dir="ltr">Plus there genuinely are formats where different parts have different encodings and you rely on metadata or format definitions to know what they are.</p>
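<p dir="ltr">As an illustration of mixed encodings in one record, the sketch below uses the %b code that Python 3.5 and later provide for bytes formatting. Whether that matches the compromise discussed here in every detail is not claimed, and the record layout is hypothetical.</p>

```python
# Two fields of one record, each in the encoding its (hypothetical)
# format definition prescribes.
name_latin1 = "caf\u00e9".encode("latin-1")
name_utf8 = "caf\u00e9".encode("utf-8")

# %b copies the bytes through unchanged; nothing checks that the
# encodings of the two fields agree.
record = b"%b|%b" % (name_latin1, name_utf8)
print(record)

# A str argument is rejected rather than implicitly encoded.
try:
    b"%b" % "x"
except TypeError as exc:
    print("rejected:", exc)
```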

<p dir="ltr">I would actually suggest something like Brett's approach for %s, but with memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate it directly, otherwise invoke normal string formatting and then do strict ASCII encoding at the end.</p>


<p dir="ltr">That way people don't have to learn new formatting mini-languages and only have two new behaviours to learn: buffer exporters are interpolated directly, anything else is formatted normally and then implicitly encoded as strict ASCII.<br>

</p>
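<p dir="ltr">The two behaviours can be sketched for a single field as follows. render_field is a hypothetical name, and this is only an approximation of the rule written in pure Python, not an implementation of any actual bytes formatting machinery.</p>

```python
def render_field(value, spec="%s"):
    """Hypothetical sketch: buffers pass through, the rest becomes ASCII text."""
    try:
        view = memoryview(value)   # succeeds only for buffer exporters
    except TypeError:
        # Not a buffer: format as text, then encode strictly as ASCII.
        return (spec % (value,)).encode("ascii", "strict")
    return view.tobytes()


print(render_field(bytearray(b"\x00\xff")))   # raw bytes, copied as-is
print(render_field(42, "%05d"))               # formatted, then ASCII-encoded
print(render_field("abc"))                    # ASCII-only text is accepted

try:
    render_field("caf\u00e9")                 # non-ASCII text fails loudly
except UnicodeEncodeError as exc:
    print("rejected:", exc)
```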

<p dir="ltr">><br>

> In my head I make the following classification of situations where you<br>

> work with bytes and/or text.<br>

><br>

> (A) Pure binary formats (e.g. most IP-level packet formats, media<br>

> files, .pyc files, tar/zip files, compressed data, etc.). These are<br>

> handled using the struct module (e.g. tar/zip) and/or custom C<br>

> extensions (e.g. gzip).<br>

><br>

> (B) Encoded text. Here you should just decode everything into str<br>

> objects and parse your text at that level. If you really want to<br>

> manipulate the data as bytes (e.g. because you have a lot of data to<br>

> process and very light processing) you may be able to do it, but<br>

> unless it's a verbatim copy, you are probably going to make<br>

> assumptions about the encoding. You are also probably going to mess up<br>

> for some encodings (e.g. leave BOM turds in the middle of a file).<br>

><br>

> (C) Loosely text-based protocols and formats that have an ASCII<br>

> assumption in the spec. Most classic Internet protocols (FTP, SMTP,<br>

> HTTP, IRC, etc.) fall in this category; I expect there are also plenty<br>

> of file formats using similar conventions (e.g. mailbox files). These<br>

> protocols and formats often require text-ish manipulations, e.g. for<br>

> case-insensitive headers or commands, or to split things at<br>

> whitespace. This is where I find uses for the current ASCII-assuming<br>

> bytes operations (e.g. b.lower(), b.split(), but also int(b)) and<br>

> where the lack of number formatting (especially %d and %x) is most<br>

> painful. I see no benefit in forcing the programmer writing such<br>

> protocol-handling code to use more cumbersome ways of converting<br>

> between numbers and bytes, nor in forcing them to insert an<br>

> encoding/decoding layer -- these protocols often switch between text<br>

> and binary data at line boundaries, so the most basic part of parsing<br>

> (splitting the input into lines) must still happen in the realm of<br>

> bytes.<br>

><br>

> IMO PEP 460 and the mindset that goes with it don't apply to any of<br>

> these three cases.<br>

><br>

> Also, IMO requiring a new type to handle (C) seems to add too<br>

> much complexity, and adds to porting efforts. I may have felt<br>

> differently in the past, but ATM I feel that if newer versions of<br>

> Python 3 make porting of Python 2 code easier, through minor<br>

> compromises, that's a *good* thing. (Example: adding u"..." literals<br>

> to 3.3.)</p>
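<p dir="ltr">For category (C) in the quoted classification, a short sketch of the text-ish operations on bytes may help. The header block is made-up sample data, and the %d line assumes a Python version where bytes support % formatting (3.5 and later).</p>

```python
# Made-up sample of an HTTP-style header block, kept as bytes throughout.
raw = b"Content-Length: 12\r\nContent-Type: text/plain\r\n"

headers = {}
for line in raw.split(b"\r\n"):          # line splitting happens on bytes
    if not line:
        continue
    name, _, value = line.partition(b":")
    # Case-insensitive header names via the ASCII-assuming lower().
    headers[name.strip().lower()] = value.strip()

length = int(headers[b"content-length"])  # int() accepts ASCII digits in bytes
print(length)

# Writing the number back out with %d, followed by a binary payload.
body = b"\x00\xff\x10"
response = b"Content-Length: %d\r\n\r\n" % len(body) + body
print(response)
```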

<p dir="ltr">You've persuaded me well enough that I think my last proposal above goes *further* than your original one in allowing text formatting when interpolating to ASCII compatible formats :)</p>

<p dir="ltr">Cheers,<br>

Nick.</p>

<p dir="ltr">><br>

> --<br>

> --Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>

</p>