<p dir="ltr"><br>
On 14 Jan 2014 04:58, "Guido van Rossum" <<a href="mailto:guido@python.org">guido@python.org</a>> wrote:<br>
><br>
> Let me try rebooting the reboot.<br>
><br>
> My interpretation of Nick's argument is that he is asking for a bytes<br>
> formatting language that doesn't have an implicit ASCII assumption.<br>
><br>
> To me this feels absurd. The formatting codes (%s, %c) themselves are<br>
> expressed as ASCII characters. If you include anything else in the<br>
> format string besides formatting codes (e.g. b'<%s>'), you are giving<br>
> it as ASCII characters. I don't know what characters the EBCDIC codes<br>
> 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but<br>
> it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.</p>
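<p dir="ltr">To make the EBCDIC point concrete, here is a small sketch using Python's bundled cp037 codec as a stand-in for "EBCDIC" (that choice is an assumption; other EBCDIC code pages differ). The helper name ebcdic_view is hypothetical. It shows what the byte values 37, 99 and 115 decode to under that code page, and which byte values the characters '%', 'c' and 's' would actually need.</p>

```python
def ebcdic_view(codec="cp037"):
    """Hypothetical helper comparing ASCII and EBCDIC views of '%', 'c', 's'."""
    # The ASCII codes for '%', 'c' and 's', reinterpreted as EBCDIC text.
    ascii_codes = bytes([37, 99, 115])
    as_ebcdic_text = ascii_codes.decode(codec)
    # The byte values an EBCDIC-encoded format string would really contain.
    ebcdic_codes = "%cs".encode(codec)
    return as_ebcdic_text, ebcdic_codes


text, codes = ebcdic_view()
print(repr(text), list(codes))
```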
<p dir="ltr">Except we allow string escapes and programmatic creation of format strings, so while ASCII snippets in formatting code are certainly easier to type, they are by no means a mandatory feature of using interpolation operations. I agree </p>
<p dir="ltr">Can you roll your own binary interpolation support with join() and simple concatenation? Yes, but Antoine's proposal provides a clean and reliable approach to flexible binary templating that isn't offered by the more lenient version.</p>
<p dir="ltr">My problem is with telling Python users that if they're working with ASCII compatible data, they get access to a clean interpolation mini-language for templating purposes, but if they aren't, they don't.</p>
<p dir="ltr">That's the part I see as potentially breaking the text model: now you have a convenient API on a core type encouraging you to treat your data as ASCII compatible with implicit serialisation of semantic data as ASCII text, even if that may not be appropriate.</p>
<p dir="ltr">If pure binary interpolation is added at the same time (regardless of the exact spelling, so long as it's as easy to access as the ASCII templating), that objection goes away.</p>
<p dir="ltr">That said, the fact that the interpolation mini-languages themselves assume ASCII is the most compelling rationale I have heard so far for treating interpolation as an operation that inherently assumes ASCII compatibility - you can't use arbitrary bytes in your formatting strings without escaping the formatting characters appropriately. While I don't see that as substantially different to needing to escape them in order to retain them in the output of text or ASCII formatting, it's at least a teachable rationale for the absence of a pure binary equivalent.</p>
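<p dir="ltr">As a rough illustration of that escaping requirement, the sketch below builds a format string programmatically from arbitrary bytes. It assumes Python 3.5 or later, where % formatting on bytes is available; escape_percent is a hypothetical helper, not an existing API.</p>

```python
def escape_percent(raw):
    """Hypothetical helper: make arbitrary bytes safe to embed in a format string."""
    # A literal 0x25 byte must be doubled so it is not read as a format code.
    return raw.replace(b"%", b"%%")


# Arbitrary binary data that happens to contain the bytes for "%s".
payload = bytes([0x25, 0x73, 0xFF])
template = escape_percent(payload) + b"|%s"
result = template % (b"field",)
```

<p dir="ltr">Without the escaping step, the 0x25 0x73 pair in the payload would be treated as a %s code and consume an argument.</p>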
<p dir="ltr">> If I had some byte strings in an unknown encoding (but the same<br>
> encoding for all) that I needed to concatenate I would never think of<br>
> '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)<br>
><br>
> If I see some code using *any* formatting operation (regardless of<br>
> whether it's %d, %r, %s or %c) I am going to assume that there is some<br>
> ASCII-ness, and if there isn't, the code's author has obscured their<br>
> goal to me.</p>
<p dir="ltr">Right, that's a rationale I can explain to people. It also occurred to me that it's easier to build pure binary interpolation on top of ASCII interpolation than I previously thought: I can just check all the input values are compatible with memoryview. At that point, attempting to pass in anything that would trigger implicit encoding at the formatting stage will fail.</p>
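<p dir="ltr">A minimal sketch of that idea, assuming Python 3.5 or later so that % works on bytes. The function name binary_interpolate is hypothetical; the only check it performs is whether each value can be wrapped in a memoryview.</p>

```python
def binary_interpolate(template, *values):
    """Hypothetical helper: interpolate only values that export a buffer."""
    checked = []
    for value in values:
        try:
            # memoryview() raises TypeError for str, int and other
            # objects that do not support the buffer protocol.
            with memoryview(value) as view:
                checked.append(view.tobytes())
        except TypeError:
            raise TypeError(
                "pure binary interpolation requires buffer exporters, "
                "got %r" % type(value).__name__
            ) from None
    return template % tuple(checked)
```

<p dir="ltr">For example, binary_interpolate(b'%s:%s', b'\xff\x00', bytearray(b'ab')) succeeds, while passing a str or an int fails before any formatting happens.</p>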
<p dir="ltr">(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the bytes(int) trap, since memoryview rejects integers.)</p>
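<p dir="ltr">The trap in question, sketched briefly: bytes(n) with an integer n produces n zero bytes rather than an error. The helper name to_bytes_strict is hypothetical.</p>

```python
def to_bytes_strict(obj):
    """Hypothetical helper: convert buffer exporters only, never integers."""
    # memoryview(obj) raises TypeError unless obj supports the buffer protocol.
    return bytes(memoryview(obj))


zero_filled = bytes(3)                          # three NUL bytes, not an error
copied = to_bytes_strict(bytearray(b"abc"))     # a real copy of the data
```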
<p dir="ltr">> I hear the objections against b'%s' % 'x' returning b"'x'" loud and<br>
> clear, and if the noise about that sub-issue is preventing folks from<br>
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.<br>
> use %b which would require its argument to be bytes. Those bytes<br>
> should still probably be ASCII-ish, but there's no way to test that.<br>
> That's fine with me and should be fine to Nick as well -- PEP 460<br>
> doesn't check that your encodings match (how could it? :-), nor does<br>
> plain string concatenation using +.</p>
<p dir="ltr">Plus there genuinely are formats where different parts have different encodings and you rely on metadata or format definitions to know what they are.</p>
<p dir="ltr">I would actually suggest something like Brett's approach for %s, but with memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate it directly, otherwise invoke normal string formatting and then do strict ASCII encoding at the end.</p>
<p dir="ltr">That way people don't have to learn new formatting mini-languages and only have two new behaviours to learn: buffer exporters are interpolated directly, anything else is formatted normally and then implicitly encoded as strict ASCII.<br>
</p>
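<p dir="ltr">The two behaviours just described can be sketched per value as follows. This only illustrates the suggestion in this message and is not a description of any shipped implementation; interpolate_s is a hypothetical name, and str() stands in for "normal string formatting".</p>

```python
def interpolate_s(value):
    """Hypothetical per-value rule for %s under the suggestion above."""
    try:
        # Behaviour 1: buffer exporters are interpolated directly.
        with memoryview(value) as view:
            return view.tobytes()
    except TypeError:
        # Behaviour 2: format as text, then encode as strict ASCII.
        return str(value).encode("ascii", "strict")
```

<p dir="ltr">Under this sketch, interpolate_s(42) gives b'42' and interpolate_s(b'\xff') passes the byte through untouched, while non-ASCII text raises UnicodeEncodeError instead of being silently encoded.</p>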
<p dir="ltr">><br>
> In my head I make the following classification of situations where you<br>
> work with bytes and/or text.<br>
><br>
> (A) Pure binary formats (e.g. most IP-level packet formats, media<br>
> files, .pyc files, tar/zip files, compressed data, etc.). These are<br>
> handled using the struct module (e.g. tar/zip) and/or custom C<br>
> extensions (e.g. gzip).<br>
><br>
> (B) Encoded text. Here you should just decode everything into str<br>
> objects and parse your text at that level. If you really want to<br>
> manipulate the data as bytes (e.g. because you have a lot of data to<br>
> process and very light processing) you may be able to do it, but<br>
> unless it's a verbatim copy, you are probably going to make<br>
> assumptions about the encoding. You are also probably going to mess up<br>
> for some encodings (e.g. leave BOM turds in the middle of a file).<br>
><br>
> (C) Loosely text-based protocols and formats that have an ASCII<br>
> assumption in the spec. Most classic Internet protocols (FTP, SMTP,<br>
> HTTP, IRC, etc.) fall in this category; I expect there are also plenty<br>
> of file formats using similar conventions (e.g. mailbox files). These<br>
> protocols and formats often require text-ish manipulations, e.g. for<br>
> case-insensitive headers or commands, or to split things at<br>
> whitespace. This is where I find uses for the current ASCII-assuming<br>
> bytes operations (e.g. b.lower(), b.split(), but also int(b)) and<br>
> where the lack of number formatting (especially %d and %x) is most<br>
> painful. I see no benefit in forcing the programmer writing such<br>
> protocol handling code to use more cumbersome ways of converting<br>
> between numbers and bytes, nor in forcing them to insert an<br>
> encoding/decoding layer -- these protocols often switch between text<br>
> and binary data at line boundaries, so the most basic part of parsing<br>
> (splitting the input into lines) must still happen in the realm of<br>
> bytes.<br>
><br>
> IMO PEP 460 and the mindset that goes with it don't apply to any of<br>
> these three cases.<br>
><br>
> Also, IMO requiring a new type to handle (C) seems to add too<br>
> much complexity, and adds to porting efforts. I may have felt<br>
> differently in the past, but ATM I feel that if newer versions of<br>
> Python 3 make porting of Python 2 code easier, through minor<br>
> compromises, that's a *good* thing. (Example: adding u"..." literals<br>
> to 3.3.)</p>
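<p dir="ltr">As a rough illustration of the category (C) case described above, the sketch below handles an HTTP-style header entirely in bytes, using the ASCII-assuming operations mentioned (split(), lower(), int()) plus %d formatting. It assumes Python 3.5 or later for the %d part; both function names are hypothetical.</p>

```python
def parse_content_length(raw_headers):
    """Hypothetical parser: find Content-Length without decoding to str."""
    for line in raw_headers.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        # Header names are case-insensitive, so compare in lower case.
        if sep and name.strip().lower() == b"content-length":
            return int(value.strip())
    return None


def build_content_length(length):
    """Hypothetical builder: number formatting directly into bytes."""
    return b"Content-Length: %d\r\n" % length
```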
<p dir="ltr">You've persuaded me well enough that I think my last proposal above goes *further* than your original one in allowing text formatting when interpolating to ASCII compatible formats :)</p>
<p dir="ltr">Cheers,<br>
Nick.</p>
<p dir="ltr">><br>
> --<br>
> --Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>
</p>