<p dir="ltr"><br>

On 16 Jan 2014 11:45, "Carl Meyer" <<a href="mailto:carl@oddbird.net">carl@oddbird.net</a>> wrote:<br>

><br>

> Hi Ethan,<br>

><br>

> I haven't chimed into this discussion, but the direction it's headed<br>

> recently seems right to me. Thanks for putting together a PEP. Some<br>

> comments on it:<br>

><br>

> On 01/15/2014 05:13 PM, Ethan Furman wrote:<br>

> > ============================<br>

> > Abstract<br>

> > ========<br>

> ><br>

> > This PEP proposes adding the % and {} formatting operations from str to<br>

> > bytes [1].<br>

><br>

> I think the PEP could really use a rationale section summarizing _why_<br>

> these formatting operations are being added to bytes; namely that they<br>

> are useful when working with various ASCIIish-but-not-properly-text<br>

> network protocols and file formats, and in particular when porting code<br>

> dealing with such formats/protocols from Python 2.<br>

><br>

> Also I think it would be useful to have a section summarizing the<br>

> primary objections that have been raised, and why those objections have<br>

> been overruled (presuming the PEP is accepted). For instance: the main<br>

> objection, AIUI, has been that the bytes type is for pure bytes-handling<br>

> with no assumptions about encoding, and thus we should not add features<br>

> to it that assume ASCIIness, and that may be attractive nuisances for<br>

> people writing bytes-handling code that should not assume ASCIIness but<br>

> will once they use the feature.</p>

<p dir="ltr">Close, but not quite - the concern was that this feature didn't *inherently* imply a restriction to ASCII-compatible data, but only did so when the numeric formatting codes were used. That made it a source of value-dependent compatibility errors based on the format string, akin to the value-dependent errors seen when implicitly encoding arbitrary text as ASCII.</p>
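<p dir="ltr">As a minimal sketch of the kind of value-dependent error meant here, the snippet below uses an explicit ASCII encode to stand in for an implicit one. The helper name to_wire is hypothetical and exists only for this illustration; the point is that the same call succeeds or fails depending purely on the data passed in.</p>

```python
def to_wire(text):
    """Hypothetical helper: encode text as ASCII, as an implicit conversion would."""
    return text.encode('ascii')

# Same code path, same argument type: only the value decides the outcome.
ok = to_wire('hello')            # pure ASCII text encodes cleanly

try:
    to_wire('h\u00e9llo')        # a single non-ASCII character triggers the failure
    failed = False
except UnicodeEncodeError:
    failed = True
```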


<p dir="ltr">Guido's successful counter was to point out that parsing the format string itself assumes ASCII-compatible data, which places at least the mod-formatting operation in the same category as the existing operations that are only valid for sufficiently ASCII-compatible data.</p>
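<p dir="ltr">For illustration only (this is not taken from the thread), here is a short sketch of a few existing bytes methods in that category. They treat the data as ASCII-compatible and silently leave any other bytes alone, rather than raising an error.</p>

```python
# Existing bytes methods that only make sense for ASCII-compatible data.
ascii_data = b'content-length: 42'
upper = ascii_data.upper()       # only bytes in the ASCII a-z range are changed
title = ascii_data.title()       # word boundaries are found using ASCII letters
digits = b'42'.isdigit()         # checks for the ASCII digits 0-9

# Bytes outside the ASCII range pass through untouched, with no error.
latin1 = 'caf\u00e9'.encode('latin-1')   # the last byte is 0xE9
latin1_upper = latin1.upper()            # 0xE9 is left as-is, not uppercased
```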


<p dir="ltr">Current discussions suggest to me that the argument against implicit encoding operations that introduce latent data-driven defects may still apply to bytes.format, though, so I've reverted to being -1 on that.</p>


<p dir="ltr">Cheers,<br>

Nick.<br></p>

<p dir="ltr">> And the refutation: that the bytes type<br>

> already provides some operations that assume ASCIIness, and these new<br>

> formatting features are no more of an attractive nuisance than those;<br>

> since the syntax of the formatting mini-languages itself<br>

> assumes ASCIIness, there is not likely to be any temptation to use it<br>

> with binary data that cannot.<br>

><br>

> Although it can be hard to arrive at accurate and agreed-on summaries of<br>

> the discussion, recording such summaries in the PEP is important; it may<br>

> help save our future selves and colleagues from having to revisit all<br>

> these same discussions and megathreads.<br>

><br>

> > Overriding Principles<br>

> > =====================<br>

> ><br>

> > In order to avoid the problems of auto-conversion and value-generated<br>

> > exceptions,<br>

> > all object checking will be done via isinstance, not by values contained<br>

> > in a<br>

> > Unicode representation.  In other words::<br>

> ><br>

> >   - duck-typing to allow/reject entry into a byte-stream<br>

> >   - no value generated errors<br>

><br>

> This seems self-contradictory; "isinstance" is type-checking, which is<br>

> the opposite of duck-typing. A duck-typing implementation would not use<br>

> isinstance, it would call / check for the existence of a certain magic<br>

> method instead.<br>

><br>

> I think it might also be good to expand (very) slightly on what "the<br>

> problems of auto-conversion and value-generated exceptions" are; that<br>

> is, that the benefit of Python 3's model is that encoding is explicit,<br>

> not implicit, making it harder to unwittingly write code that works as<br>

> long as all data is ASCII, but fails as soon as someone feeds in<br>

> non-ASCII text data.<br>

><br>

> Not everyone who reads this PEP will be steeped in years of discussion<br>

> about the relative merits of the Python 2 vs 3 models; it doesn't hurt<br>

> to spell out a few assumptions.<br>

><br>

><br>

> > Proposed semantics for bytes formatting<br>

> > =======================================<br>

> ><br>

> > %-interpolation<br>

> > ---------------<br>

> ><br>

> > All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)<br>

> > will be supported, and will work as they do for str, including the<br>

> > padding, justification and other related modifiers, except locale.<br>

> ><br>

> > Example::<br>

> ><br>

> >    >>> b'%4x' % 10<br>

> >    b'   a'<br>

> ><br>

> > %c will insert a single byte, either from an int in range(256), or from<br>

> > a bytes argument of length 1.<br>

> ><br>

> > Example:<br>

> ><br>

> >     >>> b'%c' % 48<br>

> >     b'0'<br>

> ><br>

> >     >>> b'%c' % b'a'<br>

> >     b'a'<br>

> ><br>

> > %s is restricted in what it will accept::<br>

> ><br>

> >   - input type supports Py_buffer?<br>

> >     use it to collect the necessary bytes<br>

> ><br>

> >   - input type is something else?<br>

> >     use its __bytes__ method; if there isn't one, raise an exception [2]<br>

> ><br>

> > Examples:<br>

> ><br>

> >     >>> b'%s' % b'abc'<br>

> >     b'abc'<br>

> ><br>

> >     >>> b'%s' % 3.14<br>

> >     Traceback (most recent call last):<br>

> >     ...<br>

> >     TypeError: 3.14 has no __bytes__ method<br>

> ><br>

> >     >>> b'%s' % 'hello world!'<br>

> >     Traceback (most recent call last):<br>

> >     ...<br>

> >     TypeError: 'hello world!' has no __bytes__ method, perhaps you need<br>

> > to encode it?<br>

> ><br>

> > .. note::<br>

> ><br>

> >    Because the str type does not have a __bytes__ method, attempts to<br>

> >    directly use 'a string' as a bytes interpolation value will raise an<br>

> >    exception.  To use 'string' values, they must be encoded or otherwise<br>

> >    transformed into a bytes sequence::<br>

> ><br>

> >       'a string'.encode('latin-1')<br>

> ><br>

> > format<br>

> > ------<br>

> ><br>

> > The format mini language codes, where they correspond with the<br>

> > %-interpolation codes,<br>

> > will be used as-is, with three exceptions::<br>

> ><br>

> >   - !s is not supported, as {} can mean the default for both str and<br>

> > bytes, in both<br>

> >     Py2 and Py3.<br>

> >   - !b is supported, and new Py3k code can use it to be explicit.<br>

> >   - no other __format__ method will be called.<br>

> ><br>

> > Numeric Format Codes<br>

> > --------------------<br>

> ><br>

> > To properly handle int and float subclasses, int(), index(), and float()<br>

> > will be called on the<br>

> > objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).<br>

> ><br>

> > Unsupported codes<br>

> > -----------------<br>

> ><br>

> > %r (which calls __repr__), and %a (which calls ascii() on __repr__) are<br>

> > not supported.<br>

> ><br>

> > !r and !a are not supported.<br>

> ><br>

> > The n integer and float format code is not supported.<br>

> ><br>

> ><br>

> > Open Questions<br>

> > ==============<br>

> ><br>

> > Currently non-numeric objects go through::<br>

> ><br>

> >   - Py_buffer<br>

> >   - __bytes__<br>

> >   - failure<br>

> ><br>

> > Do we want to add a __format_bytes__ method in there?<br>

> ><br>

> >   - Guaranteed to produce only ascii (as in b'10', not b'\x0a')<br>

> >   - Makes more sense than using __bytes__ to produce ascii output<br>

> >   - What if an object has both __bytes__ and __format_bytes__?<br>

> ><br>

> > Do we need to support all the numeric format codes?  The floating point<br>

> > exponential formats seem less appropriate, for example.<br>

> ><br>

> ><br>

> > Proposed variations<br>

> > ===================<br>

> ><br>

> > It was suggested to let %s accept numbers, but since numbers have their own<br>

> > format codes this idea was discarded.<br>

> ><br>

> > It has been suggested to use %b for bytes instead of %s.<br>

> ><br>

> >   - Rejected as %b does not exist in Python 2.x %-interpolation, which is<br>

> >     why we are using %s.<br>

> ><br>

> > It has been proposed to automatically use .encode('ascii','strict') for str<br>

> > arguments to %s.<br>

> ><br>

> >   - Rejected as this would lead to intermittent failures.  Better to<br>

> > have the<br>

> >     operation always fail so the trouble-spot can be correctly fixed.<br>

> ><br>

> > It has been proposed to have %s return the ascii-encoded repr when the<br>

> > value<br>

> > is a str  (b'%s' % 'abc'  --> b"'abc'").<br>

> ><br>

> >   - Rejected as this would lead to hard to debug failures far from the<br>

> > problem<br>

> >     site.  Better to have the operation always fail so the trouble-spot<br>

> > can be<br>

> >     easily fixed.<br>

> ><br>

> ><br>

> > Footnotes<br>

> > =========<br>

> ><br>

> > .. [1] string.Template is not under consideration.<br>

> > .. [2] TypeError, ValueError, or UnicodeEncodeError?<br>

><br>

> TypeError seems right to me. Definitely not UnicodeEncodeError - refusal<br>

> to implicitly encode is not at all the same thing as an encoding error.<br>

><br>

> Carl<br>


</p>