PEP 461 - Adding % and {} formatting to bytes
This PEP goes a bit further than PEP 460 does, and hopefully spells things out in enough detail so there is no confusion as to what is meant. -- ~Ethan~
Duh. Here's the text, as well. ;)

PEP: 461
Title: Adding % and {} formatting to bytes
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-01-13
Python-Version: 3.5
Post-History: 2014-01-13
Resolution:


Abstract
========

This PEP proposes adding the % and {} formatting operations from str to bytes.


Proposed semantics for bytes formatting
=======================================

%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers.

Example::

    >>> b'%4x' % 10
    b'   a'

%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.

Example:

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

%s, because it is the most general, has the most convoluted resolution:

- input type is bytes? pass it straight through

- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)

- input type is something else? use its __bytes__ method; if there isn't one, raise an exception [3]

Examples:

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 3.14
    b'3.14'

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?

.. note::

    Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::

        'a string'.encode('latin-1')

format
------

The format mini language will be used as-is, with the behaviors as listed for %-interpolation.


Open Questions
==============

For %s there has been some discussion of trying to use the buffer protocol (Py_buffer) before trying __bytes__. This question should be answered before the PEP is implemented.


Proposed variations
===================

It has been suggested to use %b for bytes instead of %s.

- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.

It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.

- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").

- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.


Foot notes
==========

.. [1] Not sure if this should be the numeric __str__ or the numeric __repr__, or if there's any difference
.. [2] Any proper numeric class would then have to provide an ascii representation of its value, either via __repr__ or __str__ (whichever we choose in [1]).
.. [3] TypeError, ValueError, or UnicodeEncodeError?


Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
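For readers who want to try the draft's examples, here is a small sketch of how bytes %-interpolation behaves in CPython 3.5 and later. It is not part of the draft itself. The numeric %s case (b'%s' % 3.14) is left out on purpose, since that rule is dropped later in this thread. The names str_error and encoded are only illustrative.

```python
# Numeric codes work as they do for str, including width and padding.
assert b'%4x' % 10 == b'   a'
assert b'%05.1f' % 3.14159 == b'003.1'

# %c accepts an int in range(256) or a bytes object of length 1.
assert b'%c' % 48 == b'0'
assert b'%c' % b'a' == b'a'

# %s passes bytes straight through.
assert b'%s' % b'abc' == b'abc'

# A str argument is rejected; it has to be encoded explicitly first.
try:
    b'%s' % 'hello world!'
except TypeError as exc:
    str_error = exc
else:
    str_error = None

# Explicit encoding, as the note in the draft recommends.
encoded = b'%s' % 'hello world!'.encode('latin-1')
```

The explicit .encode() call is the point of the note in the draft: the caller, not the formatting operation, picks the encoding.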
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"? Regards Antoine.
On 01/14/2014 12:57 PM, Antoine Pitrou wrote:
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
Meaning any bytes or bytes-subtype will support the Py_buffer protocol, and this should be the first thing we try? Sounds good. For that matter, should the first test be "does this object support Py_buffer" and not worry about it being isinstance(obj, bytes)?
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question. Obviously we have int, float, and complex. We also have Decimal. But what about Fraction? Or some user's numeric class that doesn't inherit from a core numeric type? Wherever we draw the line, we need to make sure it's well-documented. -- ~Ethan~
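The "does this object support Py_buffer" test discussed above can be approximated from pure Python, because memoryview() only accepts buffer exporters. This is a minimal sketch; the helper name exports_buffer is an assumption made for illustration, and a C implementation would query the buffer protocol directly instead.

```python
import array


def exports_buffer(obj):
    """Return True if obj supports the buffer protocol."""
    try:
        # memoryview() raises TypeError for objects that export no buffer.
        memoryview(obj).release()
    except TypeError:
        return False
    return True


results = {
    'bytes': exports_buffer(b'abc'),
    'bytearray': exports_buffer(bytearray(b'abc')),
    'array': exports_buffer(array.array('B', [1, 2, 3])),
    'str': exports_buffer('abc'),
    'int': exports_buffer(42),
}
```

Testing for the protocol instead of isinstance(obj, bytes) means bytearray, memoryview and array.array are accepted without special cases.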
On Tue, 14 Jan 2014 13:07:57 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
Meaning any bytes or bytes-subtype will support the Py_buffer protocol, and this should be the first thing we try?
Sounds good.
For that matter, should the first test be "does this object support Py_buffer" and not worry about it being isinstance(obj, bytes)?
Yes, unless the implementation wants to micro-optimize stuff.
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question.
Obviously we have int, float, and complex. We also have Decimal.
The question is also how do you test for them? Decimal is not a core builtin type. Do we need some kind of __bformat__ protocol? Regards Antoine.
On January 14, 2014 at 4:36:00 PM, Ethan Furman (ethan@stoneleaf.us) wrote:
On 01/14/2014 12:57 PM, Antoine Pitrou wrote:
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman wrote:
%s, because it is the most general, has the most convoluted
resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
Meaning any bytes or bytes-subtype will support the Py_buffer protocol, and this should be the first thing we try?
Sounds good.
For that matter, should the first test be "does this object support Py_buffer" and not worry about it being isinstance(obj, bytes)?
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question.
isinstance(o, numbers.Number) ? Yury
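A quick sketch of what the numbers.Number check would and would not catch. The class Meters is hypothetical and stands in for "some user's numeric class that doesn't inherit from a core numeric type".

```python
import numbers
from decimal import Decimal
from fractions import Fraction


class Meters:
    """Hypothetical user-defined numeric class outside the numbers hierarchy."""

    def __init__(self, value):
        self.value = value


samples = [1, 1.5, 1 + 2j, Decimal('1.5'), Fraction(3, 2), True, Meters(3), '1']
is_number = [isinstance(s, numbers.Number) for s in samples]

# A class can opt in after the fact by registering with the ABC.
before = isinstance(Meters(3), numbers.Number)
numbers.Number.register(Meters)
after = isinstance(Meters(3), numbers.Number)
```

The check covers the builtin types, Decimal and Fraction, but a third-party class is only recognized if it inherits from or registers with the numbers ABCs.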
On 15 Jan 2014 07:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/14/2014 12:57 PM, Antoine Pitrou wrote:
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
Meaning any bytes or bytes-subtype will support the Py_buffer protocol,
and this should be the first thing we try?
Sounds good.
For that matter, should the first test be "does this object support
Py_buffer" and not worry about it being isinstance(obj, bytes)? Yep. I actually suggest adjusting the %s handling to: - interpolate Py_buffer exporters directly - interpolate __bytes__ if defined - reject anything with an "encode" method - otherwise interpolate str(obj).encode("ascii")
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question.
As suggested above, I would flip the question and explicitly *disallow* implicit encoding of any object with its own "encode" method, while allowing everything else. Cheers, Nick.
Obviously we have int, float, and complex. We also have Decimal.
But what about Fraction? Or some user's numeric class that doesn't
inherit from a core numeric type? Wherever we draw the line, we need to make sure it's well-documented.
-- ~Ethan~
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
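The four-step %s handling suggested in the message above can be sketched in pure Python as follows. The function name interpolate_s and the class Packet are hypothetical, and the buffer check via memoryview() is an assumption about how "Py_buffer exporter" would be tested from Python code.

```python
def interpolate_s(obj):
    """Hypothetical %s resolution following the four suggested steps."""
    # 1. Buffer exporters are interpolated directly.
    try:
        view = memoryview(obj)
    except TypeError:
        pass
    else:
        with view:
            return view.tobytes()
    # 2. Objects that define __bytes__ are converted with it.
    if hasattr(type(obj), '__bytes__'):
        return bytes(obj)
    # 3. Anything with an "encode" method (such as str) is rejected.
    if hasattr(obj, 'encode'):
        raise TypeError('%r must be encoded explicitly' % type(obj).__name__)
    # 4. Everything else goes through str() and strict ASCII encoding.
    return str(obj).encode('ascii')


class Packet:
    """Hypothetical object that defines __bytes__."""

    def __bytes__(self):
        return b'\x01\x02'


from_buffer = interpolate_s(bytearray(b'xy'))
from_dunder = interpolate_s(Packet())
from_float = interpolate_s(3.14)
from_list = interpolate_s([1, 2])
```

Note that step 4 accepts any object at all: the list above ends up in the result as b'[1, 2]'.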
On 01/14/2014 02:17 PM, Nick Coghlan wrote:
On 15 Jan 2014 07:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/14/2014 12:57 PM, Antoine Pitrou wrote:
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
Meaning any bytes or bytes-subtype will support the Py_buffer protocol, and this should be the first thing we try?
Sounds good.
For that matter, should the first test be "does this object support Py_buffer" and not worry about it being isinstance(obj, bytes)?
Yep. I actually suggest adjusting the %s handling to:
- interpolate Py_buffer exporters directly - interpolate __bytes__ if defined - reject anything with an "encode" method - otherwise interpolate str(obj).encode("ascii")
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question.
As suggested above, I would flip the question and explicitly *disallow* implicit encoding of any object with its own "encode" method, while allowing everything else.
Um, int and floats (for example) don't have an .encode method, don't export Py_buffer, don't have a __bytes__ method... Ah! so it would hit the last case, I see. The danger I see with that route is that any ol' object could then make it into the byte stream, and considering what byte streams are for I think we should make the barrier for entry higher than just relying on a __str__ or __repr__. -- ~Ethan~
On 15 Jan 2014 08:23, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/14/2014 02:17 PM, Nick Coghlan wrote:
On 15 Jan 2014 07:36, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/14/2014 12:57 PM, Antoine Pitrou wrote:
On Tue, 14 Jan 2014 11:56:25 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted
resolution:
- input type is bytes? pass it straight through
It should try to get a Py_buffer instead.
Meaning any bytes or bytes-subtype will support the Py_buffer protocol, and this should be the first thing we try?
Sounds good.
For that matter, should the first test be "does this object support Py_buffer" and not worry about it being isinstance(obj, bytes)?
Yep. I actually suggest adjusting the %s handling to:
- interpolate Py_buffer exporters directly - interpolate __bytes__ if defined - reject anything with an "encode" method - otherwise interpolate str(obj).encode("ascii")
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
What is the definition of "numeric"?
That is a key question.
As suggested above, I would flip the question and explicitly *disallow* implicit encoding of any object with its own "encode" method, while allowing everything else.
Um, int and floats (for example) don't have an .encode method, don't export Py_buffer, don't have a __bytes__ method... Ah! so it would hit the last case, I see.
The danger I see with that route is that any ol' object could then make it into the byte stream, and considering what byte streams are for I think we should make the barrier for entry higher than just relying on a __str__ or __repr__.
Yeah, reading the other thread pointed out the issues with this idea (containers in particular are a problem). I think Brett has the right idea: we shouldn't try to accept numbers for %s in binary interpolation. If we limit it to just buffer exporters and objects with a __bytes__ method then the problem goes away. The numeric codes all exist in Python 2, so the porting requirement to the common 2/3 subset will be to update the cases of binary interpolation of a number with %s to use an appropriate numeric formatting code instead. Cheers, Nick.
-- ~Ethan~
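The restriction agreed on above (only buffer exporters and objects with __bytes__ for %s, numeric codes for numbers) matches how CPython 3.5 and later behave, so the porting step can be tried directly. The class Header and the variable names are only illustrative.

```python
class Header:
    """Hypothetical object that opts in to bytes interpolation via __bytes__."""

    def __bytes__(self):
        return b'X-Test: 1'


# Buffer exporters and objects with __bytes__ are accepted by %s.
accepted = b'%s|%s' % (bytearray(b'abc'), Header())

# A number is rejected by %s ...
try:
    b'%s' % 42
except TypeError:
    number_rejected = True
else:
    number_rejected = False

# ... so code ported to the common 2/3 subset uses a numeric code instead.
ported = b'Content-Length: %d' % 42
```

The same %d spelling also works on Python 2 str, which is what makes it usable in code that targets both versions.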
On 1/14/2014 2:38 PM, Nick Coghlan wrote:
I think Brett has the right idea: we shouldn't try to accept numbers for %s in binary interpolation. If we limit it to just buffer exporters and objects with a __bytes__ method then the problem goes away.
The numeric codes all exist in Python 2, so the porting requirement to the common 2/3 subset will be to update the cases of binary interpolation of a number with %s to use an appropriate numeric formatting code instead.
+1
On 01/14/2014 05:02 PM, Glenn Linderman wrote:
On 1/14/2014 2:38 PM, Nick Coghlan wrote:
I think Brett has the right idea: we shouldn't try to accept numbers for %s in binary interpolation. If we limit it to just buffer exporters and objects with a __bytes__ method then the problem goes away.
The numeric codes all exist in Python 2, so the porting requirement to the common 2/3 subset will be to update the cases of binary interpolation of a number with %s to use an appropriate numeric formatting code instead.
+1
Agreed, PEP updated. -- ~Ethan~
bytes.format() below. I'll leave it to you to decide if they warrant using, leaving as an open question, or rejecting. On Tue, Jan 14, 2014 at 2:56 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Duh. Here's the text, as well. ;)
PEP: 461 Title: Adding % and {} formatting to bytes Version: $Revision$ Last-Modified: $Date$ Author: Ethan Furman <ethan@stoneleaf.us> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 2014-01-13 Python-Version: 3.5 Post-History: 2014-01-13 Resolution:
Abstract ========
This PEP proposes adding the % and {} formatting operations from str to bytes.
Proposed semantics for bytes formatting =======================================
%-interpolation ---------------
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers.
Example::
b'%4x' % 10 b' a'
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.
Example:
>>> b'%c' % 48 b'0'
>>> b'%c' % b'a' b'a'
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
- input type is something else? use its __bytes__ method; if there isn't one, raise an exception [3]
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 b'3.14'
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
'a string'.encode('latin-1')
format ------
The format mini language will be used as-is, with the behaviors as listed for %-interpolation.
That's too vague; % interpolation does not support other format operators in the same way as str.format() does. % interpolation has specific code to support %d, etc. But str.format() gets supported for {:d} not from special code but because e.g. float.__format__('d') works. So you can't say "bytes.format() supports {:d} just like %d works with string interpolation" since the mechanisms are fundamentally different.

This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.encode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free. It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.

I also think that a 'b' conversion be added to bytes.format(). This doesn't have the same issue as %b if you make {} implicitly mean {!b} in Python 3.5 as {} will mean what is the most accurate for bytes.format() in either version. It also allows for explicit support where you know you only want a byte and allows {!s} to mean you only want a string (and thus throw an error otherwise).

And all of this means that much like %s only taking bytes, the only way for bytes.format() to accept a non-byte argument is for some format spec to be specified to trigger the .encode('ascii', 'strict') call.

-Brett
Open Questions ==============
For %s there has been some discussion of trying to use the buffer protocol (Py_buffer) before trying __bytes__. This question should be answered before the PEP is implemented.
Proposed variations ===================
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Foot notes ==========
.. [1] Not sure if this should be the numeric __str__ or the numeric __repr__, or if there's any difference .. [2] Any proper numeric class would then have to provide an ascii representation of its value, either via __repr__ or __str__ (whichever we choose in [1]). .. [3] TypeError, ValueError, or UnicodeEncodeError?
Copyright =========
This document has been placed in the public domain.
.. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End:
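The rule proposed in the message above ("a format spec triggers __format__() followed by a strict ASCII encode") can be sketched in a few lines. The helper name bformat_field is an assumption; no such function exists in the standard library, and bytes.format() itself was never added.

```python
def bformat_field(value, spec):
    """Hypothetical: format value with spec, then strictly ASCII-encode it."""
    # format() dispatches to type(value).__format__(value, spec).
    return format(value, spec).encode('ascii', 'strict')


hex_field = bformat_field(10, '4x')
float_field = bformat_field(3.14159, '.2f')
text_field = bformat_field('abc', 's')

# Non-ASCII output is an error instead of silently producing bytes.
try:
    bformat_field('caf\u00e9', 's')
except UnicodeEncodeError:
    non_ascii_rejected = True
else:
    non_ascii_rejected = False
```

Under this rule the failure depends on the value being formatted, not only on its type.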
On 1/15/2014 9:45 AM, Brett Cannon wrote:
That's too vague; % interpolation does not support other format operators in the same way as str.format() does. % interpolation has specific code to support %d, etc. But str.format() gets supported for {:d} not from special code but because e.g. float.__format__('d') works. So you can't say "bytes.format() supports {:d} just like %d works with string interpolation" since the mechanisms are fundamentally different.
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.encode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free. It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
I also think that a 'b' conversion be added to bytes.format(). This doesn't have the same issue as %b if you make {} implicitly mean {!b} in Python 3.5 as {} will mean what is the most accurate for bytes.format() in either version. It also allows for explicit support where you know you only want a byte and allows {!s} to mean you only want a string (and thus throw an error otherwise).
And all of this means that much like %s only taking bytes, the only way for bytes.format() to accept a non-byte argument is for some format spec to be specified to trigger the .encode('ascii', 'strict') call.
Agreed. With %-formatting, you can start with the format strings and then decide what you want to do with the passed in objects. But with .format, it's the other way around: you have to look at the passed in objects being formatted, and then decide what the format specifier means to that type.

So, for .format, you could say "hey, that object's an int, and I happen to know how to format ints, outside of calling its .__format__". Or you could even call its __format__ because you know that it will only be ASCII. But to take this approach, you're going to have to hard-code the types. And subclasses are probably out, since there you don't know what the subclass's __format__ will return. It could be non-ASCII.
    >>> class Int(int):
    ...     def __format__(self, fmt):
    ...         return u'foo'
    ...
    >>> '{}'.format(Int(3))
    'foo'
So basically I think we'll have to hard-code the types that .format() will support, and never call __format__, or only call __format__ if we know that it's an exact type where we know that __format__ will return (strict ASCII). Either that, or we're back to encoding the result of __format__ and accepting that sometimes it might throw errors, depending on the values being passed into format(). Eric.
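A minimal sketch of the "hard-code the exact types" option described above. The helper name bformat_exact and the SAFE_TYPES tuple are assumptions made for illustration; the Int subclass mirrors the example in the message, but returns non-ASCII text to make the problem visible.

```python
SAFE_TYPES = (int, float)  # assumption: exact types known to format as ASCII


def bformat_exact(value, spec=''):
    """Hypothetical: call __format__ only for exact, trusted types."""
    # type(...) is used instead of isinstance() so subclasses are excluded.
    if type(value) not in SAFE_TYPES:
        raise TypeError('unsupported type: %s' % type(value).__name__)
    return value.__format__(spec).encode('ascii')


class Int(int):
    def __format__(self, fmt):
        return 'f\u00f6\u00f6'  # non-ASCII output from a subclass


plain = bformat_exact(3, '04d')
as_text = '{}'.format(Int(3))

try:
    bformat_exact(Int(3))
except TypeError:
    subclass_rejected = True
else:
    subclass_rejected = False
```

The exact-type check trades flexibility for predictability: the result never depends on what a subclass's __format__ returns.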
On Wed, Jan 15, 2014 at 10:52 AM, Eric V. Smith <eric@trueblade.com> wrote:
On 1/15/2014 9:45 AM, Brett Cannon wrote:
That's too vague; % interpolation does not support other format operators in the same way as str.format() does. % interpolation has specific code to support %d, etc. But str.format() gets supported for {:d} not from special code but because e.g. float.__format__('d') works. So you can't say "bytes.format() supports {:d} just like %d works with string interpolation" since the mechanisms are fundamentally different.
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.encode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free. It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
I also think that a 'b' conversion be added to bytes.format(). This doesn't have the same issue as %b if you make {} implicitly mean {!b} in Python 3.5 as {} will mean what is the most accurate for bytes.format() in either version. It also allows for explicit support where you know you only want a byte and allows {!s} to mean you only want a string (and thus throw an error otherwise).
And all of this means that much like %s only taking bytes, the only way for bytes.format() to accept a non-byte argument is for some format spec to be specified to trigger the .encode('ascii', 'strict') call.
Agreed. With %-formatting, you can start with the format strings and then decide what you want to do with the passed in objects. But with .format, it's the other way around: you have to look at the passed in objects being formatted, and then decide what the format specifier means to that type.
So, for .format, you could say "hey, that object's an int, and I happen to know how to format ints, outside of calling its .__format__". Or you could even call its __format__ because you know that it will only be ASCII. But to take this approach, you're going to have to hard-code the types. And subclasses are probably out, since there you don't know what the subclass's __format__ will return. It could be non-ASCII.
    >>> class Int(int):
    ...     def __format__(self, fmt):
    ...         return u'foo'
    ...
    >>> '{}'.format(Int(3))
    'foo'
So basically I think we'll have to hard-code the types that .format() will support, and never call __format__, or only call __format__ if we know that it's an exact type where we know that __format__ will return (strict ASCII).
Either that, or we're back to encoding the result of __format__ and accepting that sometimes it might throw errors, depending on the values being passed into format().
I say accept that an error might get thrown as there is precedent of specifying a format spec that an object's __format__() method doesn't recognize::
    >>> '{:s}'.format(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: Unknown format code 's' for object of type 'int'
IOW I'm actively trying to avoid type-restricting the semantics for bytes.format() for a consistent, clear mental model. Remembering that "any format spec leads to calling .encode('ascii', 'strict') on the result" is simple compared to "ASCII bytes will be returned for ints and floats when passed in, otherwise all other types follow these rules". As the zen says: Errors should never pass silently. Special cases aren't special enough to break the rules. -Brett
On 1/15/2014 7:52 AM, Eric V. Smith wrote:
So basically I think we'll have to hard-code the types that .format() will support, and never call __format__, or only call __format__ if we know that it's an exact type where we know that __format__ will return (strict ASCII).
Either that, or we're back to encoding the result of __format__ and accepting that sometimes it might throw errors, depending on the values being passed into format().
Looks like you need to invent __formatb__ to produce only ASCII. Objects that have __formatb__ can be formatted by bytes.format. To avoid coding, it could be possible that __formatb__ might be a callable, in which case it is called to get the result, or not a callable, in which case one calls __format__ and converts the result to ASCII, __formatb__ just indicating a guarantee that only ASCII will result. Or it could be that __formatb__ replaces __format__ and str.__format__, if it finds no __format__ looks for __formatb__, calls that, and converts the result to Unicode.
Glenn Linderman <v+python@g.nevcal.com> wrote:
On 1/15/2014 7:52 AM, Eric V. Smith wrote:
Either that, or we're back to encoding the result of __format__ and accepting that sometimes it might throw errors, depending on the values being passed into format().
That would take us back to Python 2 hell. Please no. I don't like checking for types either, we should have a special method.
Looks like you need to invent __formatb__ to produce only ASCII. Objects that have __formatb__ can be formatted by bytes.format. To avoid coding, it could be possible that __formatb__ might be a callable in which case it is called to get the result, or not a callable, in which case one calls __format__ and converts the result to ASCII, __formatb__ just indicating a guarantee that only ASCII will result.
Just do:

    def __formatb__(self, spec):
        return MyClass.__format__(self, spec).encode('ascii')

Note that I think it is better to explicitly use the __format__ method rather than using self.__format__. My reasoning is that a subclass might implement a __format__ that returns non-ASCII characters. We don't need a special bytes version of __str__ since the %-operator can call __formatb__ with the correct format spec. Neil
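To make the __formatb__ idea concrete, here is a sketch of how such a protocol might look. Everything in it is hypothetical: __formatb__ is only a proposal in this thread, and the classes and the bformat_via_protocol helper are invented for illustration.

```python
class Temperature:
    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        return format(self.degrees, spec)

    def __formatb__(self, spec):
        # Name the class explicitly, as suggested above, so that a subclass
        # overriding __format__ cannot leak non-ASCII text into the bytes.
        return Temperature.__format__(self, spec).encode('ascii')


class FancyTemperature(Temperature):
    def __format__(self, spec):
        return format(self.degrees, spec) + '\u00b0'  # adds a degree sign


def bformat_via_protocol(obj, spec=''):
    """Hypothetical lookup a bytes formatter could perform."""
    method = getattr(type(obj), '__formatb__', None)
    if method is None:
        raise TypeError('%s has no __formatb__' % type(obj).__name__)
    return method(obj, spec)


base = bformat_via_protocol(Temperature(21.5), '.1f')
fancy_bytes = bformat_via_protocol(FancyTemperature(21.5), '.1f')
fancy_text = format(FancyTemperature(21.5), '.1f')
```

The subclass still formats with a degree sign as text, while its bytes form stays ASCII because __formatb__ bypasses the overridden __format__.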
On Wed, Jan 15, 2014 at 10:57 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
bytes.format() below. I'll leave it to you to decide if they warrant using, leaving as an open question, or rejecting.
Thanks for your comments. I've only barely touched format, so it's not an area of strength for me. :)
Time to strengthen it if you are proposing a PEP that is going to affect it. =)
-- ~Ethan~
On 01/15/2014 08:51 AM, Brett Cannon wrote:
On Wed, Jan 15, 2014 at 10:57 AM, Ethan Furman wrote:
Thanks for your comments. I've only barely touched format, so it's not an area of strength for me. :)
Time to strengthen it if you are proposing a PEP that is going to affect it. =)
I am. You're helping. :) -- ~Ethan~
On 01/15/2014 06:45 AM, Brett Cannon wrote:
I also think that a 'b' conversion be added to bytes.format(). This doesn't have the same issue as %b if you make {} implicitly mean {!b} in Python 3.5 as {} will mean what is the most accurate for bytes.format() in either version. It also allows for explicit support where you know you only want a byte and allows {!s} to mean you only want a string (and thus throw an error otherwise).
Given that !b does not exist in Py2, !s (like %s) has to mean bytes when working with a byte stream. Given that, !s and !b would mean the same thing, so is it worth adding !b? -- ~Ethan~
On Wed, Jan 15, 2014 at 4:24 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
I also think that a 'b' conversion be added to bytes.format(). This doesn't have the same issue as %b if you make {} implicitly mean {!b} in Python 3.5 as {} will mean what is the most accurate for bytes.format() in either version. It also allows for explicit support where you know you only want a byte and allows {!s} to mean you only want a string (and thus throw an error otherwise).
Given that !b does not exist in Py2, !s (like %s) has to mean bytes when working with a byte stream. Given that, !s and !b would mean the same thing, so is it worth adding !b?
I disagree with the assertion. %s has to mean bytes for Python 2 compatibility because there is no equivalent to '{}' (no conversion or format spec specified); basically %s represents "no conversion" for the % operator. But since format() has the concept of a default conversion as well as explicit conversions you can lean on that fact and let the default conversion do what makes sense for that version of Python.
On 01/15/2014 06:45 AM, Brett Cannon wrote: The PEP currently says::
format ------
The format mini language will be used as-is, with the behaviors as listed for %-interpolation.
That's too vague; % interpolation does not support other format operators in the same way as str.format() does. % interpolation has specific code to support %d, etc. But str.format() gets supported for {:d} not from special code but because e.g. float.__format__('d') works. So you can't say "bytes.format() supports {:d} just like %d works with string interpolation" since the mechanisms are fundamentally different.
A question for anyone that has extensive experience in both %-formatting and .format-formatting: Would it be possible, at least for int and float, to take whatever is in the specifier and convert to %? Example: "Weight: {wgt:-07f}".format(wgt=137.23) would take the "-07f" and basically do a "%-07f" % 137.23 to get the ASCII to use? -- ~Ethan~
On 1/15/2014 4:32 PM, Ethan Furman wrote:
A question for anyone that has extensive experience in both %-formatting and .format-formatting: Would it be possible, at least for int and float, to take whatever is in the specifier and convert to %? Example:
"Weight: {wgt:-07f}".format(wgt=137.23)
would take the "-07f" and basically do a "%-07f" % 137.23 to get the ASCII to use?
I think the int.__format__ version might be a superset. Specifically, the "n" and "%" types. There may well be others. But I think we could say we're not going to support these in b"".format().
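The idea of reusing a format spec as a printf-style code can be sketched as below. The helper name bformat_number and the regular expression are assumptions, and the accepted subset is deliberately narrow: optional '+' or space sign, optional zero padding, width, and precision for float codes only. The "n" and "%" types mentioned above are rejected, and so is a leading '-', because it selects a sign option in the format mini-language but left-justification in %-formatting.

```python
import re

# Assumption: a narrow subset where the spec can be handed to % unchanged.
_SIMPLE_SPEC = re.compile(r'[+ ]?0?\d*(?:[doxX]|(?:\.\d+)?[eEfFgG])')


def bformat_number(value, spec):
    """Hypothetical: format an exact int or float via %-formatting."""
    if type(value) not in (int, float):
        raise TypeError('only exact int and float are handled in this sketch')
    if not _SIMPLE_SPEC.fullmatch(spec):
        raise ValueError('spec %r has no direct printf-style equivalent here' % spec)
    return (('%' + spec) % value).encode('ascii')


weight = bformat_number(137.23, '07.2f')
hex_value = bformat_number(255, '04x')

# The same specs through the str mini-language, for comparison.
weight_text = format(137.23, '07.2f')
hex_text = format(255, '04x')
```

For the specs tried here the two mechanisms agree; other corners of the two mini-languages may not, which is why the sketch refuses anything it does not recognize.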
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.encode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
It may work like this under the hood, but it's an implementation detail. Since the numeric format codes will call int, index, or float on the object (to handle subclasses), we could then call __format__ on the resulting int or float to do the heavy lifting; but since __format__ on anything else would never be called I don't want to give that impression.
It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
This isn't going to happen. If the user wants a string to be in the byte stream, it has to either be a bytes literal or explicitly encoded [1]. -- ~Ethan~ [1] Apologies if this has already been answered. I wanted to make sure I responded to all the ideas/objects, and I may have responded more than once to some. It's been a long few threads. ;)
On 16 Jan 2014 17:53, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.encode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
It may work like this under the hood, but it's an implementation detail. Since the numeric format codes will call int, index, or float on the object (to handle subclasses), we could then call __format__ on the resulting int or float to do the heavy lifting; but since __format__ on anything else would never be called I don't want to give that impression.
I have a different proposal: let's *just* add mod formatting to bytes, and leave the extensible formatting system as a text only operation. We don't really care if bytes supports that method for version compatibility purposes, and the deliberate flexibility of the design makes it hard to translate into the binary domain. So let's just not provide that - let's accept that, for the binary domain, printf style formatting is just a better fit for the job :) Cheers, Nick.
It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
This isn't going to happen. If the user wants a string to be in the byte stream, it has to either be a bytes literal or explicitly encoded [1].
-- ~Ethan~
[1] Apologies if this has already been answered. I wanted to make sure I responded to all the ideas/objects, and I may have responded more than once to some. It's been a long few threads. ;)
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On Thu, Jan 16, 2014 at 4:56 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 16 Jan 2014 17:53, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.decode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
It may work like this under the hood, but it's an implementation detail. Since the numeric format codes will call int, index, or float on the object (to handle subclasses), we could then call __format__ on the resulting int or float to do the heavy lifting; but since __format__ on anything else would never be called I don't want to give that impression.
I have a different proposal: let's *just* add mod formatting to bytes, and leave the extensible formatting system as a text only operation.
We don't really care if bytes supports that method for version compatibility purposes, and the deliberate flexibility of the design makes it hard to translate into the binary domain.
So let's just not provide that - let's accept that, for the binary domain, printf style formatting is just a better fit for the job :)
Or PEP 460 for bytes.format() and PEP 461 for %. -Brett
Cheers, Nick.
It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
This isn't going to happen. If the user wants a string to be in the byte stream, it has to either be a bytes literal or explicitly encoded [1].
-- ~Ethan~
[1] Apologies if this has already been answered. I wanted to make sure I responded to all the ideas/objects, and I may have responded more than once to some. It's been a long few threads. ;)
On Thu, Jan 16, 2014 at 2:51 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.decode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
It may work like this under the hood, but it's an implementation detail.
I'm arguing it's not an implementation detail but a definition of how bytes.format() would work.
Since the numeric format codes will call int, index, or float on the object (to handle subclasses),
But that's **only** because the numeric types choose to as part of their __format__() implementation; it is not inherent to str.format().
we could then call __format__ on the resulting int or float to do the heavy lifting;
It's not just the heavy lifting; it does **all** the lifting for format specifications.
but since __format__ on anything else would never be called I don't want to give that impression.
Fine, if you're worried about bytes.format() overstepping by implicitly calling str.encode() on the return value of __format__() then you will need __bytes__format__() to get equivalent support. -Brett
It also means if you pass in a string that you just want the strict ASCII bytes of then you can get it with {:s}.
This isn't going to happen. If the user wants a string to be in the byte stream, it has to either be a bytes literal or explicitly encoded [1].
-- ~Ethan~
[1] Apologies if this has already been answered. I wanted to make sure I responded to all the ideas/objects, and I may have responded more than once to some. It's been a long few threads. ;)
On 01/16/2014 06:45 AM, Brett Cannon wrote:
On Thu, Jan 16, 2014 at 2:51 AM, Ethan Furman wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.decode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
Since the numeric format codes will call int, index, or float on the object (to handle subclasses),
But that's **only** because the numeric types choose to as part of their __format__() implementation; it is not inherent to str.format().
As I understand it, str.format will call the object's __format__. So, for example, if I say:
u'the value is: %d' % myNum(17)
then it will be myNum.__format__ that gets called, not int.__format__; this is precisely what we don't want, since we can't know that myNum is only going to return ASCII characters.
This is why I would have bytes.__format__, as part of its parsing, call int, index, or float depending on the format code; so the above example would have bytes.__format__ calling int() on myNum(17), at which point we either have an int type or an exception was raised because myNum isn't really an integer. Once we have an int, whose format we know and trust, then we can call its __format__ and proceed from there.
On the flip side, if myNum does define its own __format__, it will not be called by bytes.format, and perhaps that is another good reason for bytes to only support %-interpolation and not format?
-- ~Ethan~
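A minimal sketch of the concern, using format() because that is the call that dispatches to __format__. The MyNum class here is hypothetical and deliberately returns non-ASCII text; the "convert first, then format" step mirrors the approach described above, not any shipped implementation.

```python
class MyNum(int):
    """Hypothetical int subclass whose __format__ returns non-ASCII text."""

    def __format__(self, spec):
        return "\u2460\u2466"  # circled digits, not encodable as ASCII


value = MyNum(17)

# format() dispatches to the subclass hook, so the result cannot be
# assumed to be ASCII.
custom = format(value, "d")

# Converting first yields an exact int, whose __format__ output for "d"
# is plain ASCII digits and can be encoded safely.
plain = int(value)
trusted = format(plain, "d").encode("ascii")
```

The cost, as noted above, is that the subclass's own __format__ is bypassed entirely.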
On 01/16/2014 11:23 AM, Ethan Furman wrote:
On 01/16/2014 06:45 AM, Brett Cannon wrote:
On Thu, Jan 16, 2014 at 2:51 AM, Ethan Furman wrote:
On 01/15/2014 06:45 AM, Brett Cannon wrote:
This is why I have argued that if you specify it as "if there is a format spec specified, then the return value from calling __format__() will have str.decode('ascii', 'strict') called on it" you get the support for the various number-specific format specs for free.
Since the numeric format codes will call int, index, or float on the object (to handle subclasses),
But that's **only** because the numeric types choose to as part of their __format__() implementation; it is not inherent to str.format().
As I understand it, str.format will call the object's __format__. So, for example, if I say:
u'the value is: %d' % myNum(17)
then it will be myNum.__format__ that gets called, not int.__format__; this is precisely what we don't want, since we can't know that myNum is only going to return ASCII characters.
"Magic" methods, including __format__, are called on the type, not the instance.
This is why I would have bytes.__format__, as part of its parsing, call int, index, or float depending on the format code; so the above example would have bytes.__format__ calling int() on myNum(17), at which point we either have an int type or an exception was raised because myNum isn't really an integer. Once we have an int, whose format we know and trust, then we can call its __format__ and proceed from there.
On the flip side, if myNum does define its own __format__, it will not be called by bytes.format, and perhaps that is another good reason for bytes to only support %-interpolation and not format?
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up). Eric.
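The proposal above could be sketched roughly as follows for a single replacement field. The helper name format_field and its handling of bytes (passed through, with no spec support) are assumptions made for illustration; the thread does not specify them.

```python
_SUPPORTED = (int, float)


def format_field(value, spec=""):
    """Hypothetical helper: format one field for a bytes result."""
    if type(value) is bytes:
        if spec:
            raise ValueError("format specs for bytes are not handled in this sketch")
        return value
    # Exact types only: subclasses (including bool) are rejected.
    if type(value) not in _SUPPORTED:
        raise TypeError("unsupported type: %s" % type(value).__name__)
    # Call the type's __format__ with the object as self, then encode.
    return type(value).__format__(value, spec).encode("ascii")


padded = format_field(137.23, "07.2f")
hexed = format_field(255, "x")
raw = format_field(b"abc")
```

Because only exact int and float reach __format__, the encode step cannot fail on their output.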
On 01/16/2014 10:30 AM, Eric V. Smith wrote:
On 01/16/2014 11:23 AM, Ethan Furman wrote:
On 01/16/2014 06:45 AM, Brett Cannon wrote:
But that's **only** because the numeric types choose to as part of their __format__() implementation; it is not inherent to str.format().
As I understand it, str.format will call the object's __format__. So, for example, if I say:
u'the value is: %d' % myNum(17)
then it will be myNum.__format__ that gets called, not int.__format__; this is precisely what we don't want, since we can't know that myNum is only going to return ASCII characters.
"Magic" methods, including __format__, are called on the type, not the instance.
Yes, that's why I said `myNum(17)` and not `myNum`.
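For readers following the exchange about type-level lookup, this small sketch shows what "called on the type, not the instance" means in practice. The MyNum class is hypothetical; the behaviour shown is standard special-method lookup.

```python
class MyNum(int):
    pass


n = MyNum(17)

# An attribute set on the instance is ignored by format().
n.__format__ = lambda spec: "from instance"
via_instance_attr = format(n, "")

# A method defined on the type is what format() actually calls,
# with the instance passed in as self.
MyNum.__format__ = lambda self, spec: "from type"
via_type = format(n, "")
```

Both readings in the thread are therefore compatible: the method is found on myNum, and it runs against the instance myNum(17).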
This is why I would have bytes.__format__, as part of its parsing, call int, index, or float depending on the format code; so the above example would have bytes.__format__ calling int() on myNum(17), at which point we either have an int type or an exception was raised because myNum isn't really an integer. Once we have an int, whose format we know and trust, then we can call its __format__ and proceed from there.
On the flip side, if myNum does define its own __format__, it will not be called by bytes.format, and perhaps that is another good reason for bytes to only support %-interpolation and not format?
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
That can certainly be our fallback position if we can't decide now how we want to handle int and float subclasses. -- ~Ethan~
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1 -eric
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason. For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result? Cheers, Nick.
-eric
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely. IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either. Eric.
On 01/17/2014 07:34 AM, Eric V. Smith wrote:
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely.
IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either.
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
I'd advocate 1 or 2.
Eric.
On 17/01/2014 14:50, Eric V. Smith wrote:
On 01/17/2014 07:34 AM, Eric V. Smith wrote:
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely.
IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either.
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
I'd advocate 1 or 2.
Eric.
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated, they now have to change the code back to the way they may well have written it in the first place? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 01/17/2014 10:15 AM, Mark Lawrence wrote:
On 17/01/2014 14:50, Eric V. Smith wrote:
On 01/17/2014 07:34 AM, Eric V. Smith wrote:
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely.
IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either.
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
I'd advocate 1 or 2.
Eric.
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated, they now have to change the code back to the way they may well have written it in the first place?
That would be part of it, yes. Otherwise you need #3.
This is all assuming we've ruled out an option 4, because of the exceptions raised depending on what __format__ does:
4. Add bytes.format(), have it convert the format specifier to str (unicode), call __format__ and encode the result back to ASCII. Accept that there will be data-driven exceptions depending on the result of the __format__ call.
I'm open to other ideas.
Eric.
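To make the "data-driven exceptions" of option 4 concrete, here is a minimal sketch for a single field. The function name bytes_format_field is hypothetical; the body simply follows the three steps listed in option 4.

```python
def bytes_format_field(value, spec=b""):
    """Hypothetical option-4 behaviour for one replacement field."""
    # Convert the specifier to str, call __format__ via format(), encode back.
    text = format(value, spec.decode("ascii"))
    return text.encode("ascii")  # may raise, depending on the data


ok = bytes_format_field(42, b"05d")

try:
    bytes_format_field("caf\u00e9", b"s")
except UnicodeEncodeError:
    failed = True
else:
    failed = False
```

The same call site succeeds or fails depending only on the value passed in, which is the behaviour the Python 3 text model was designed to avoid.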
On 01/17/2014 10:24 AM, Eric V. Smith wrote:
On 01/17/2014 10:15 AM, Mark Lawrence wrote:
On 17/01/2014 14:50, Eric V. Smith wrote:
On 01/17/2014 07:34 AM, Eric V. Smith wrote:
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely.
IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either.
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
For #3, hopefully this "additional work" on the 3.x side would just be to add, to each class where you already have a custom __format__ used for b''.format(), code like:

    def __format_ascii__(self, fmt):
        return self.__format__(fmt.decode()).encode('ascii')

That is, we're pushing the possibility of having to deal with an encoding exception off to the type, instead of having it live in bytes.format().
And to agree with Ethan: %-formatting isn't deprecated.
Eric.
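A self-contained sketch of how option 3 might look from a class author's side. Both __format_ascii__ and format_ascii() are proposals from this thread, not existing Python features, and the Celsius class is a made-up example type.

```python
class Celsius:
    """Hypothetical user type with a custom __format__."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        return format(self.degrees, spec) + " C"

    def __format_ascii__(self, fmt):
        # Proposed hook from the message above; not a real Python protocol.
        return self.__format__(fmt.decode()).encode("ascii")


def format_ascii(value, fmt=b""):
    """Hypothetical builtin: look the hook up on the type, as format() does."""
    return type(value).__format_ascii__(value, fmt)


reading = format_ascii(Celsius(21.5), b".1f")
```

Any UnicodeEncodeError would now be raised inside the type's own method, which is the shift of responsibility described above.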
On 17 January 2014 15:50, Eric V. Smith <eric@trueblade.com> wrote:
For #3, hopefully this "additional work" on the 3.x side would just be to add, to each class where you already have a custom __format__ used for b''.format(), code like:
def __format_ascii__(self, fmt): return self.__format__(fmt.decode()).encode('ascii')
For me, the big cost would seem to be in the necessary documentation, explaining the new special method in the language reference, explaining the 2 different forms of format() in the built in types docs. And the conceptual overhead of another special method for people to be aware of. If I implement my own number subclass, do I need to implement __format_ascii__? My gut feeling is that we simply don't implement format() for bytes. I don't see sufficient benefit, if %-formatting is available. Paul.
On 18 Jan 2014 02:08, "Paul Moore" <p.f.moore@gmail.com> wrote:
On 17 January 2014 15:50, Eric V. Smith <eric@trueblade.com> wrote:
For #3, hopefully this "additional work" on the 3.x side would just be to add, to each class where you already have a custom __format__ used for b''.format(), code like:
def __format_ascii__(self, fmt): return self.__format__(fmt.decode()).encode('ascii')
For me, the big cost would seem to be in the necessary documentation, explaining the new special method in the language reference, explaining the 2 different forms of format() in the built in types docs. And the conceptual overhead of another special method for people to be aware of. If I implement my own number subclass, do I need to implement __format_ascii__?
My gut feeling is that we simply don't implement format() for bytes. I don't see sufficient benefit, if %-formatting is available.
Exactly, it's the documentation problem to explain "when would I recommend using this over the alternatives?" that turns me off the idea of general purpose bytes formatting. printf style covers the use cases we have identified, and the code bases of immediate interest support 2.5 or earlier and thus *must* be using printf-style formatting.
Add to that the fact that to maintain the Python 3 text model, we either have to gut it to the point where it has very few of the benefits the text version offers over printf-style formatting, or else we introduce a whole new protocol for a feature that we consider so borderline that it took us six Python 3 releases to add it back to the language.
By contrast, the following model is relatively easy to document:
* printf-style is low level and relatively inflexible, but available for both text and for ASCII compatible segments in binary data. The %s formatting code accepts arbitrary objects (using str) in text mode, but only buffer exporters and objects with a __bytes__ method in binary mode.
* the format method is high level and very flexible, but available only for text - the result must be explicitly encoded to binary if that is needed.
Cheers, Nick.
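The two-tier model described above can be sketched as follows. This assumes an interpreter where %-interpolation on bytes is available (Python 3.5 or later); the Packet class is a hypothetical type that opts in through __bytes__.

```python
class Packet:
    """Hypothetical type that opts in to binary interpolation via __bytes__."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return self.payload


# printf-style, directly on bytes: numeric codes plus %s for bytes-like
# objects and objects with __bytes__.
line = b"%s %04x %s" % (b"DATA", 255, Packet(b"\x00\xff"))

# A str is rejected by %s in binary mode; it must be encoded explicitly.
try:
    b"%s" % "text"
except TypeError:
    str_rejected = True
else:
    str_rejected = False

# format() stays text-only; encode the finished result when bytes are needed.
encoded = "{:>8.2f}".format(3.14159).encode("ascii")
```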
Paul.
On 17/01/2014 15:41, Ethan Furman wrote:
On 01/17/2014 07:15 AM, Mark Lawrence wrote:
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated
%f formatting is not deprecated, and will not be in 3.x's lifetime.
-- ~Ethan~
I'm sorry, I got the above wrong, I should have said "was to be deprecated" :( -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 1/17/2014 7:15 AM, Mark Lawrence wrote:
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated, they now have to change the code back to the way they may well have written it in the first place?
If they are committed to format(), another option is to operate in the Unicode domain, and encode at the end.
On 01/17/2014 02:04 PM, Glenn Linderman wrote:
On 1/17/2014 7:15 AM, Mark Lawrence wrote:
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated, they now have to change the code back to the way they may well have written it in the first place?
If they are committed to format(), another option is to operate in the Unicode domain, and encode at the end.
Maybe that's the best advice to give. It's better than my earlier example of field-at-a-time encoding. Eric.
On 1/17/2014 10:15 AM, Mark Lawrence wrote:
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated,
It will not be for at least a decade.
they now have to change the code back to the way they may well have written it in the first place?
I would suggest that people simply .encode the result if bytes are needed in 3.x as well as 2.x. Polyglot code will likely have a 'py3' boolean already to make the encoding conditional. -- Terry Jan Reedy
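The "format as text, encode at the end" advice can be sketched as below. This is only an illustration: the flag name `PY3` and the helper `content_length_header` are made up for the example, standing in for the 'py3' boolean that polyglot code is said to carry already.

```python
import sys

# Hypothetical flag name; polyglot code bases often define something similar.
PY3 = sys.version_info[0] >= 3


def content_length_header(length):
    """Build the header in the text domain, then encode once at the end."""
    text = 'Content-Length: {:d}'.format(length)
    # On 2.x a plain str is already a byte string, so only 3.x needs to encode.
    return text.encode('ascii') if PY3 else text


header = content_length_header(47)
```

The same source runs on both major versions, and the only version-dependent step is the final encode.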
On 18 Jan 2014 06:19, "Terry Reedy" <tjreedy@udel.edu> wrote:
On 1/17/2014 10:15 AM, Mark Lawrence wrote:
For both options 1 and 2 surely you cannot be suggesting that after people have written 2.x code to use format() as %f formatting is to be deprecated,
It will not be for at least a decade.
It will not be deprecated, period.

Originally, we thought that the introduction of the new flexible text formatting system made printf-style formatting redundant. After running both in parallel for a while, we learned we were wrong:

- it's far more difficult than we originally anticipated to migrate away from it to the new text formatting system
- in particular, the lazy interpolation support in the logging module (and similar systems) has no reasonable migration path
- two different core interpolation systems make it much easier to interpolate into format strings
- it's a better fit for code which needs to semantically align with C
- it's a useful micro-optimisation
- as the current discussion shows, it's much better suited to the interpolation of ASCII compatible segments in binary data formats

Do many of the core devs strongly prefer the new formatting system? Yes.

Were we originally planning to deprecate and remove the printf-style formatting system? Yes.

Are there still any plans to do so? No.

That's why we rewrote the relevant docs to always describe it as "mod formatting" or "printf-style formatting", rather than "legacy" or "old-style". If there are any instances (or even implications) of the latter left in the official docs, that's a bug to be fixed.

Perhaps this needs to be a new Q in my Python 3 Q&A, since a lot of people still seem to have the wrong idea...

Regards, Nick.
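The lazy interpolation point about the logging module can be illustrated with a small sketch. The `ListHandler` class and the logger name are invented for the demonstration; the `logging` calls themselves are standard library API.

```python
import logging


class ListHandler(logging.Handler):
    """Collect records so the deferred interpolation can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger('pep461.lazy_demo')  # illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.info('value is %d', 42)   # template and args are stored separately
logger.debug('value is %d', 43)  # below INFO: no record, nothing is formatted

record = handler.records[0]
template, args = record.msg, record.args
message = record.getMessage()    # printf-style interpolation happens here
```

Because the template and its arguments travel separately, a filtered-out message never pays the formatting cost, which is the behaviour that is hard to reproduce with eagerly evaluated `str.format()` calls.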
they now have to change the code back to the way they may well have written it in the first place?
I would suggest that people simply .encode the result if bytes are needed
in 3.x as well as 2.x. Polyglot code will likely have a 'py3' boolean already to make the encoding conditional.
-- Terry Jan Reedy
On Fri, Jan 17, 2014 at 9:50 AM, Eric V. Smith <eric@trueblade.com> wrote:
On 01/17/2014 07:34 AM, Eric V. Smith wrote:
On 1/17/2014 6:42 AM, Nick Coghlan wrote:
On 17 Jan 2014 18:03, "Eric Snow" <ericsnowcurrently@gmail.com> wrote:
On Thu, Jan 16, 2014 at 11:30 AM, Eric V. Smith <eric@trueblade.com> wrote:
For the first iteration of bytes.format(), I think we should just support the exact types of int, float, and bytes. It will call the type's __format__ (with the object as "self") and encode the result to ASCII. For the stated use case of 2.x compatibility, I suspect this will cover > 90% of the uses in real code. If we find there are cases where real code needs additional types supported, we can consider adding __format_ascii__ (or whatever name we cook up).
+1
Please don't make me learn the limitations of a new mini language without a really good reason.
For the sake of argument, assume we have a Python 3.5 with bytes.__mod__ restored roughly as described in PEP 461. *Given* that feature set, what is the rationale for *adding* bytes.format? What new capabilities will it provide that aren't already covered by printf-style interpolation directly to bytes or text formatting followed by encoding the result?
The only reason to add any of this, in my mind, is to ease porting of 2.x code. If my proposal covers most of the cases of b''.format() that exist in 2.x code that wants to move to 3.5, then I think it's worth doing. Is there any such code that's blocked from porting by the lack of b''.format() that supports bytes, int, and float? I don't know. I concede that it's unlikely.
IF this were a feature that we were going to add to 3.5 on its own merits, I think we add __format_ascii__ and make the whole thing extensible. Is there any new code that's blocked from being written by missing b"".format()? I don't know that, either.
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
+1 I would rephrase it to "switch to %-formatting for bytes usage for their common code base". If they are working with actual text then using str.format() still works (and is actually nicer to use IMO). It actually might make the str/bytes relationship even clearer, especially if we start to promote that str.format() is for text and %-formatting is for bytes.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
-1 I am still not comfortable with the special-casing by type for bytes.format().
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
+0 Would allow for easy porting and it's general enough, but I don't know if working with bytes really requires this much beyond supporting the porting story. I'm still +1 on PEP 460 for bytes.format() as a nice way to simplify basic bytes usage in Python 3, but if that's not accepted then I say just drop bytes.format() entirely and let %-formatting be the way people do Python 2/3 bytes work (if they are not willing to build it up from scratch like they already can do). -Brett
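To make options 2 and 3 concrete, here is a rough sketch of how they might fit together. Nothing here exists in Python: `format_ascii()` and `__format_ascii__` are the hypothetical names floated in the thread, and the handling of a non-empty spec for bytes is an assumption of this example.

```python
def format_ascii(obj, spec=''):
    """Hypothetical helper sketching options 2 and 3; not a real builtin."""
    # Option 3: an opt-in protocol method looked up on the type.
    hook = getattr(type(obj), '__format_ascii__', None)
    if hook is not None:
        result = hook(obj, spec)
        if not isinstance(result, bytes):
            raise TypeError('__format_ascii__ must return bytes')
        return result
    # Option 2: exact int and float only (no subclasses); call the type's
    # own __format__ and strictly encode the result as ASCII.
    if type(obj) in (int, float):
        return type(obj).__format__(obj, spec).encode('ascii')
    if type(obj) is bytes:
        if spec:
            # Assumption: this sketch does not define format specs for bytes.
            raise ValueError('format spec not supported for bytes')
        return obj
    raise TypeError('unsupported type: %s' % type(obj).__name__)


class Version:
    """Illustrative type that opts in to the hypothetical protocol."""

    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def __format_ascii__(self, spec):
        return ('%d.%d' % (self.major, self.minor)).encode('ascii')


padded = format_ascii(47, '05d')
rounded = format_ascii(2.5, '.2f')
version = format_ascii(Version(3, 5))
```

Note that `format_ascii(True)` raises `TypeError` in this sketch, because `bool` is a subclass of `int` and the type check is exact; that is the special-casing by type that drew the -1 above.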
On Fri, Jan 17, 2014 at 11:16 AM, Barry Warsaw <barry@python.org> wrote:
On Jan 17, 2014, at 11:00 AM, Brett Cannon wrote:
I would rephrase it to "switch to %-formatting for bytes usage for their common code base".
-1. %-formatting is so neanderthal. :)
Very much so, which is why I'm willing to let it be bastardized in Python 3.5 for the sake of porting but not bytes.format(). =) I'm keeping format() clean for my nieces and nephew to use; they can just turn their nose up at %-formatting when they are old enough to program.
On 01/17/2014 11:58 AM, Brett Cannon wrote:
On Fri, Jan 17, 2014 at 11:16 AM, Barry Warsaw <barry@python.org> wrote:
On Jan 17, 2014, at 11:00 AM, Brett Cannon wrote:
I would rephrase it to "switch to %-formatting for bytes usage for their common code base".
-1. %-formatting is so neanderthal. :)
Very much so, which is why I'm willing to let it be bastardized in Python 3.5 for the sake of porting but not bytes.format(). =) I'm keeping format() clean for my nieces and nephew to use; they can just turn their nose up at %-formatting when they are old enough to program.
Given the problems with implementing it, I'm more than willing to drop bytes.format() from PEP 461 (not that it's my PEP).

But if we think that %-formatting is neanderthal and will get dropped in the Python 4000 timeframe (that is, someday in the far future), then I think we should have some advice to give to people who are writing new 3.x code for the non-porting use-cases addressed by the PEP. I'm specifically thinking of new code that wants to format some bytes for an on-the-wire ascii-like protocol.

Is it:

    b'Content-Length: ' + str(47).encode('ascii')

or

    b'Content-Length: {}'.format(str(47).encode('ascii'))

or something better?

I think it will look like the above, or involve something like bytes.format() and __format_ascii__. Or, maybe a library that just supports a few types (say, bytes, int, and float!).

Eric.
On 01/17/2014 09:13 AM, Eric V. Smith wrote:
On 01/17/2014 11:58 AM, Brett Cannon wrote:
On Fri, Jan 17, 2014 at 11:16 AM, Barry Warsaw wrote:
On Jan 17, 2014, at 11:00 AM, Brett Cannon wrote:
I would rephrase it to "switch to %-formatting for bytes usage for their common code base".
-1. %-formatting is so neanderthal. :)
Very much so, which is why I'm willing to let it be bastardized in Python 3.5 for the sake of porting but not bytes.format(). =) I'm keeping format() clean for my nieces and nephew to use; they can just turn their nose up at %-formatting when they are old enough to program.
Given the problems with implementing it, I'm more than willing to drop bytes.format() from PEP 461 (not that it's my PEP). But if we think that %-formatting is neanderthal and will get dropped in the Python 4000 timeframe
I hope not!
(that is, someday in the far future), then I think we should have some advice to give to people who are writing new 3.x code for the non-porting use-cases addressed by the PEP. I'm specifically thinking of new code that wants to format some bytes for an on-the-wire ascii-like protocol.
%-interpolation handles this use case well, format does not.
Is it: b'Content-Length: ' + str(47).encode('ascii') or b'Content-Length: {}'.format(str(47).encode('ascii')) or something better?
Ew. Neither of those look better than b'Content-Length: %d' % 47 -- ~Ethan~
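For comparison, the spellings under discussion can be put side by side. This sketch assumes Python 3.5 or later, where bytes %-interpolation is available; bytes has no format() method, so the {} variant is shown in its "format as text, then encode" form.

```python
length = 47

# Concatenation with an explicitly encoded number.
concatenated = b'Content-Length: ' + str(length).encode('ascii')

# printf-style interpolation directly into bytes (Python 3.5+).
interpolated = b'Content-Length: %d' % length

# str.format() in the text domain, encoded once at the end.
text_then_encoded = 'Content-Length: {}'.format(length).encode('ascii')
```

All three produce the same byte string; the %-interpolation form is the one that keeps the template and the value visibly separate without leaving the bytes domain.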
Responding to two posts at once, as I consider them On 1/17/2014 11:00 AM, Brett Cannon wrote:
I would rephrase it to "switch to %-formatting for bytes usage for their common code base". If they are working with actual text then using str.format() still works (and is actually nicer to use IMO). It actually might make the str/bytes relationship even clearer, especially if we start to promote that str.format() is for text and %-formatting is for bytes.
Good idea, I think: printf % formatting was invented for formatting ascii text in bytestrings as it was being output (although sprintf allowed not-output). In retrospect, I think we should have introduced unicode.format when unicode was introduced in 2.0 and perhaps never have had unicode % formatting. Or we should have dropped str % instead of bytes % in 3.0.

On 1/17/2014 12:13 PM, Eric V. Smith wrote:
But if we think that %-formatting is neanderthal and will get dropped in the Python 4000 timeframe (that is, someday in the far future),
Some people, such as Martin Loewis, have a different opinion of %-formatting and will fight deprecating it *ever*. (I suspect that %-format opinions are influenced by one's current relation to C.)
then I think we should have some advice to give to people who are writing new 3.x code for the non-porting use-cases addressed by the PEP. I'm specifically thinking of new code that wants to format some bytes for an on-the-wire ascii-like protocol.
If we add %-formatting back in 3.5 for its original purpose, formatting ascii in bytes for output, I think we should drop the idea of later deprecating it (a few releases later) for that purpose. I think the PEP should even say so, that bytes % will remain indefinitely even if str % were to be dropped in favor of str.format. I would consider dropping unicode(now string).__mod__ in favor of .format to still be an eventual option, especially if someone were to write a converter. -- Terry Jan Reedy
On 1/17/2014 6:50 AM, Eric V. Smith wrote:
Following up, I think this leaves us with 3 choices:
1. Do not implement bytes.format(). We tell any 2.x code that's written to use str.format() to switch to %-formatting for their common code base.
2. Add the simplistic version of bytes.format() that I describe above, restricted to accepting bytes, int, and float (and no subclasses). Some 2.x code will work, some will need to change to %-formatting.
3. Add bytes.format() and the __format_ascii__ protocol. We might want to also add a format_ascii() builtin, to match __format__ and format(). This would require the least change to 2.x code that uses str.format() and wants to move to bytes.format(), but would require some work on the 3.x side.
I'd advocate 1 or 2.
Nice summary. I'd advocate 1 or 3.
On Thu, Jan 16, 2014 at 08:23:13AM -0800, Ethan Furman wrote:
As I understand it, str.format will call the object's __format__. So, for example, if I say:
u'the value is: %d' % myNum(17)
then it will be myNum.__format__ that gets called, not int.__format__;
I seem to have missed something, because I am completely confused... Why are you talking about str.format and then show an example using % instead?

%d calls __str__, not __format__. This is in Python 3.3:

    py> class MyNum(int):
    ...     def __str__(self):
    ...         print("Calling MyNum.__str__")
    ...         return super().__str__()
    ...     def __format__(self):
    ...         print("Calling MyNum.__format__")
    ...         return super().__format__()
    ...
    py> n = MyNum(17)
    py> u"%d" % n
    Calling MyNum.__str__
    '17'

By analogy, if we have a bytes %d formatting, surely it should either:

(1) call type(n).__bytes__(n), which is guaranteed to raise if the result isn't ASCII (i.e. like len() raises if the result isn't an int); or

(2) call type(n).__str__(n).encode("ascii", "strict").

Personally, I lean towards (2), even though that means you can't have a single class provide an ASCII string to b'%d' and a non-ASCII string to u'%d'.
this is precisely what we don't want, since can't know that myNum is only going to return ASCII characters.
It seems to me that Consenting Adults applies here. If class MyNum returns a non-ASCII string, then you ought to get a runtime exception, exactly the same as happens with just about every other failure in Python. If you don't want that possible exception, then don't use MyNum, or explicitly wrap it in a call to int:

    b'the value is: %d' % int(MyNum(17))

The *worst* solution would be to completely ignore MyNum.__str__. That's a nasty violation of the Principle Of Least Surprise, and will lead to confusion ("why isn't my class' __str__ method being called?") and bugs.

* Explicit is better than implicit -- better to explicitly wrap MyNum in a call to int() than to have bytes %d automagically do it for you;
* Special cases aren't special enough to break the rules -- bytes %d isn't so special that standard Python rules about calling special methods should be ignored;
* Errors should never pass silently -- if MyNum does the wrong thing when used with bytes %d, you should get an exception.
This is why I would have bytes.__format__, as part of its parsing, call int, index, or float depending on the format code; so the above example would have bytes.__format__ calling int() on myNum(17),
The above example you give doesn't have any bytes in it. Can you explain what you meant to say? I'm guessing you intended this: b'the value is: %d' % MyNum(17) rather than using u'' as actually given, but I don't really know. -- Steven
On 01/16/2014 11:47 PM, Steven D'Aprano wrote:
On Thu, Jan 16, 2014 at 08:23:13AM -0800, Ethan Furman wrote:
As I understand it, str.format will call the object's __format__. So, for example, if I say:
u'the value is: %d' % myNum(17)
then it will be myNum.__format__ that gets called, not int.__format__;
I seem to have missed something, because I am completely confused... Why are you talking about str.format and then show an example using % instead?
Sorry, PEP 46x fatigue. :/ It should have been u'the value is {:d}'.format(myNum(17)) and yes I meant the str type.
%d calls __str__, not __format__. This is in Python 3.3:
    py> class MyNum(int):
    ...     def __str__(self):
    ...         print("Calling MyNum.__str__")
    ...         return super().__str__()
    ...     def __format__(self):
    ...         print("Calling MyNum.__format__")
    ...         return super().__format__()
    ...
    py> n = MyNum(17)
    py> u"%d" % n
    Calling MyNum.__str__
    '17'
And that's a bug we fixed in 3.4:

    Python 3.4.0b1 (default:172a6bfdd91b+, Jan 5 2014, 06:39:32)
    [GCC 4.7.3] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    --> class myNum(int):
    ...     def __int__(self):
    ...         return 7
    ...     def __index__(self):
    ...         return 11
    ...     def __float__(self):
    ...         return 13.81727
    ...     def __str__(self):
    ...         print('__str__')
    ...         return '1'
    ...     def __repr__(self):
    ...         print('__repr__')
    ...         return '2'
    ...
    --> '%d' % myNum()
    '0'
    --> '%f' % myNum()
    '13.817270'

After all, consider:

    --> '%d' % True
    '1'
    --> '%s' % True
    'True'
So, in fact, on subclasses __str__ should *not* be called to get the integer representation. First we do a conversion to make sure we have an int (or float, or ...), and then we call __str__ on our tried and trusted genuine core type.
The *worst* solution would be to completely ignore MyNum.__str__. That's a nasty violation of the Principle Of Least Surprise, and will lead to confusion ("why isn't my class' __str__ method being called?")
Because you asked for a numeric representation, not a string representation [1]. -- ~Ethan~ [1] for all the gory details, see: http://bugs.python.org/issue18780 http://bugs.python.org/issue18738
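The distinction between numeric and string codes can be checked with a short script. It assumes Python 3.5 or later so that the bytes case runs as well; `MyNum` follows the thread's example, while the `calls` list is added here only to record which special methods fire.

```python
calls = []


class MyNum(int):
    def __str__(self):
        calls.append('__str__')
        return 'seventeen'


n = MyNum(17)

as_number = '%d' % n   # numeric code: the integer value is used, not __str__
as_bytes = b'%d' % n   # same rule for bytes interpolation (Python 3.5+)
as_string = '%s' % n   # string code: __str__ is called
```

Only the `%s` interpolation touches `MyNum.__str__`, so a misbehaving `__str__` on an int subclass cannot inject unexpected characters into a `%d` field.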
On Thu, Jan 16, 2014 at 8:45 AM, Brett Cannon <brett@python.org> wrote:
Fine, if you're worried about bytes.format() overstepping by implicitly calling str.encode() on the return value of __format__() then you will need __bytes__format__() to get equivalent support.
Could we just re-use PEP-3101's note (easily updated for Python 3): Note for Python 2.x: The 'format_spec' argument will be either a string object or a unicode object, depending on the type of the original format string. The __format__ method should test the type of the specifiers parameter to determine whether to return a string or unicode object. It is the responsibility of the __format__ method to return an object of the proper type. If __format__ receives a format_spec of type bytes, it should return bytes. For such cases on objects that cannot support bytes (i.e. for str), it can raise. This appears to avoid the need for additional methods. (As does Nick's proposal of leaving it out for now.)
On Thu, Jan 16, 2014 at 11:33 AM, Michael Urman <murman@gmail.com> wrote:
On Thu, Jan 16, 2014 at 8:45 AM, Brett Cannon <brett@python.org> wrote:
Fine, if you're worried about bytes.format() overstepping by implicitly calling str.encode() on the return value of __format__() then you will need __bytes__format__() to get equivalent support.
Could we just re-use PEP-3101's note (easily updated for Python 3):
Note for Python 2.x: The 'format_spec' argument will be either a string object or a unicode object, depending on the type of the original format string. The __format__ method should test the type of the specifiers parameter to determine whether to return a string or unicode object. It is the responsibility of the __format__ method to return an object of the proper type.
If __format__ receives a format_spec of type bytes, it should return bytes. For such cases on objects that cannot support bytes (i.e. for str), it can raise. This appears to avoid the need for additional methods. (As does Nick's proposal of leaving it out for now.)
That's a very good catch, Michael! I think that makes sense if there is precedent. Unfortunately that bit from the PEP never made it into the documentation so I'm not sure if there is a backwards-compatibility worry.
On 1/16/2014 8:41 AM, Brett Cannon wrote:
That's a very good catch, Michael! I think that makes sense if there is precedent. Unfortunately that bit from the PEP never made it into the documentation so I'm not sure if there is a backwards-compatibility worry.
No. If __format__ is called with bytes format, and returns str, there would be an exception generated on the spot. If __format__ is called with bytes format, and tries to use it as str, there would be an exception generated on the spot. Prior to 3.whenever-this-is-implemented, Python 3 only provides str formats to __format__, right? So new code is required to pass bytes to __format__.
Michael Urman <murman@gmail.com> wrote:
If __format__ receives a format_spec of type bytes, it should return bytes. For such cases on objects that cannot support bytes (i.e. for str), it can raise. This appears to avoid the need for additional methods. (As does Nick's proposal of leaving it out for now.)
That's an interesting idea. I proposed __ascii__ as an analogous method to __format__ for bytes formatting and to have %-interpolation use it. However, overloading __format__ based on the type of the argument could work. I see with Python 3:

    >>> (1).__format__(b'')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: must be str, not bytes

A TypeError exception is what we want if the object does not support bytes formatting. Some possible problems:

- It could be hard to provide a helpful exception message since it is generated inside the __format__ method rather than inside the bytes.__mod__ method (in the case of a missing __ascii__ method). The most common error will be using a str object and so we could modify the __format__ method of str to provide a nice hint (use encode()).

- Is there some risk that an object will unwittingly implement a __format__ method that unintentionally accepts a bytes argument? That requires some investigation.
On Thu, Jan 16, 2014 at 11:13 AM, Neil Schemenauer <nas@arctrix.com> wrote:
A TypeError exception is what we want if the object does not support bytes formatting. Some possible problems:
- It could be hard to provide a helpful exception message since it is generated inside the __format__ method rather than inside the bytes.__mod__ method (in the case of a missing __ascii__ method). The most common error will be using a str object and so we could modify the __format__ method of str to provide a nice hint (use encode()).
The various format functions could certainly intercept and wrap exceptions raised by __format__ methods. Once the core types were modified to expect bytes in format_spec, however, this may not be critical; __format__ methods which delegate would work as expected, str could certainly be clear about why it raised, and custom implementations would be handled per comments I'll make on your second point. Overall I suspect this is no worse than unhandled values in the format_spec are today.
- Is there some risk that an object will unwittingly implement a __format__ method that unintentionally accepts a bytes argument? That requires some investigation.
Agreed. Some quick armchair calculations suggest to me that there are three likely outcomes:

- Properly handle the type (perhaps written with the 2.x clause in mind)
- Raise an exception internally (perhaps ValueError, such as from format(3, 'q'))
- Mishandle and return a str (perhaps due to if/else defaulting)

The first and second outcome may well reflect what we want, and the third could easily be detected and turned into an exception by the format functions. I'm uncertain whether this reflects all the scenarios we would care about.
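The three outcomes can be sketched with a wrapper that checks the return type. Everything below is hypothetical: `format_bytes()` stands in for whatever a bytes-aware format function might do, and `Temperature` and `Careless` are invented classes. The sketch calls `__format__` directly because the real `format()` builtin rejects a bytes spec.

```python
class Temperature:
    """Hypothetical type whose __format__ dispatches on the spec's type."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, format_spec):
        if isinstance(format_spec, bytes):
            text = format(self.degrees, format_spec.decode('ascii'))
            return text.encode('ascii')
        return format(self.degrees, format_spec)


class Careless:
    """Ignores the spec's type and always returns str (the third outcome)."""

    def __format__(self, format_spec):
        return 'oops'


def format_bytes(obj, format_spec=b''):
    """Hypothetical wrapper: call __format__ and insist on a bytes result."""
    result = type(obj).__format__(obj, format_spec)
    if not isinstance(result, bytes):
        raise TypeError('%s.__format__ returned non-bytes for a bytes spec'
                        % type(obj).__name__)
    return result


handled = format_bytes(Temperature(21.5), b'.1f')  # outcome 1: handled
as_text = format(Temperature(21.5), '.1f')         # text path is unchanged
```

In this sketch `format_bytes(1)` raises `TypeError` from inside `int.__format__` (outcome 2), and `format_bytes(Careless())` raises `TypeError` from the wrapper's return-type check (outcome 3).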
16.01.2014 17:33, Michael Urman wrote:
On Thu, Jan 16, 2014 at 8:45 AM, Brett Cannon <brett@python.org> wrote:
Fine, if you're worried about bytes.format() overstepping by implicitly calling str.encode() on the return value of __format__() then you will need __bytes__format__() to get equivalent support.
Could we just re-use PEP-3101's note (easily updated for Python 3):
Note for Python 2.x: The 'format_spec' argument will be either a string object or a unicode object, depending on the type of the original format string. The __format__ method should test the type of the specifiers parameter to determine whether to return a string or unicode object. It is the responsibility of the __format__ method to return an object of the proper type.
If __format__ receives a format_spec of type bytes, it should return bytes. For such cases on objects that cannot support bytes (i.e. for str), it can raise. This appears to avoid the need for additional methods. (As does Nick's proposal of leaving it out for now.)
-1. I'd treat the format()+.__format__()+str.format()-"ecosystem" as a nice text-data-oriented, *complete* Py3k feature, backported to Python 2 to share the benefits of the feature with it as well as to make the 2-to-3 transition a bit easier. IMHO, the PEP-3101's note cited above just describes a workaround over the flaws of the Py2's obsolete text model. Moving such complications into Py3k would make the feature (and especially the ability to implement your own .__format__()) harder to understand and make use of -- for little profit. Such a move is not needed for compatibility. And, IMHO, the format()/__format__()/str.format()-matter is all about nice and flexible *text* formatting, not about binary data interpolation. 16.01.2014 10:56, Nick Coghlan wrote:
I have a different proposal: let's *just* add mod formatting to bytes, and leave the extensible formatting system as a text only operation.
We don't really care if bytes supports that method for version compatibility purposes, and the deliberate flexibility of the design makes it hard to translate into the binary domain.
So let's just not provide that - let's accept that, for the binary domain, printf style formatting is just a better fit for the job :)
+1! However, I am not sure if %s should be limited to bytes-like objects. As "practicality beats purity", I would be +0.5 for enabling the following:

- input type supports Py_buffer? use it to collect the necessary bytes
- input type has the __bytes__() method? use it to collect the necessary bytes
- input type has the encode() method? raise TypeError
- otherwise: use something equivalent to ascii(obj).encode('ascii') (note that it would nicely format numbers and format other objects in a more-or-less useful way without the fear of encountering non-ascii data).

Another option: use the str()-representation of strictly defined types, e.g.: int, float, decimal.Decimal, fractions.Fraction...

Cheers. *j
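The four-step resolution order proposed above can be written out as a pure-Python sketch. The function name `resolve_percent_s` is invented for the illustration, and `memoryview()` is used here as a stand-in for "supports Py_buffer"; this is the proposal as described in the message, not what Python actually does for `%s`.

```python
def resolve_percent_s(obj):
    """Sketch of the proposed %s resolution order (hypothetical helper)."""
    # 1. Buffer exporters: copy the exported bytes.
    try:
        return bytes(memoryview(obj))
    except TypeError:
        pass
    # 2. Objects with a __bytes__ method, looked up on the type.
    to_bytes = getattr(type(obj), '__bytes__', None)
    if to_bytes is not None:
        return to_bytes(obj)
    # 3. Anything with encode() is text-like: refuse rather than guess.
    if hasattr(obj, 'encode'):
        raise TypeError('encode %s objects explicitly' % type(obj).__name__)
    # 4. Fallback: ascii() never produces non-ASCII characters.
    return ascii(obj).encode('ascii')


class Message:
    """Illustrative class with a well-defined bytes form."""

    def __bytes__(self):
        return b'PING\r\n'


from_buffer = resolve_percent_s(bytearray(b'xy'))
from_method = resolve_percent_s(Message())
from_number = resolve_percent_s(3.14)
```

Passing a str such as `'text'` raises `TypeError` at step 3, which keeps failures dependent on the type of the operand rather than on its value.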
On Thu, Jan 16, 2014 at 3:06 PM, Jan Kaliszewski <zuo@chopin.edu.pl> wrote:
I'd treat the format()+.__format__()+str.format()-"ecosystem" as a nice text-data-oriented, *complete* Py3k feature, backported to Python 2 to share the benefits of the feature with it as well as to make the 2-to-3 transition a bit easier.
IMHO, the PEP-3101's note cited above just describes a workaround over the flaws of the Py2's obsolete text model. Moving such complications into Py3k would make the feature (and especially the ability to implement your own .__format__()) harder to understand and make use of -- for little profit.
Such a move is not needed for compatibility. And, IMHO, the format()/__format__()/str.format()-matter is all about nice and flexible *text* formatting, not about binary data interpolation.
[disclaimer: I personally don't have many use cases for any bytes formatting.] Yet there is still a strong symmetry between str and bytes that makes bytes easier to use. I don't always use formatting, but when I do I use .format(). :) never-been-a-fan-of-mod-formatting-ly yours, -eric
This looks pretty good to me. I don't think we should limit operands based on type, that's anti-Pythonic IMHO. We should use duck-typing and that means a special method, I think. We could introduce a new one but __bytes__ looks like it can work. Otherwise, maybe __ascii__ is a good name. Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned, no matter what the *value* (we don't want to repeat the hell of Python 2's unicode to str coercion which depends on the value of the unicode object). Objects that already contain encoded bytes or arbitrary bytes can also implement __bytes__. Ethan Furman <ethan@stoneleaf.us> wrote:
%s, because it is the most general, has the most convoluted resolution:
This becomes much simpler: - does the object implement __bytes__? call it and use the value otherwise raise TypeError
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
+1. %b might be conceptually neater but ease of migration trumps that, IMHO.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
Right. That would put us back in Python 2 unicode -> str coercion hell. Thanks for writing the PEP. Neil
On Wed, 15 Jan 2014 15:47:43 +0000 (UTC) Neil Schemenauer <nas@arctrix.com> wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned, no matter what the *value*
I think that's a slippery slope. __bytes__ should mean that the object has a well-known bytes equivalent or encoding, not that its __str__ happens to be pure ASCII. (for example, it would be fine for a HTTP message class to define a __bytes__ method) Also, consider that if e.g. float had a __bytes__ method, then bytes(2.0) would start returning b'2.0', while bytes(2) would still need to return b'\x00\x00'. Regards Antoine.
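A short sketch of the distinction drawn here, assuming Python 3.5 or later for the interpolation line. `HTTPRequestLine` is an invented class in the spirit of the "HTTP message class" example: its `__bytes__` returns a well-defined wire encoding rather than an ASCII rendering of `__str__`.

```python
class HTTPRequestLine:
    """Hypothetical message class with a well-known bytes encoding."""

    def __init__(self, method, target):
        self.method = method
        self.target = target

    def __bytes__(self):
        return b'%s %s HTTP/1.1\r\n' % (self.method, self.target)


request = HTTPRequestLine(b'GET', b'/index.html')

wire = bytes(request)           # uses __bytes__
interpolated = b'%s' % request  # %s also falls back to __bytes__ (3.5+)
zero_filled = bytes(2)          # an int argument means "this many zero bytes"
```

The last line is why giving numbers a `__bytes__` method would be confusing: `bytes(2)` already has an established meaning that is unrelated to the decimal digits of 2.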
On Wed, 15 Jan 2014, Antoine Pitrou wrote:
On Wed, 15 Jan 2014 15:47:43 +0000 (UTC) Neil Schemenauer <nas@arctrix.com> wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned, no matter what the *value*
I think that's a slippery slope. __bytes__ should mean that the object has a well-known bytes equivalent or encoding, not that its __str__ happens to be pure ASCII.
+1
(for example, it would be fine for a HTTP message class to define a __bytes__ method)
Also, consider that if e.g. float had a __bytes__ method, then bytes(2.0) would start returning b'2.0', while bytes(2) would still need to return b'\x00\x00'.
Not actually suggesting the following for a number of reasons including but not limited to the consistency of floating point formats across different implementations, but it would make more sense for bytes (2.0) to return the 8-byte IEEE representation than for it to return the ASCII encoding of the decimal representation of the number. Isaac Morland CSCF Web Guru DC 2619, x36650 WWW Software Specialist
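For reference, the two candidate byte forms of 2.0 mentioned here look like this. The sketch uses the standard `struct` module with an explicit big-endian format, which is a choice made for the example and sidesteps the platform-consistency concern raised above.

```python
import struct

# 8-byte IEEE 754 double, big-endian byte order chosen explicitly.
ieee_double = struct.pack('>d', 2.0)

# ASCII encoding of the decimal representation.
ascii_decimal = repr(2.0).encode('ascii')

round_tripped = struct.unpack('>d', ieee_double)[0]
```

The binary form is fixed-width and byte-order dependent, while the ASCII form is what a text-like wire protocol would carry, which is the job the %-codes are meant to cover.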
On 01/15/2014 08:04 AM, Antoine Pitrou wrote:
On Wed, 15 Jan 2014 15:47:43 +0000 (UTC) Neil Schemenauer <nas@arctrix.com> wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned, no matter what the *value*
I think that's a slippery slope. __bytes__ should mean that the object has a well-known bytes equivalent or encoding, not that its __str__ happens to be pure ASCII.
Agreed. -- ~Ethan~
Antoine Pitrou <solipsis@pitrou.net> wrote:
On Wed, 15 Jan 2014 15:47:43 +0000 (UTC) Neil S wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned, no matter what the *value*
I think that's a slippery slope. __bytes__ should mean that the object has a well-known bytes equivalent or encoding, not that its __str__ happens to be pure ASCII.
After poking around some more into the Python 3 source, I agree. It seems too late to change bytes(<integer>) and bytearray(<integer>). We should have used a keyword only argument but too late now (tp_new is a mess).

I can also agree that pushing the ASCII-centric behavior into the bytes() constructor goes too far. If we limit the ASCII-centric behavior solely to % and format(), that seems a reasonable trade-off for usability. As others have argued, once you are using format codes, you are pretty clearly dealing with ASCII encoding.

I feel strongly that % and format on bytes needs to use duck-typing and not type checking. Also, formatting failures must be due to types and not due to values. If we can get agreement on these two principles, that will help guide the design.

Those principles absolutely rule out calling encode('ascii') automatically. I'm not deeply intimate with format() but I think it also rules out calling __format__.

Could we introduce only __bformat__ and have the % operator call it? That would only require implementing one new special method instead of two.

Neil
All sounds good. A fleeting thought about constructors: you can always add alternative constructors as class methods (like datetime does).
-- --Guido van Rossum (python.org/~guido)
Neil Schemenauer <nas@arctrix.com> wrote:
We should use duck-typing and that means a special method, I think. We could introduce a new one but __bytes__ looks like it can work. Otherwise, maybe __ascii__ is a good name.
I poked around the Python 3 source. Using __bytes__ has some downsides, e.g. the following would happen:

    >>> bytes(12)
    b'12'

Perhaps that's a little too ASCII-centric. OTOH, UTF-8 seems to be winning the encoding war and so the above could be argued as reasonable behavior. I think forcing people to explicitly choose an encoding for str objects will be sufficient to avoid the bytes/str mess we have in Python 2.

Unfortunately, that change conflicts with the current behavior:

    >>> bytes(12)
    b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Would it be too disruptive to change that? It doesn't appear to be too useful and we could do it using a keyword argument, e.g.:

    bytes(size=12)

I notice something else surprising to me:

    >>> class Test(object):
    ...     def __bytes__(self):
    ...         return b'test'
    ...
    >>> with open('test', 'wb') as fp:
    ...     fp.write(Test())
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    TypeError: 'Test' does not support the buffer interface

I'd expect that to write b'test' to the file, not raise an error.

Regards, Neil
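The behaviors described above can be reproduced without touching the filesystem. This is a small sketch that substitutes `io.BytesIO` for the real file (an assumption made only to keep the example self-contained); the exact wording of the TypeError message varies between Python versions, so only the exception type is checked.

```python
import io


class Test:
    def __bytes__(self):
        return b'test'


# An int argument means "this many null bytes", not its decimal digits.
zero_filled = bytes(3)

# bytes() does honor __bytes__ on arbitrary objects...
via_dunder = bytes(Test())

# ...but binary write() wants the buffer protocol, not __bytes__.
buffer = io.BytesIO()
try:
    buffer.write(Test())
    write_failed = False
except TypeError:
    write_failed = True

# An explicit conversion is what makes the write succeed.
buffer.write(bytes(Test()))
```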
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value. -- Greg
On Thu, Jan 16, 2014 at 10:55:31AM +1300, Greg Ewing wrote:
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value.
+1 -- Steven
On Wed, Jan 15, 2014 at 5:00 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jan 16, 2014 at 10:55:31AM +1300, Greg Ewing wrote:
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value.
+1
If we are going the route of a new magic method then __ascii__ or __bytes_format__ get my vote as long as they only return bytes (I see no need to abbreviate to __bformat__ or __formatb__ when we have method names as long as __text_signature__ now).
On 15/01/2014 22:22, Brett Cannon wrote:
On Wed, Jan 15, 2014 at 5:00 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jan 16, 2014 at 10:55:31AM +1300, Greg Ewing wrote:
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value.
+1
If we are going the route of a new magic method then __ascii__ or __bytes_format__ get my vote as long as they only return bytes (I see no need to abbreviate to __bformat__ or __formatb__ when we have method names as long as __text_signature__ now).
__bytes_format__ gets my vote as it's blatantly obvious what it does. I'm against __ascii__ as I'd automatically associate that with ascii in the same way that I associate str with __str__ and repr with __repr__. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On Wed, Jan 15, 2014 at 10:34:48PM +0000, Mark Lawrence wrote:
On 15/01/2014 22:22, Brett Cannon wrote:
On Wed, Jan 15, 2014 at 5:00 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jan 16, 2014 at 10:55:31AM +1300, Greg Ewing wrote:
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value.
+1
If we are going the route of a new magic method then __ascii__ or __bytes_format__ get my vote as long as they only return bytes (I see no need to abbreviate to __bformat__ or __formatb__ when we have method names as long as __text_signature__ now).
__bytes_format__ gets my vote as it's blatantly obvious what it does.
What precisely does it do? If it's so obvious, why is this thread so long?
I'm against __ascii__ as I'd automatically associate that with ascii in the same way that I associate str with __str__ and repr with __repr__.
That's a good point. I forgot about ascii(). -- Steven
On 1/15/2014 4:03 PM, Steven D'Aprano wrote:
What precisely does it do? If it's so obvious, why is this thread so long?
It produces a formatted representation of the object in bytes. For numbers, that would probably be expected to be ASCII digits and punctuation.
But other items are not as obvious. bytes would probably be expected not to have a __bytes_format__, but if a subclass defined one, it might be HEX or Base64 of the base bytes. Or if the subclass is ASCII text oriented, it might be the ASCII text version of the base bytes (which would be identical to the base bytes, except for the type transformation).
str would probably be expected not to have a __bytes_format__, but if a subclass defined one, it might be HEX or Base64, or it might be a specific encoding of the base str.
Other objects might generate an ASCII __repr__, if they define the method.
It took a lot of talk to reach the conclusion, if it has been reached, that none of the solutions are general enough without defining something like __bytes_format__. And before that, a lot of talk to decide that % interpolation already had an ASCII bias.
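The subclass possibilities listed above could look roughly like the following sketch. Everything here is hypothetical: `__bytes_format__` is only a name floated in this thread, `bytes_format` is an illustrative dispatcher standing in for whatever the interpreter would do, and the two subclasses are invented for the example.

```python
import base64
import binascii


def bytes_format(obj):
    """Hypothetical dispatcher standing in for what b'%s' might call."""
    method = getattr(type(obj), '__bytes_format__', None)
    if method is None:
        raise TypeError(
            '%s has no __bytes_format__ method' % type(obj).__name__)
    result = method(obj)
    if not isinstance(result, bytes):
        raise TypeError('__bytes_format__ must return bytes')
    return result


class HexBytes(bytes):
    """bytes subclass that chooses to format itself as hex digits."""

    def __bytes_format__(self):
        return binascii.hexlify(self)


class Base64Str(str):
    """str subclass that chooses Base64 of its UTF-8 encoding."""

    def __bytes_format__(self):
        return base64.b64encode(self.encode('utf-8'))


hex_out = bytes_format(HexBytes(b'\x00\xff'))  # b'00ff'
b64_out = bytes_format(Base64Str('hi'))        # b'aGk='
```

Plain bytes and str define no such method in this sketch, so they are rejected on type alone, which matches the expectation stated above that the base types would not have a __bytes_format__.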
On Wed, Jan 15, 2014 at 05:46:07PM -0800, Glenn Linderman wrote:
On 1/15/2014 4:03 PM, Steven D'Aprano wrote:
What precisely does it do? If it's so obvious, why is this thread so long?
It produces a formatted representation of the object in bytes. For numbers, that would probably be expected to be ASCII digits and punctuation.
But other items are not as obvious.
My point exactly. -- Steven
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Neil Schemenauer wrote:
Objects that implement __str__ can also implement __bytes__ if they can guarantee that ASCII characters are always returned,
I think __ascii__ would be a better name. I'd expect a method called __bytes__ on an int to return some version of its binary value.
I realize now we can't use __bytes__. Currently, passing an int to bytes() causes it to construct an object with that many null bytes. If we are going to support format() (I'm not convinced it is necessary and could easily be added in a later version), then we need an equivalent to __format__. My vote is either:

    def __formatascii__(self, spec): ...

or

    def __ascii__(self, spec): ...

Previously I was thinking of __bformat__ or __formatb__ but having ascii in the method name is a great reminder.
Objects with a natural arbitrary byte representation can implement __bytes__ and %s should use that if it exists.
Neil
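A sketch of how a method with the first proposed signature might be used follows. The `Celsius` class and the `format_ascii` helper are assumptions made up for illustration; only the method name and its `(self, spec)` signature come from the message above.

```python
class Celsius:
    """Hypothetical numeric-like type used only for illustration."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __formatascii__(self, spec):
        # Reuse str formatting, then encode. The digits, sign and
        # decimal point produced for this spec are plain ASCII.
        return format(self.degrees, spec).encode('ascii')


def format_ascii(obj, spec=''):
    """Stand-in for the lookup an interpreter might perform."""
    method = getattr(type(obj), '__formatascii__', None)
    if method is None:
        raise TypeError(
            '%s does not define __formatascii__' % type(obj).__name__)
    return method(obj, spec)


reading = format_ascii(Celsius(21.456), '.1f')
print(reading)  # b'21.5'
```

Looking the method up on the type rather than the instance mirrors how other special methods are resolved, and keeps the failure mode a TypeError that depends only on the argument's type.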
On 14/01/2014 19:56, Ethan Furman wrote:
Duh. Here's the text, as well. ;)
%s, because it is the most general, has the most convoluted resolution:
- input type is bytes? pass it straight through
- input type is numeric? use its __xxx__ [1] [2] method and ascii-encode it (strictly)
- input type is something else? use its __bytes__ method; if there isn't one, raise an exception [3]
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 b'3.14'
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
For completeness I believe %r and %a should be included here as well. FTR %a appears to have been introduced in 3.2, but I couldn't find anything in the What's New and there's no note in the docs http://docs.python.org/3/library/stdtypes.html#printf-style-string-formattin... to indicate when it first came into play. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
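For readers unfamiliar with the two codes mentioned here, this short sketch shows how %r and %a differ for str formatting today. It uses only existing str behavior and says nothing about what bytes formatting should do with them.

```python
text = 'caf\u00e9'

# %r uses repr(), which keeps printable non-ASCII characters.
as_repr = '%r' % text

# %a uses ascii(), which escapes every non-ASCII character.
as_ascii = '%a' % text

# Only the %a result is guaranteed to survive a strict ASCII encode.
ascii_bytes = as_ascii.encode('ascii')
```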
On Tue, Jan 14, 2014 at 2:55 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
This PEP goes a bit further than PEP 460 does, and hopefully spells things out in enough detail so there is no confusion as to what is meant.
Are we going down the PEP route with the various ideas? Guido, do you want one from me as well or should I not bother?
On 01/14/2014 01:05 PM, Brett Cannon wrote:
On Tue, Jan 14, 2014 at 2:55 PM, Ethan Furman wrote:
This PEP goes a bit further than PEP 460 does, and hopefully spells things out in enough detail so there is no confusion as to what is meant.
Are we going down the PEP route with the various ideas? Guido, do you want one from me as well or should I not bother?
While I can't answer for Guido, I will say I authored this PEP because Antoine didn't want 460 to be any more liberal than it already was. If you collect your ideas together, I'll add them to 461 as questions or discussions or however is appropriate (assuming you're willing to go that route). -- ~Ethan~
I think of PEP 460 as the strict version and PEP 461 as the lenient version. I don't think it makes sense to have more variants. So please collaborate with whichever you like best. :-)
-- --Guido van Rossum (python.org/~guido)
15.01.14 00:40, Guido van Rossum wrote:
I think of PEP 460 as the strict version and PEP 461 as the lenient version. I don't think it makes sense to have more variants. So please collaborate with whichever you like best. :-)
Perhaps the consensus will be PEP 460.5? Or PEP 460.3, or maybe PEP 460.7?
On 14/01/2014 19:55, Ethan Furman wrote:
This PEP goes a bit further than PEP 460 does, and hopefully spells things out in enough detail so there is no confusion as to what is meant.
-- ~Ethan~
Out of plain old curiosity do we have to consider PEP 292 string templates in any way, shape or form, or regarding this debate have they been safely booted into touch? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On 01/14/2014 02:41 PM, Mark Lawrence wrote:
On 14/01/2014 19:55, Ethan Furman wrote:
This PEP goes a bit further than PEP 460 does, and hopefully spells things out in enough detail so there is no confusion as to what is meant.
-- ~Ethan~
Out of plain old curiosity do we have to consider PEP 292 string templates in any way, shape or form, or regarding this debate have they been safely booted into touch?
Well, I'm not sure what "booted into touch" means, but yes, we can ignore string templates. :) -- ~Ethan~
Current copy of PEP, many modifications from all the feedback. Thank you to everyone. I know it's been a long week (feels a lot longer!) while all this was hammered out, but I think we're getting close!

============================

Abstract
========

This PEP proposes adding the % and {} formatting operations from str to bytes [1].


Overriding Principles
=====================

In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::

  - duck-typing to allow/reject entry into a byte-stream
  - no value generated errors


Proposed semantics for bytes formatting
=======================================

%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers, except locale.

Example::

   >>> b'%4x' % 10
   b'   a'

%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.

Example:

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

%s is restricted in what it will accept::

  - input type supports Py_buffer?
    use it to collect the necessary bytes

  - input type is something else?
    use its __bytes__ method; if there isn't one, raise an exception [2]

Examples:

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: 3.14 has no __bytes__ method

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?

.. note::

   Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::

      'a string'.encode('latin-1')


format
------

The format mini language codes, where they correspond with the %-interpolation codes, will be used as-is, with three exceptions::

  - !s is not supported, as {} can mean the default for both str and bytes, in both Py2 and Py3.
  - !b is supported, and new Py3k code can use it to be explicit.
  - no other __format__ method will be called.


Numeric Format Codes
--------------------

To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).


Unsupported codes
-----------------

%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.

!r and !a are not supported.

The n integer and float format code is not supported.


Open Questions
==============

Currently non-numeric objects go through::

  - Py_buffer
  - __bytes__
  - failure

Do we want to add a __format_bytes__ method in there?

  - Guaranteed to produce only ascii (as in b'10', not b'\x0a')
  - Makes more sense than using __bytes__ to produce ascii output
  - What if an object has both __bytes__ and __format_bytes__?

Do we need to support all the numeric format codes? The floating point exponential formats seem less appropriate, for example.


Proposed variations
===================

It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.

It has been suggested to use %b for bytes instead of %s.

  - Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.

It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.

  - Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").

  - Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.


Footnotes
=========

.. [1] string.Template is not under consideration.
.. [2] TypeError, ValueError, or UnicodeEncodeError?

======================

-- ~Ethan~
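The %-interpolation examples in this draft can be checked on Python 3.5 or later, where %-formatting for bytes is available. The following is a minimal sketch under that assumption; the helper name `raises_type_error` is invented for illustration, and the {} / format part of the draft is deliberately not exercised here.

```python
def raises_type_error(template, arg):
    """Report whether interpolating arg into template raises TypeError."""
    try:
        template % (arg,)
    except TypeError:
        return True
    return False


# Numeric codes behave as they do for str, including padding.
padded = b'%4x' % 10
rounded = b'%.2f' % 3.14159

# %c takes an int in range(256) or a length-1 bytes object.
from_int = b'%c' % 48
from_byte = b'%c' % b'a'

# %s passes bytes straight through...
passthrough = b'%s' % b'abc'

# ...and refuses floats and str purely on the basis of their type.
float_rejected = raises_type_error(b'%s', 3.14)
str_rejected = raises_type_error(b'%s', 'hello world!')

# A str has to be encoded explicitly before it can be interpolated.
encoded = b'%s' % 'a string'.encode('latin-1')
```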
Hi Ethan, I haven't chimed into this discussion, but the direction it's headed recently seems right to me. Thanks for putting together a PEP. Some comments on it: On 01/15/2014 05:13 PM, Ethan Furman wrote:
============================ Abstract ========
This PEP proposes adding the % and {} formatting operations from str to bytes [1].
I think the PEP could really use a rationale section summarizing _why_ these formatting operations are being added to bytes; namely that they are useful when working with various ASCIIish-but-not-properly-text network protocols and file formats, and in particular when porting code dealing with such formats/protocols from Python 2.
Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled (presuming the PEP is accepted). For instance: the main objection, AIUI, has been that the bytes type is for pure bytes-handling with no assumptions about encoding, and thus we should not add features to it that assume ASCIIness, and that may be attractive nuisances for people writing bytes-handling code that should not assume ASCIIness but will once they use the feature.
And the refutation: that the bytes type already provides some operations that assume ASCIIness, and these new formatting features are no more of an attractive nuisance than those; since the syntax of the formatting mini-languages itself assumes ASCIIness, there is not likely to be any temptation to use it with binary data that cannot make that assumption.
Although it can be hard to arrive at accurate and agreed-on summaries of the discussion, recording such summaries in the PEP is important; it may help save our future selves and colleagues from having to revisit all these same discussions and megathreads.
Overriding Principles =====================
In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::
- duck-typing to allow/reject entry into a byte-stream - no value generated errors
This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. A duck-typing implementation would not use isinstance, it would call / check for the existence of a certain magic method instead.
I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are; that is, that the benefit of Python 3's model is that encoding is explicit, not implicit, making it harder to unwittingly write code that works as long as all data is ASCII, but fails as soon as someone feeds in non-ASCII text data.
Not everyone who reads this PEP will be steeped in years of discussion about the relative merits of the Python 2 vs 3 models; it doesn't hurt to spell out a few assumptions.
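The distinction drawn here between type-checking and duck-typing can be sketched as follows. Both acceptance functions and the `Payload` class are hypothetical; the duck-typed version follows the draft's stated order (buffer protocol first, then __bytes__, then failure), with `memoryview()` standing in for the Py_buffer check.

```python
class Payload:
    """Hypothetical type with a well-defined bytes form."""

    def __bytes__(self):
        return b'payload'


def accept_by_type(obj):
    # Type-checking: only bytes and its subclasses get through.
    if isinstance(obj, bytes):
        return bytes(obj)
    raise TypeError('%s is not bytes' % type(obj).__name__)


def accept_by_duck_typing(obj):
    # Duck-typing, step 1: anything exposing the buffer protocol.
    try:
        return bytes(memoryview(obj))
    except TypeError:
        pass
    # Step 2: anything whose type defines __bytes__.
    method = getattr(type(obj), '__bytes__', None)
    if method is None:
        raise TypeError(
            '%s has no __bytes__ method' % type(obj).__name__)
    return method(obj)


def is_rejected(check, obj):
    """Report whether the given acceptance function raises TypeError."""
    try:
        check(obj)
    except TypeError:
        return True
    return False
```

In this sketch `Payload()` and `bytearray(b'ab')` pass the duck-typed check but fail the isinstance check, while a str fails both, in each case because of its type and never because of its contents.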
Proposed semantics for bytes formatting =======================================
%-interpolation ---------------
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers, except locale.
Example::
b'%4x' % 10 b' a'
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.
Example:
>>> b'%c' % 48 b'0'
>>> b'%c' % b'a' b'a'
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
- input type is something else? use its __bytes__ method; if there isn't one, raise an exception [2]
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
'a string'.encode('latin-1')
format ------
The format mini language codes, where they correspond with the %-interpolation codes, will be used as-is, with three exceptions::
- !s is not supported, as {} can mean the default for both str and bytes, in both Py2 and Py3. - !b is supported, and new Py3k code can use it to be explicit. - no other __format__ method will be called.
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
!r and !a are not supported.
The n integer and float format code is not supported.
Open Questions ==============
Currently non-numeric objects go through::
- Py_buffer - __bytes__ - failure
Do we want to add a __format_bytes__ method in there?
- Guaranteed to produce only ascii (as in b'10', not b'\x0a') - Makes more sense than using __bytes__ to produce ascii output - What if an object has both __bytes__ and __format_bytes__?
Do we need to support all the numeric format codes? The floating point exponential formats seem less appropriate, for example.
Proposed variations ===================
It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Footnotes =========
.. [1] string.Template is not under consideration. .. [2] TypeError, ValueError, or UnicodeEncodeError?
TypeError seems right to me. Definitely not UnicodeEncodeError - refusal to implicitly encode is not at all the same thing as an encoding error. Carl
On 01/15/2014 05:17 PM, Carl Meyer wrote:
I think the PEP could really use a rationale section
It will have one before it's done.
Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled
Excellent point. That section will also be present.
In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::
- duck-typing to allow/reject entry into a byte-stream - no value generated errors
This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing.
Good point, I'll reword that. It will be duck-typing.
I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are
Will do.
.. [2] TypeError, ValueError, or UnicodeEncodeError?
TypeError seems right to me. Definitely not UnicodeEncodeError - refusal to implicitly encode is not at all the same thing as an encoding error.
That's the direction I'm leaning, too. Thanks for your comments! -- ~Ethan~
On 16 Jan 2014 11:45, "Carl Meyer" <carl@oddbird.net> wrote:
Hi Ethan,
I haven't chimed into this discussion, but the direction it's headed recently seems right to me. Thanks for putting together a PEP. Some comments on it:
On 01/15/2014 05:13 PM, Ethan Furman wrote:
============================ Abstract ========
This PEP proposes adding the % and {} formatting operations from str to bytes [1].
I think the PEP could really use a rationale section summarizing _why_ these formatting operations are being added to bytes; namely that they are useful when working with various ASCIIish-but-not-properly-text network protocols and file formats, and in particular when porting code dealing with such formats/protocols from Python 2.
Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled (presuming the PEP is accepted). For instance: the main objection, AIUI, has been that the bytes type is for pure bytes-handling with no assumptions about encoding, and thus we should not add features to it that assume ASCIIness, and that may be attractive nuisances for people writing bytes-handling code that should not assume ASCIIness but will once they use the feature.
Close, but not quite - the concern was that this was a feature that didn't *inherently* imply a restriction to ASCII compatible data, but only did so when the numeric formatting codes were used. This made it a source of value dependent compatibility errors based on the format string, akin to the kind of value dependent errors seen when implicitly encoding arbitrary text as ASCII.
Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, thus placing at least the mod-formatting operation in the same category as the currently existing valid-for-sufficiently-ASCII-compatible-data only operations.
Current discussions suggest to me that the argument against implicit encoding operations that introduce latent data driven defects may still apply to bytes.format though, so I've reverted to being -1 on that.
Cheers, Nick.
And the refutation: that the bytes type already provides some operations that assume ASCIIness, and these new formatting features are no more of an attractive nuisance than those; since the syntax of the formatting mini-languages themselves itself assumes ASCIIness, there is not likely to be any temptation to use it with binary data that cannot.
Although it can be hard to arrive at accurate and agreed-on summaries of the discussion, recording such summaries in the PEP is important; it may help save our future selves and colleagues from having to revisit all these same discussions and megathreads.
Overriding Principles =====================
In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::
- duck-typing to allow/reject entry into a byte-stream - no value generated errors
This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. A duck-typing implementation would not use isinstance, it would call / check for the existence of a certain magic method instead.
I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are; that is, that the benefit of Python 3's model is that encoding is explicit, not implicit, making it harder to unwittingly write code that works as long as all data is ASCII, but fails as soon as someone feeds in non-ASCII text data.
Not everyone who reads this PEP will be steeped in years of discussion about the relative merits of the Python 2 vs 3 models; it doesn't hurt to spell out a few assumptions.
Proposed semantics for bytes formatting =======================================
%-interpolation ---------------
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers, except locale.
Example::
b'%4x' % 10 b' a'
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.
Example:
>>> b'%c' % 48 b'0'
>>> b'%c' % b'a' b'a'
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
- input type is something else? use its __bytes__ method; if there isn't one, raise an exception [2]
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
'a string'.encode('latin-1')
format ------
The format mini language codes, where they correspond with the %-interpolation codes, will be used as-is, with three exceptions::
- !s is not supported, as {} can mean the default for both str and bytes, in both Py2 and Py3. - !b is supported, and new Py3k code can use it to be explicit. - no other __format__ method will be called.
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
!r and !a are not supported.
The n integer and float format code is not supported.
Open Questions ==============
Currently non-numeric objects go through::
- Py_buffer - __bytes__ - failure
Do we want to add a __format_bytes__ method in there?
- Guaranteed to produce only ascii (as in b'10', not b'\x0a') - Makes more sense than using __bytes__ to produce ascii output - What if an object has both __bytes__ and __format_bytes__?
Do we need to support all the numeric format codes? The floating point exponential formats seem less appropriate, for example.
Proposed variations ===================
It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
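To make the two rejected str-handling variations concrete, here is a small emulation. The function names are hypothetical; neither is a proposed API. The first shows why automatic strict ASCII encoding fails only for some values, and the second shows how the repr variant succeeds silently while embedding quote characters in the output.

```python
def auto_encode_variant(value):
    """Rejected idea: implicitly encode str arguments as strict ASCII."""
    return value.encode("ascii", "strict")


def repr_variant(value):
    """Rejected idea: interpolate the ASCII-encoded repr of a str."""
    return ascii(value).encode("ascii")


works = auto_encode_variant("abc")        # fine for ASCII-only input

try:
    auto_encode_variant("caf\u00e9")      # same code, different data
    failed = False
except UnicodeEncodeError:
    failed = True

quoted = repr_variant("abc")              # no error, but quotes leak into the data
```

The first failure depends on the data rather than on the code, and the second never fails at all, so in both cases the mistake can surface far from where it was made.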
Footnotes
=========

.. [1] string.Template is not under consideration.
.. [2] TypeError, ValueError, or UnicodeEncodeError?
TypeError seems right to me. Definitely not UnicodeEncodeError - refusal to implicitly encode is not at all the same thing as an encoding error.
Carl _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On 1/16/2014 5:11 AM, Nick Coghlan wrote:
Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data,
Did you see my explanation, which I wrote in response to one of your earlier posts, of why I think the statement "the parsing of the format string itself assumes ASCII compatible data" is confused and wrong? The above seems to say that what I wrote is impossible, but perhaps I misunderstand what Guido and you mean. Among my questions are "by data, do you mean interpolated objects or interpolated bytes?" and "what restriction on 'data' do you intend by 'ASCII compatible'?".

-- Terry Jan Reedy
On Thu, Jan 16, 2014 at 1:18 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 1/16/2014 5:11 AM, Nick Coghlan wrote:
Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data,
Did you see my explanation, which I wrote in response to one of your earlier posts, of why I think the statement "the parsing of the format string itself assumes ASCII compatible data" is confused and wrong? The above seems to say that what I wrote is impossible, but perhaps I misunderstand what Guido and you mean. Among my questions are "by data, do you mean interpolated objects or interpolated bytes?" and "what restriction on 'data' do you intend by 'ASCII compatible'?".
Can you move the meta-discussion off-list? I'm getting tired of "did you understand what I said". -- --Guido van Rossum (python.org/~guido)
On 1/16/2014 4:59 PM, Guido van Rossum wrote:
I'm getting tired of "did you understand what I said".
I was asking whether I needed to repeat myself, but forget that. I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it.
On 17 Jan 2014 09:36, "Terry Reedy" <tjreedy@udel.edu> wrote:
On 1/16/2014 4:59 PM, Guido van Rossum wrote:
I'm getting tired of "did you understand what I said".
I was asking whether I needed to repeat myself, but forget that. I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it.

There are plenty of data formats (like SMTP and HTTP) that are constrained to be ASCII compatible, either globally, or locally in the parts being manipulated by an application (such as a file header). ASCII incompatible segments may be present, but in ways that allow the data processing to handle them correctly. The ASCII assuming methods on bytes objects are there to help in dealing with that kind of data. If the binary data is just one large block in a single text encoding, it's generally easier to just decode it to text, but multipart formats generally don't allow that.
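As an illustration of the kind of data described here, the sketch below handles a made-up HTTP-like message whose header block is ASCII and whose body is arbitrary binary. The message layout and the `split_message` helper are assumptions for the example, not a real protocol parser.

```python
def split_message(raw):
    """Split an HTTP-like message into ASCII headers and an opaque body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n"):
        name, _, value = line.partition(b":")
        # Only the header segment is treated as ASCII text.
        key = name.strip().lower().decode("ascii")
        headers[key] = value.strip().decode("ascii")
    return headers, body


raw = (b"Content-Type: application/octet-stream\r\n"
       b"Content-Length: 4\r\n"
       b"\r\n"
       b"\xff\xfe\x00\x01")

headers, body = split_message(raw)
```

The ASCII-assuming bytes methods (`strip`, `lower`, `split`) are applied only to the header segment; the body is passed through untouched because it cannot be decoded as text.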
Meta enough that I'll take Guido out of the CC. Nick Coghlan writes:
There are plenty of data formats (like SMTP and HTTP) that are constrained to be ASCII compatible,

"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1]

Worse, it's clearly confusing in this discussion. Let's stop using this term to mean

    the data format has elements that are defined to contain only bytes with ASCII coded character semantics

(which is the relevant restriction AFAICS -- I don't know of any ASCII-compatible formats where the bytes 128-255 are used for any purpose other than encoding non-ASCII characters). OTOH, if it *is* an ASCII-compatible text encoding, the semantics are dubious if the bytes versions of many of these methods/operations are used.

A documentation suggestion: It's easy enough to rewrite
constrained to be ASCII compatible, either globally, or locally in the parts being manipulated by an application (such as a file header). ASCII incompatible segments may be present, but in ways that allow the data processing to handle them correctly.
as

    containing 'well-defined segments constrained to be (strictly) ASCII-encoded' (aka ASCII segments).

And then you can say

    <specified bytes methods> are designed for use *only* on bytes that are ASCII segments; use on other data is likely to cause hard-to-diagnose corruption.

If there are other use cases for "ASCII-compatible data formats" as defined above (not worrying about codecs, because they are a very small minority of code-to-be-written at this point), I don't know about them. Does anyone?

If there are any, I'll be happy to revise. If not, that seems to be a precise and intelligible statement of the restrictions that is useful to the practical use cases. And nothing stops users who think they know what they're doing from using them in other contexts (which can be documented if they turn out to be broadly useful).

Footnotes:
[1] "ASCII coded character semantics" is of course mildly ambiguous due to considerations like EOL conventions. But "you know what I'm talking about".
On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
Meta enough that I'll take Guido out of the CC.
Nick Coghlan writes:
There are plenty of data formats (like SMTP and HTTP) that are constrained to be ASCII compatible,
"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1]
Examples, and counter-examples, may help. Let me see if I have got this right:

an ASCII-compatible encoding may be an ASCII-superset like Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars are encoded to the same bytes as ASCII, and non-ASCII chars are not. A counter-example would be UTF-16, or some of the Asian encodings like Big5. Am I right so far?

But Nick isn't talking about an encoding, he's talking about a data format. I think that an ASCII-compatible format means one where (in at least *some* parts of the data) bytes between 0 and 127 have the same meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII character "T". This doesn't mean that every byte 84 means "T", only that some of them do -- hopefully in well-defined sections of the data. Below, you introduce the term "ASCII segments" for these.
Worse, it's clearly confusing in this discussion. Let's stop using this term to mean
the data format has elements that are defined to contain only bytes with ASCII coded character semantics
(which is the relevant restriction AFAICS -- I don't know of any ASCII-compatible formats where the bytes 128-255 are used for any purpose other than encoding non-ASCII characters). OTOH, if it *is* an ASCII-compatible text encoding, the semantics are dubious if the bytes versions of many of these methods/operations are used.
A documentation suggestion: It's easy enough to rewrite
constrained to be ASCII compatible, either globally, or locally in the parts being manipulated by an application (such as a file header). ASCII incompatible segments may be present, but in ways that allow the data processing to handle them correctly.
as
containing 'well-defined segments constrained to be (strictly) ASCII-encoded' (aka ASCII segments).
And then you can say
<specified bytes methods> are designed for use *only* on bytes that are ASCII segments; use on other data is likely to cause hard-to-diagnose corruption.
An example: if you have the byte b'\x63', calling upper() on that will return b'\x43'. That is only meaningful if the byte is intended as the ASCII character "c".
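The same point in runnable form, as a small sketch: `upper()` on bytes rewrites any byte in the ASCII lowercase range regardless of what the byte was meant to represent. The little-endian integer below is an invented example of non-text data.

```python
import struct

# An ASCII segment: the byte really is the letter "c".
ascii_segment = b"\x63"
shouted = ascii_segment.upper()                      # b'\x43', i.e. b'C'

# A byte outside the ASCII letter range is left alone.
high_byte = b"\xe9"
unchanged = high_byte.upper()

# Binary data that merely happens to contain 0x63 bytes.
packed = struct.pack("<H", 0x6363)                   # two bytes, both 0x63
corrupted = struct.unpack("<H", packed.upper())[0]   # value silently changed
```

Nothing raises in the last case; the integer simply changes from 0x6363 to 0x4343, which is the hard-to-diagnose corruption the suggested wording warns about.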
Footnotes: [1] "ASCII coded character semantics" is of course mildly ambiguous due to considerations like EOL conventions. But "you know what I'm talking about".
I think I know what you're talking about, but don't know for sure unless I explain it back to you.

-- Steven
Steven D'Aprano writes:
On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1]
Examples, and counter-examples, may help. Let me see if I have got this right: an ASCII-compatible encoding may be an ASCII-superset like Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars are encoded to the same bytes as ASCII, and non-ASCII chars are not. A counter-example would be UTF-16, or some of the Asian encodings like Big5. Am I right so far?
All correct.
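The distinction can be spot-checked with a few lines of Python. This is a sketch only: `is_ascii_compatible_for` is a hypothetical helper that tests a sample string, so a True result is evidence about that sample rather than a proof about the encoding.

```python
def is_ascii_compatible_for(text, encoding):
    """Check, for this sample, that ASCII characters keep their ASCII byte
    values and that every other character encodes only to bytes >= 128."""
    for ch in text:
        encoded = ch.encode(encoding)
        if ord(ch) < 128:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 128 for b in encoded):
            return False
    return True


sample = "T\u00e9"   # an ASCII letter plus a non-ASCII one
results = {
    name: is_ascii_compatible_for(sample, name)
    for name in ("latin-1", "utf-8", "utf-16")
}
```

For this sample, Latin-1 and UTF-8 pass while UTF-16 fails, since it encodes "T" as more than the single byte 84.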
But Nick isn't talking about an encoding, he's talking about a data format. I think that an ASCII-compatible format means one where (in at least *some* parts of the data) bytes between 0 and 127 have the same meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII character "T". This doesn't mean that every byte 84 means "T", only that some of them do -- hopefully in well-defined sections of the data. Below, you introduce the term "ASCII segments" for these.
Yes, except that I believe Nick, as well as the "file-and-wire guys", strengthen "hopefully well-defined" to just "well-defined".
<specified bytes methods> are designed for use *only* on bytes that are ASCII segments; use on other data is likely to cause hard-to-diagnose corruption.
An example: if you have the byte b'\x63', calling upper() on that will return b'\x43'. That is only meaningful if the byte is intended as the ASCII character "c".
Good example.
For the record, we've got a pretty good thread (not this good, though!) over on the numpy list about how to untangle the mess that has resulted from porting text-file-parsing code to py3 (and the underlying issue with the 'S' data type in numpy...) One note from the github issue: """ The use of asbytes originates only from the fact that b'%d' % (20,) does not work. """ So yeah PEP 461! (even if too late for numpy...) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 1/17/2014 4:37 PM, Chris Barker wrote:
For the record, we've got a pretty good thread (not this good, though!) over on the numpy list about how to untangle the mess that has resulted from porting text-file-parsing code to py3 (and the underlying issue with the 'S' data type in numpy...)
One note from the github issue: """ The use of asbytes originates only from the fact that b'%d' % (20,) does not work. """
So yeah PEP 461! (even if too late for numpy...)
Would they use "(u'%d' % (20,)).encode('ascii')" for that? Just curious as to what they're planning on doing. Eric.
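For comparison, a minimal sketch of the workaround asked about here next to the direct form that PEP 461 proposes. The helper name `int_to_ascii_bytes` is hypothetical, and the direct form assumes an interpreter with bytes %-interpolation (Python 3.5 or later).

```python
def int_to_ascii_bytes(n):
    """Format in the text domain, then encode as strict ASCII."""
    return ("%d" % (n,)).encode("ascii")


via_text = int_to_ascii_bytes(20)

# With bytes %-interpolation available, no round trip through str is needed.
direct = b"%d" % (20,)
```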
On 17 January 2014 21:37, Chris Barker <chris.barker@noaa.gov> wrote:
For the record, we've got a pretty good thread (not this good, though!) over on the numpy list about how to untangle the mess that has resulted from porting text-file-parsing code to py3 (and the underlying issue with the 'S' data type in numpy...)
One note from the github issue: """ The use of asbytes originates only from the fact that b'%d' % (20,) does not work. """
So yeah PEP 461! (even if too late for numpy...)
The discussion about numpy.loadtxt and the 'S' dtype is not relevant to PEP 461. PEP 461 is about facilitating handling ascii/binary protocols and file formats. The loadtxt function is for reading text files. Reading text files is already handled very well in Python 3. The only caveat is that you need to specify the encoding when you open the file.

The loadtxt function doesn't specify the encoding when it opens the file so on Python 3 it gets the system default encoding when reading from the file. Since the 'S' dtype is for an array of bytes the loadtxt function has to encode the unicode strings before storing them in the array. The function has no idea what encoding the user wants so it just uses latin-1 leading to mojibake if the file content and encoding are not compatible with latin-1 e.g.: utf-8.

The loadtxt function is a classic example of how *not* to do text and whoever made it that way probably didn't understand unicode and the Python 3 text model. If they did understand what they were doing then they knew that they were implementing a dirty hack.

If you want to draw a relevant lesson from that thread in this one then the lesson argues against PEP 461: adding back the bytes formatting methods helps people who refuse to understand text processing and continue implementing dirty hacks instead of doing it properly.

Oscar
On 19 January 2014 00:39, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
If you want to draw a relevant lesson from that thread in this one then the lesson argues against PEP 461: adding back the bytes formatting methods helps people who refuse to understand text processing and continue implementing dirty hacks instead of doing it properly.
Yes, that's why it has taken so long to even *consider* bringing binary interpolation support back - one of our primary concerns in the early days of Python 3 was developers (including core developers!) attempting to translate bad habits from Python 2 into Python 3 by continuing to treat binary data as text. Making interpolation a purely text domain operation helped strongly in enforcing this distinction, as it generally required thinking about encoding issues in order to get things into the text domain (or hitting them with the "latin-1" hammer, in which case... *sigh*).

The reason PEP 460/461 came up is that we *do* acknowledge that there is a legitimate use case for binary interpolation support when dealing with binary formats that contain ASCII compatible segments. Now that people have had a few years to get used to the Python 3 text model, lowering the barrier to migration from Python 2 and better handling that use case in Python 3 in general has finally tilted the scales in favour of providing the feature (assuming Guido is happy with PEP 461 after Ethan finishes the Rationale section).

(Tangent)

While I agree it's not relevant to the PEP 460/461 discussions, so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem. If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).

If it is intended to allow S columns to contain text in arbitrary encodings, then that should also be supported by the current API with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing.

However, if you're currently decoding S columns with latin-1 *before* passing the value to the converter, then you'll need to use a WSGI style decoding dance instead:

    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8")  # For example

That's more wasteful than just passing the raw bytes through for decoding, but is the simplest backwards compatible option if you're doing latin-1 decoding already.

If different rows in the *same* column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
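A short worked example of the "decoding dance" described above, assuming a value that was really UTF-8 but has already been decoded as latin-1. The sample data is invented; `fix_encoding` is the function from the message. Note that `codecs.getdecoder()` returns a callable producing a `(text, bytes_consumed)` pair, so a converter built on it would need to take the first element.

```python
import codecs


def fix_encoding(text):
    # Undo a latin-1 decode, then decode the original bytes as UTF-8.
    return text.encode("latin-1").decode("utf-8")


raw = "\u00e9".encode("utf-8")          # two bytes on disk
mis_decoded = raw.decode("latin-1")     # two wrong characters
repaired = fix_encoding(mis_decoded)    # the intended single character

# Passing the raw bytes straight to a decoder avoids the round trip.
decode_utf8 = codecs.getdecoder("utf-8")
text, consumed = decode_utf8(raw)
```

The round trip through latin-1 is lossless because latin-1 maps every byte value 0-255 to exactly one code point, which is what makes this trick safe.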
On 19 January 2014 06:19, Nick Coghlan <ncoghlan@gmail.com> wrote:
While I agree it's not relevant to the PEP 460/461 discussions, so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.
Actually there is a problem. If it explicitly specified the encoding as latin-1 when opening the file then it could document the fact that it works for latin-1 encoded files. However it actually uses the system default encoding to read the file and then converts the strings to bytes with the as_bytes function that is hard-coded to use latin-1: https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the file content is white-space and newline compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1 it will only work if the system default encoding is latin-1. I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).
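The behaviour described here can be simulated without numpy. The sketch below is not numpy's code; `read_then_as_bytes` is a hypothetical stand-in for "read with the default encoding, then encode with hard-coded latin-1", with the default encoding passed in explicitly so the effect is reproducible.

```python
def read_then_as_bytes(raw, default_encoding):
    """Simulate: decode file bytes with the system default encoding,
    then convert to bytes with a hard-coded latin-1 encode."""
    text = raw.decode(default_encoding)
    return text.encode("latin-1")


utf8_file = "caf\u00e9".encode("utf-8")
latin1_file = "caf\u00e9".encode("latin-1")

# Default encoding latin-1: the original bytes survive for either file.
ok_utf8 = read_then_as_bytes(utf8_file, "latin-1") == utf8_file
ok_latin1 = read_then_as_bytes(latin1_file, "latin-1") == latin1_file

# Default encoding utf-8: the stored bytes no longer match the file.
changed = read_then_as_bytes(utf8_file, "utf-8") != utf8_file

# Characters outside latin-1 make the hard-coded encode fail outright.
try:
    read_then_as_bytes("\u20ac".encode("utf-8"), "utf-8")
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True
```

Only the latin-1 default round-trips the file's bytes, which is the sense in which the approach "only works if the system default encoding is latin-1".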
If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
This is the right solution. Have an encoding argument, document the fact that it will use the system default encoding if none is specified, and re-encode using the same encoding to fit any dtype='S' bytes column. This will then work for any encoding including the ones that aren't ASCII-compatible (e.g. utf-16). Then instead of having a compat module with an as_bytes helper to get rid of all the unicode strings on Python 3, you can have a compat module with an open_unicode helper to do the right thing on Python 2. The as_bytes function is just a way of fighting the Python 3 text model: "I don't care about mojibake just do whatever it takes to shut up the interpreter and its error messages and make sure it works for ASCII data."
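A minimal sketch of that design, assuming a reader that takes `encoding` and `errors` arguments and re-encodes each field with the same encoding. `load_byte_fields` is a hypothetical function, not numpy's API; the system default is used only when no encoding is given, as suggested above.

```python
import io
import locale


def load_byte_fields(stream, encoding=None, errors="strict"):
    """Read one text field per line from a binary stream and return each
    field re-encoded with the same encoding it was decoded with."""
    if encoding is None:
        # Documented fallback: the system default encoding.
        encoding = locale.getpreferredencoding(False)
    reader = io.TextIOWrapper(stream, encoding=encoding, errors=errors)
    fields = [line.strip() for line in reader.read().splitlines()]
    return [field.encode(encoding, errors) for field in fields if field]


# Works for an encoding that is not ASCII-compatible.
data = "caf\u00e9\n\u20ac5\n".encode("utf-16-le")
fields = load_byte_fields(io.BytesIO(data), encoding="utf-16-le")
```

Whitespace handling happens in the text domain, so no assumption about ASCII-compatible bytes is needed. The example uses "utf-16-le" rather than "utf-16" only to avoid a byte order mark being added to every re-encoded field.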
If it is intended to allow S columns to contain text in arbitrary encodings, then that should also be supported by the current API with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing. However, if you're currently decoding S columns with latin-1 *before* passing the value to the converter, then you'll need to use a WSGI style decoding dance instead:
    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8")  # For example
That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.
That's more wasteful than just passing the raw bytes through for decoding, but is the simplest backwards compatible option if you're doing latin-1 decoding already.
If different rows in the *same* column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).
Ditto. Oscar
On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.
Actually there is a problem. If it explicitly specified the encoding as latin-1 when opening the file then it could document the fact that it works for latin-1 encoded files. However it actually uses the system default encoding to read the file
which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well thought out one, but rather what fell out of trying to make it "just work" on py3.
and then converts the strings to bytes with the as_bytes function that is hard-coded to use latin-1: https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
So it only works if the system default encoding is latin-1 and the file content is white-space and newline compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1 it will only work if the system default encoding is latin-1. I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).
Even if it was a common default it would be a "bad idea". Fortunately (?), since it really is broken, we can fix it without being too constrained by backwards compatibility.
If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
This is the right solution. Have an encoding argument, document the fact that it will use the system default encoding if none is specified, and re-encode using the same encoding to fit any dtype='S' bytes column. This will then work for any encoding including the ones that aren't ASCII-compatible (e.g. utf-16).
Exactly, except I don't think the system encoding as a default is a good choice. If there is a default MOST people will use it. And it will work for a lot of their test code. Then it will break if the code is passed to a system with a different default encoding, or a file comes from another source in a different encoding. This is very, very likely. Far more likely than files consistently being in the system encoding....
default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing.
that seems to work at the moment, actually, if done with care.
That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.
agreed -- that's just getting crazy -- the only use-case I can imagine is to clean up a file that got mojibaked by some other process -- not really the use case for loadtxt and friends.

-Chris
On 17/01/2014 10:18 a.m., Terry Reedy wrote:
On 1/16/2014 5:11 AM, Nick Coghlan wrote:
Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data,
Nick's initial arguments against bytes formatting were very abstract and philosophical, along the lines that it violated some pure mental model of text/bytes separation. Then Guido said something that Nick took to be an equal and opposite philosophical argument that cancelled out his original objections, and he withdrew them. I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing. -- Greg
On 01/16/2014 05:32 PM, Greg wrote:
I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing.
And a good thing, too, on both counts! :)

A few folks have suggested not implementing .format() on bytes; I've been resistant, but then I remembered that format is also a function.

http://docs.python.org/3/library/functions.html?highlight=ascii#format
======================================================================
format(value[, format_spec])

    Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec will depend on the type of the value argument, however there is a standard formatting syntax that is used by most built-in types: Format Specification Mini-Language.

    The default format_spec is an empty string which usually gives the same effect as calling str(value).

    A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is not found or if either the format_spec or the return value are not strings.
======================================================================

Given that, I can relent on .format and just go with .__mod__ . A low-level service for a low-level protocol, what? ;)

--
~Ethan~
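The quoted documentation can be checked directly. The sketch below uses two invented classes, `Text` and `Raw`, to show that `format()` looks up `__format__` on the type and insists on a str result, which is why the builtin does not carry over naturally to bytes.

```python
class Text:
    def __format__(self, spec):
        return "text:" + spec


class Raw:
    def __format__(self, spec):
        return b"raw"            # deliberately not a str


formatted = format(Text(), ">4")

# A bytes return value is rejected by the builtin.
error = None
try:
    format(Raw(), "")
except TypeError as exc:
    error = exc

# The lookup goes through the type, so an instance attribute is ignored.
instance = Text()
instance.__format__ = lambda spec: "instance"
ignored = format(instance, "")
```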
On 17 January 2014 11:51, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/16/2014 05:32 PM, Greg wrote:
I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing.
And a good thing, too, on both counts! :)
A few folks have suggested not implementing .format() on bytes; I've been resistant, but then I remembered that format is also a function.
http://docs.python.org/3/library/functions.html?highlight=ascii#format ====================================================================== format(value[, format_spec])
Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec will depend on the type of the value argument, however there is a standard formatting syntax that is used by most built-in types: Format Specification Mini-Language.
The default format_spec is an empty string which usually gives the same effect as calling str(value).
A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is not found or if either the format_spec or the return value are not strings. ======================================================================
Given that, I can relent on .format and just go with .__mod__ . A low-level service for a low-level protocol, what? ;)
Exactly - while I'm a fan of the new extensible formatting system and strongly prefer it to printf-style formatting for text, it also has a whole lot of complexity that is hard to translate to the binary domain, including the format() builtin and __format__ methods.

Since the relevant use cases appear to be already covered adequately by printf-style formatting, attempting to translate the flexible text formatting system as well just becomes additional complexity we don't need.

I like Stephen Turnbull's suggestion of using "binary formats with ASCII segments" to distinguish the kind of formats we're talking about from ASCII compatible text encodings, and I think Python 3.5 will end up with a suite of solutions that suitably covers all use cases, just by bringing back printf-style formatting directly to bytes:

* format(), str.format(), str.format_map(): a rich extensible text formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may occasionally be used as a text formatting optimisation tool (since the inflexibility means it will likely always be marginally faster than the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify production of data in variable length binary formats that contain ASCII segments
* the struct module: rich (but not extensible) formatting system for fixed length binary formats

In Python 2, the binary format with ASCII segments use case was intermingled with general purpose text formatting on the str type, which is I think the main reason it has taken us so long to convince ourselves it is something that is genuinely worth bringing back in a more limited form in Python 3, rather than just being something we wanted back because we were used to having it in Python 2.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
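One small example for each of the four tools in the list above, as a sketch. The values are invented, and the bytes line assumes an interpreter where bytes %-interpolation is available (Python 3.5 or later).

```python
import datetime
import struct

# Rich, extensible text formatting, including date interpolation.
text_line = "{:%Y-%m-%d} {:>5}".format(datetime.date(2014, 1, 17), "ok")

# printf-style text formatting, kept for backwards compatibility.
legacy_text = "%-5s|%03d" % ("ab", 7)

# printf-style bytes formatting for variable length data with ASCII segments.
header = b"Content-Length: %d\r\n" % 42

# struct for fixed length binary layouts (big-endian, no padding).
record = struct.pack(">HI", 1, 2)
```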
On 1/16/2014 9:46 PM, Nick Coghlan wrote:
On 17 January 2014 11:51, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/16/2014 05:32 PM, Greg wrote:
I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing.
And a good thing, too, on both counts! :)
A few folks have suggested not implementing .format() on bytes; I've been resistant, but then I remembered that format is also a function.
http://docs.python.org/3/library/functions.html?highlight=ascii#format ====================================================================== format(value[, format_spec])
Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec will depend on the type of the value argument, however there is a standard formatting syntax that is used by most built-in types: Format Specification Mini-Language.
The default format_spec is an empty string which usually gives the same effect as calling str(value).
A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is not found or if either the format_spec or the return value are not strings. ======================================================================
Given that, I can relent on .format and just go with .__mod__ . A low-level service for a low-level protocol, what? ;)

Exactly - while I'm a fan of the new extensible formatting system and strongly prefer it to printf-style formatting for text, it also has a whole lot of complexity that is hard to translate to the binary domain, including the format() builtin and __format__ methods.

Since the relevant use cases appear to be already covered adequately by printf-style formatting, attempting to translate the flexible text formatting system as well just becomes additional complexity we don't need.
I like Stephen Turnbull's suggestion of using "binary formats with ASCII segments" to distinguish the kind of formats we're talking about from ASCII compatible text encodings,
I liked that too, and almost said so on his posting, but will say it here, instead.
and I think Python 3.5 will end up with a suite of solutions that suitably covers all use cases, just by bringing back printf-style formatting directly to bytes:
* format(), str.format(), str.format_map(): a rich extensible text formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may occasionally be used as a text formatting optimisation tool (since the inflexibility means it will likely always be marginally faster than the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify production of data in variable length binary formats that contain ASCII segments
* the struct module: rich (but not extensible) formatting system for fixed length binary formats
Adding format codes with variable length could extend the struct module to additional uses. C structs, on which it is modeled, often get around the difficulty of variable length items by defining one variable length item at the end, or by defining offsets in the fixed part that point to the variable length parts that follow. Such a structure cannot presently be created by struct alone.
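A sketch of the offset-in-the-fixed-part layout mentioned here, showing the manual step that struct currently leaves to the caller. The record layout (a little-endian offset and length followed by the payload) and the helper names are invented for illustration.

```python
import struct

# Fixed part: offset and length of the variable part, both unsigned 16-bit.
HEADER = struct.Struct("<HH")


def pack_record(payload):
    """Build header + payload; struct handles only the fixed-size header."""
    offset = HEADER.size
    return HEADER.pack(offset, len(payload)) + payload


def unpack_record(data):
    """Read the fixed header, then slice out the variable length part."""
    offset, length = HEADER.unpack_from(data, 0)
    return data[offset:offset + length]


packed = pack_record(b"hello")
payload = unpack_record(packed)
```

The concatenation and slicing outside `HEADER.pack` and `HEADER.unpack_from` are the parts that a variable length format code would have to absorb.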
In Python 2, the binary format with ASCII segments use case was intermingled with general purpose text formatting on the str type, which is I think the main reason it has taken us so long to convince ourselves it is something that is genuinely worth bringing back in a more limited form in Python 3, rather than just being something we wanted back because we were used to having it in Python 2.
Cheers, Nick.
Greg writes:
I don't think it matters whether the internal details of [the EIBTI vs. PBP] debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing.
I think some of it matters to the documentation.
Greg <greg.ewing@canterbury.ac.nz> wrote:
I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing.
I've been mostly steering clear of the metaphysical and writing code today. ;-) An extremely rough patch has been uploaded: http://bugs.python.org/issue20284 I have a new one almost ready that introduces __ascii__ rather than overloading __format__. I like it better, will upload to issue tracker soon. Regards, Neil
Carl Meyer <carl@oddbird.net> wrote:
I think the PEP could really use a rationale section summarizing _why_ these formatting operations are being added to bytes
I agree. My attempt at re-writing the PEP is below.
In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::
  - duck-typing to allow/reject entry into a byte-stream
  - no value generated errors
This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing.
Again, I agree. We should avoid isinstance checks if possible.

Abstract
========

This PEP proposes adding %-interpolation to the bytes object.


Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the clean
separation of byte strings (i.e. the "bytes" object) from character
strings (i.e. the "str" object). The benefit is that character
encodings must be explicitly specified and the risk of corrupting
character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose. For example, programs that deal
with network protocols often manipulate ASCII encoded strings. Since
the "bytes" type does not support string formatting, extra encoding to
and decoding from the "str" type is required.

For simplicity and convenience it is desirable to introduce formatting
methods to "bytes" that allow formatting of ASCII-encoded character
data. This change would blur the clean separation of byte strings and
character strings. However, it is felt that the practical benefits
outweigh the purity costs. The implicit assumption of ASCII-encoding
would be limited to formatting methods.

One source of many problems with the Python 2 Unicode implementation
is the implicit coercion of Unicode character strings into byte
strings using the "ascii" codec. If the character strings contain only
ASCII characters, all is well. However, if the string contains a
non-ASCII character then coercion causes an exception.

The combination of implicit coercion and value dependent failures has
proven to be a recipe for hard to debug errors. A program may seem to
work correctly when tested (e.g. string input that happened to be
ASCII only) but later would fail, often with a traceback far from the
source of the real error. The formatting methods for bytes() should
avoid this problem by not implicitly encoding data that might fail
based on the content of the data.

Another desirable feature is to allow arbitrary user classes to be
used as formatting operands. Generally this is done by introducing a
special method that can be implemented by the new class.


Proposed semantics for bytes formatting
=======================================

Special method __ascii__
------------------------

A new special method, analogous to __format__, is introduced. This
method takes a single argument, a format specifier. The return value
is a bytes object. Objects that have an ASCII only representation can
implement this method to allow them to be used as format operands.
Objects with natural byte representations should implement __bytes__
or the Py_buffer API.


%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
will be supported, and will work as they do for str, including the
padding, justification and other related modifiers. To avoid having to
introduce two special methods, the format specifications will be
translated to equivalent __format__ specifiers and the __ascii__
method of each argument will be called.

Example::

   >>> b'%4x' % 10
   b'   a'

%c will insert a single byte, either from an int in range(256), or
from a bytes argument of length 1.

Example::

   >>> b'%c' % 48
   b'0'

   >>> b'%c' % b'a'
   b'a'

%s is restricted in what it will accept::

  - input type supports Py_buffer or has __bytes__?
    use it to collect the necessary bytes (may contain non-ASCII
    characters)

  - input type is something else?
    use its __ascii__ method; if there isn't one, raise TypeError

Examples::

   >>> b'%s' % b'abc'
   b'abc'

   >>> b'%s' % 3.14
   b'3.14'

   >>> b'%4s' % 12
   b'  12'

   >>> b'%s' % 'hello world!'
   Traceback (most recent call last):
   ...
   TypeError: 'hello world' has no __ascii__ method, perhaps you need to encode it?

.. note::

   Because the str type does not have an __ascii__ method, attempts to
   directly use 'a string' as a bytes interpolation value will raise
   an exception. To use 'string' values, they must be encoded or
   otherwise transformed into a bytes sequence::

      'a string'.encode('latin-1')


Unsupported % format codes
^^^^^^^^^^^^^^^^^^^^^^^^^^

%r (which calls __repr__) is not supported.


format
------

The format() method will not be implemented at this time but may be
added in a later Python release. The __ascii__ method is designed to
make adding it later simpler.


Open Questions
==============

Do we need to support the complete set of format codes? For
complicated formatting perhaps using the str object to do the
formatting and encoding the result is sufficient.

Should Python check that the bytes returned by __ascii__ are in the
range 0-127 (i.e. ASCII)? That seems of little utility since the error
would be similar to a unicode-to-str coercion failure in Python 2 and
the traceback would normally be far removed from the real error.
Built-in types would be designed to never return non-ASCII characters
from the __ascii__ method.


Proposed variations
===================

Instead of introducing a new special method, have numeric types
implement __bytes__.

  - Adding __bytes__ to the int object is not backwards compatible.
    bytes(<int>) already has an incompatible meaning.

It has been suggested to use %b for bytes instead of %s.

  - Rejected, using %s will make porting code from Python 2 easier.

It was suggested to disallow %s from accepting numbers.

  - Rejected, to ease porting of Python 2 code, %s should accept
    number operands.

It has been proposed to automatically use .encode('ascii','strict')
for str arguments to %s.

  - Rejected as this would lead to intermittent failures. Better to
    have the operation always fail so the trouble-spot can be
    correctly fixed.

It has been proposed to have %s return the ascii-encoded repr when the
value is a str (b'%s' % 'abc' --> b"'abc'").

  - Rejected as this would lead to hard to debug failures far from the
    problem site. Better to have the operation always fail so the
    trouble-spot can be easily fixed.

Instead of having %-interpolation call __ascii__, introduce a second
special method analogous to __str__ and have %s call it.

  - Rejected, __ascii__ is both necessary for implementing format()
    and sufficient for %-interpolation. While implementing an
    __ascii__ method is more complicated due to the specifier
    argument, the number of classes which will do so is limited.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
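The %s lookup order proposed in the draft above can be modelled in pure Python. This is only a sketch of the draft's rules, not CPython's behaviour: __ascii__ is the draft's proposed hook and does not exist in released Python, and the names resolve_percent_s and Celsius are hypothetical, introduced here for illustration.

```python
def resolve_percent_s(value, spec=""):
    """Hypothetical model of the draft's %s lookup order."""
    # Step 1a: objects exposing the buffer protocol are copied as-is.
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    # Step 1b: objects with a natural byte representation.
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is not None:
        return to_bytes(value)
    # Step 2: the proposed (hypothetical) __ascii__ hook.
    to_ascii = getattr(type(value), "__ascii__", None)
    if to_ascii is not None:
        return to_ascii(value, spec)
    # Anything else is rejected based on its type, never on its value.
    raise TypeError(
        "%r has no __ascii__ method, perhaps you need to encode it?" % (value,)
    )


class Celsius:
    """Toy operand with an ASCII-only representation."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __ascii__(self, spec):
        return format(self.degrees, spec).encode("ascii") + b" C"


print(resolve_percent_s(b"abc"))          # b'abc'
print(resolve_percent_s(Celsius(21.5)))   # b'21.5 C'
```

Note that str is rejected regardless of its contents, which is the point of the draft's "no value generated errors" rule: the check depends on which methods the type offers, not on whether a particular string happens to be ASCII.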
On 1/15/2014 4:13 PM, Ethan Furman wrote:
- no value generated errors ...
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.
what does

    x = 354
    b"%c" % x

produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2). Is this an intended exception to the overriding principle?
Surprisingly, in this case the exception is just what the doctor ordered. :-)
-- --Guido van Rossum (python.org/~guido)
Glenn Linderman wrote:
    x = 354
    b"%c" % x

Is this an intended exception to the overriding principle?
I think it's an unavoidable one, unless we want to introduce an "integer in the range 0-255" type. But that would just push the problem into another place, since

    b"%c" % byte(x)

would then blow up on byte(x) if x were out of range. If you really want to make sure it won't crash, you can always do

    b"%c" % (x & 0xff)

or whatever your favourite method of mangling out-of-range ints is.

-- Greg
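For reference, a short sketch of both behaviours discussed here, assuming Python 3.5 or later where bytes %-formatting is available. The exception class is caught broadly because the thread does not pin down which one is raised; the variable names are illustrative.

```python
x = 354

# Out-of-range ints are rejected by %c: a value-dependent error.
try:
    b"%c" % x
except (OverflowError, ValueError) as exc:
    error = exc
else:
    error = None

# Masking keeps only the low 8 bits: 354 & 0xff == 98, the byte for "b".
masked = b"%c" % (x & 0xFF)

print(type(error).__name__)
print(masked)  # b'b'
```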
On 01/15/2014 06:12 PM, Glenn Linderman wrote:
what does

    x = 354
    b"%c" % x

produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2).
Is this an intended exception to the overriding principle?
Hmm, thanks for spotting that. Yes, that would be a value error if anything over 255 is used, both currently in Py2, and for bytes in Py3. As Carl suggested, a little more explanation is needed in the PEP. -- ~Ethan~
participants (25)

- Antoine Pitrou
- Barry Warsaw
- Brett Cannon
- Carl Meyer
- Chris Barker
- Eric Snow
- Eric V. Smith
- Ethan Furman
- Glenn Linderman
- Greg
- Greg Ewing
- Guido van Rossum
- Isaac Morland
- Jan Kaliszewski
- Mark Lawrence
- Michael Urman
- Neil Schemenauer
- Nick Coghlan
- Oscar Benjamin
- Paul Moore
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Yury Selivanov