Here's the text for your reading pleasure.  I'll commit the PEP after I add some markup.

Major change:

  - dropped `format` support, just using %-interpolation

Coming soon:

  - Rationale section  ;)

================================================================================
PEP: 461
Title: Adding % formatting to bytes
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-01-13
Python-Version: 3.5
Post-History: 2014-01-14, 2014-01-15, 2014-01-17
Resolution:


Abstract
========

This PEP proposes adding % formatting operations similar to Python 2's str type
to bytes [1]_ [2]_.


Overriding Principles
=====================

In order to avoid the problems of auto-conversion and Unicode exceptions that
could plague Py2 code, all object checking will be done by duck-typing, not by
values contained in a Unicode representation [3]_.


Proposed semantics for bytes formatting
=======================================

%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
will be supported, and will work as they do for str, including the
padding, justification and other related modifiers.

Example::

   >>> b'%4x' % 10
   b'   a'

   >>> '%#4x' % 10
   ' 0xa'

   >>> '%04X' % 10
   '000A'

%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1, not from a str.

Example:

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

%s is restricted in what it will accept::

  - input type supports Py_buffer?
    use it to collect the necessary bytes

  - input type is something else?
    use its __bytes__ method; if there isn't one, raise a TypeError

Examples:

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: 3.14 has no __bytes__ method

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?

.. note::

   Because the str type does not have a __bytes__ method, attempts to
   directly use 'a string' as a bytes interpolation value will raise an
   exception.  To use 'string' values, they must be encoded or otherwise
   transformed into a bytes sequence::

      'a string'.encode('latin-1')


Numeric Format Codes
--------------------

To properly handle int and float subclasses, int(), index(), and float()
will be called on the objects intended for (d, i, u), (b, o, x, X), and
(e, E, f, F, g, G).


Unsupported codes
-----------------

%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
supported.


Proposed variations
===================

It was suggested to let %s accept numbers, but since numbers have their own
format codes this idea was discarded.

It has been suggested to use %b for bytes instead of %s.

  - Rejected as %b does not exist in Python 2.x %-interpolation, which is
    why we are using %s.

It has been proposed to automatically use .encode('ascii','strict') for str
arguments to %s.

  - Rejected as this would lead to intermittent failures.  Better to have the
    operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have %s return the ascii-encoded repr when the value
is a str  (b'%s' % 'abc'  --> b"'abc'").

  - Rejected as this would lead to hard to debug failures far from the problem
    site.  Better to have the operation always fail so the trouble-spot can be
    easily fixed.

Originally this PEP also proposed adding format style formatting, but it
was decided that format and its related machinery were all strictly text
(aka str) based, and it was dropped.

Various new special methods were proposed, such as __ascii__,
__format_bytes___, etc.; such methods are not needed at this time, but can
be visited again later if real-world use shows deficiencies with this solution.


Footnotes
=========

.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither string.Template, format, nor str.format are under consideration.
.. [3] %c is not an exception as neither of its possible arguments are unicode.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
================================================================================
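As a rough illustration of the two %s rules in the draft, here is a minimal pure-Python sketch. The helper name `interpolate_s` is hypothetical, and `memoryview` is used as a stand-in for the C-level Py_buffer check; the real implementation would live in C and may differ in detail.

```python
def interpolate_s(value):
    """Emulate the draft's %s rule for one value (illustrative only)."""
    # Rule 1: objects that export the buffer protocol supply their raw bytes.
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    # Rule 2: otherwise use the type's __bytes__ method, if it has one.
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is None:
        # Mirror the draft's hint for str arguments.
        hint = ", perhaps you need to encode it?" if isinstance(value, str) else ""
        raise TypeError("%r has no __bytes__ method%s" % (value, hint))
    return to_bytes(value)


class Packet:
    """Hypothetical type with a real binary form."""

    def __bytes__(self):
        return b"\x01\x02"
```

With this sketch, `interpolate_s(b'abc')` and `interpolate_s(Packet())` return bytes, while a float or a str raises TypeError regardless of the value's content.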
On Fri, Jan 17, 2014 at 11:49 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
Here's the text for your reading pleasure. I'll commit the PEP after I add some markup.
Major change:
- dropped `format` support, just using %-interpolation
Coming soon:
- Rationale section ;)
================================================================================
PEP: 461
Title: Adding % formatting to bytes
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-01-13
Python-Version: 3.5
Post-History: 2014-01-14, 2014-01-15, 2014-01-17
Resolution:
Abstract ========
This PEP proposes adding % formatting operations similar to Python 2's str type to bytes [1]_ [2]_.
Overriding Principles =====================
In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by
Don't abbreviate; spell out "Python 2".
values contained in a Unicode representation [3]_.
Proposed semantics for bytes formatting =======================================
%-interpolation ---------------
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers.
Example::
>>> b'%4x' % 10
b'   a'
>>> '%#4x' % 10
' 0xa'
>>> '%04X' % 10
'000A'
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1, not from a str.
Example:
>>> b'%c' % 48 b'0'
>>> b'%c' % b'a' b'a'
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
- input type is something else? use its __bytes__ method; if there isn't one, raise a TypeError
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
'a string'.encode('latin-1')
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
Proposed variations ===================
It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Originally this PEP also proposed adding format style formatting, but it was
"format-style"
decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.
"that the method and"
Various new special methods were proposed, such as __ascii__, __format_bytes___, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.
Footnotes =========
.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting .. [2] neither string.Template, format, nor str.format are under consideration. .. [3] %c is not an exception as neither of its possible arguments are unicode.
+1 from me
On 01/17/2014 08:53 AM, Brett Cannon wrote:
Don't abbreviate; spell out "Python 2".
Fixed.
Originally this PEP also proposed adding format style formatting, but it was
"format-style"
Fixed.
decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.
"that the method and"
Fixed. Thanks. -- ~Ethan~
Ethan Furman <ethan@stoneleaf.us> wrote:
Overriding Principles =====================
In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by values contained in a Unicode representation [3]_.
I think a longer "Rationale" section is justified given the amount of discussion this feature generated. Here is a revised version of what I already suggested:

Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the clean separation of byte strings (i.e. the "bytes" object) from character strings (i.e. the "str" object). The benefit is that character encodings must be explicitly specified and the risk of corrupting character data is reduced.

Unfortunately, this separation has made writing certain types of programs more complicated and verbose. For example, programs that deal with network protocols often manipulate ASCII encoded strings or assemble byte strings from fragments. Since the "bytes" type does not support string formatting, extra encoding and decoding between the "str" type is often required.

For simplicity and convenience it is desirable to introduce formatting methods to "bytes" that allow formatting of ASCII-encoded character data. This change would blur the clean separation of byte strings and character strings. However, it is felt that the practical benefits outweigh the purity costs. The implicit assumption of ASCII-encoding would be limited to formatting methods.

One source of many problems with the Python 2 Unicode implementation is the implicit coercion of Unicode character strings into byte strings using the "ascii" codec. If the character strings contain only ASCII characters, all is well. However, if the string contains a non-ASCII character then coercion causes an exception.

The combination of implicit coercion and value dependent failures has proven to be a recipe for hard to debug errors. A program may seem to work correctly when tested (e.g. string input that happened to be ASCII only) but later would fail, often with a traceback far from the source of the real error. The formatting methods for bytes() should avoid this problem by not implicitly encoding data that might fail based on the content of the data.
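To make the "value dependent failure" point concrete, here is a small sketch. The helper `implicit_ascii` is a made-up stand-in for Python 2's implicit unicode-to-bytes coercion, and the header text is only an example.

```python
def implicit_ascii(text):
    # Stand-in (assumption) for Python 2's implicit coercion via the ascii codec.
    return text.encode("ascii")


# Works during testing because the input happens to be pure ASCII ...
ok = implicit_ascii("Content-Length")

# ... but the same call fails later once the data contains a non-ASCII character.
try:
    implicit_ascii("caf\u00e9")
except UnicodeEncodeError as exc:
    failure = exc
else:
    failure = None

# Without bytes formatting, building an ASCII protocol line needs a str round trip.
length = 42
header = "Content-Length: {}\r\n".format(length).encode("ascii")
```

The first call succeeds and the second raises purely because of the data, which is the behaviour the proposed bytes formatting is meant to avoid.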
I think we can back off on the duck-typing idea. It's a good Python principle but I now realize existing %-interpolation doesn't do it. The numeric format codes coerce to long or float.
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
I think %a should be supported. I imagine it would be quite useful when dumping debugging output to a bytes stream. It's easy to implement and I think the danger for abuse or surprises is small. It would also help when translating Python 2 code: just change %r to %a.
Proposed variations ===================
It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.
I think we should use %b instead of %s. In that case, I'm fine with %b not accepting numbers. Using %b clearly indicates we are inserting arbitrary bytes. It also proves a useful code review step when translating from Python 2.x. To ease porting from Python 2.x code, I propose adding a command-line option that enables %s and %r format codes for bytes %-interpolation. I'm going to write a draft PEP (it would depend on PEP 461 being implemented).
Originally this PEP also proposed adding format style formatting, but it was decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.
I would also argue that we should limit the scope of this PEP. It has already generated a massive amount of discussion. Nothing precludes us from adding support for format() to bytes in the future, if we decide we want it and how it should work.
Various new special methods were proposed, such as __ascii__, __format_bytes___, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.
I agree, new special methods are not needed at this time since numeric codes do use duck-typing and __bytes__ already exists. Neil
On 17/01/2014 17:46, Neil Schemenauer wrote:
I think we should use %b instead of %s. In that case, I'm fine with %b not accepting numbers. Using %b clearly indicates we are inserting arbitrary bytes. It also proves a useful code review step when translating from Python 2.x.
Using %b could cause problems in the future as b is used in new style formatting to mean output numbers in binary, so %B seems to me the obvious choice as it's also unused. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
Using %b could cause problems in the future as b is used in new style formatting to mean output numbers in binary, so %B seems to me the obvious choice as it's also unused.
After updating my patch, I've decided that %s works better. My patch implements PEP 461 as proposed with the following additional features:

- add %a format code, calls PyObject_ASCII on the argument. I see no reason not to add it as a useful debugging feature.

- add -2 command-line option. When enabled: %s will fall back to calling PyObject_Str() after first trying the buffer API and __bytes__. The value will be encoded using strict ASCII encoding. Also, %r is enabled as an alias for %a.

The patch is v4, bugs.python.org/issue20284, still needs more review and testing.

Neil
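A rough Python model of the fallback order described for the -2 option might look like the following. The function names `legacy_s` and `legacy_r` are hypothetical and this is not the code from the patch; it only mirrors the order stated in the message.

```python
def legacy_s(value):
    """Approximate %s with the proposed -2 option enabled (not the real patch)."""
    try:
        return bytes(memoryview(value))            # buffer API first
    except TypeError:
        pass
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is not None:
        return to_bytes(value)                     # then __bytes__
    return str(value).encode("ascii", "strict")    # finally str() + strict ASCII


def legacy_r(value):
    """%r as an alias for %a: ascii() output, encoded as ASCII."""
    return ascii(value).encode("ascii")
```

Note that the final step in `legacy_s` can fail depending on the characters in the value, which is why it would only be enabled by an explicit option.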
On 1/17/2014 8:49 AM, Ethan Furman wrote:
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
- input type is something else? use its __bytes__ method; if there isn't one, raise a TypeError
Examples:
>>> b'%s' % b'abc' b'abc'
>>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method
>>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?
If you produce a helpful error message for str (re: encoding), might it not be appropriate to produce a helpful error message for builtin number types (, perhaps you need a numeric format code?)?
On 01/17/2014 11:40 AM, Glenn Linderman wrote:
On 1/17/2014 8:49 AM, Ethan Furman wrote:
>>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method
If you produce a helpful error message for str (re: encoding), might it not be appropriate to produce a helpful error message for builtin number types (, perhaps you need a numeric format code?)?
Good point! Done. -- ~Ethan~
On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:
Overriding Principles =====================
In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by values contained in a Unicode representation [3]_.
I don't understand this paragraph. What does "values contained in a Unicode representation" mean? [...]
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
Can you give some examples of what types support Py_buffer? Presumably bytes. Anything else?
- input type is something else? use its __bytes__ method; if there isn't one, raise a TypeError
I think you should explicitly state that this is a new special method, and state which built-in types will grow a __bytes__ method (if any).
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea. This is a rather large violation of the principle of least surprise, and radically different from the behaviour of Python 3 str. In Python 3, '%d' interpolation calls the __str__ method, so if you subclass, you can get the behaviour you want:

    py> class HexInt(int):
    ...     def __str__(self):
    ...         return hex(self)
    ...
    py> "%d" % HexInt(23)
    '0x17'

which is exactly what we should expect from a subclass.

You're suggesting that bytes should ignore any custom display implemented by subclasses, and implicitly coerce them to the superclass int. What is the justification for this? You don't define or even describe what you consider "properly handle".
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
+1 on not supporting b'%r' (i.e. I agree with the PEP). Why not support b'%a'? That seems to be a strange thing to prohibit. Everything else, well done and thank you. -- Steven
On 01/17/2014 05:27 PM, Steven D'Aprano wrote:
On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:
Overriding Principles =====================
In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by values contained in a Unicode representation [3]_.
I don't understand this paragraph. What does "values contained in a Unicode representation" mean?
Yeah, that is clunky. I'm trying to convey the idea that we don't want errors based on content, i.e. which characters happen to be in a str.
[...]
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
Can you give some examples of what types support Py_buffer? Presumably bytes. Anything else?
Anybody? Otherwise I'll go spelunking in the code.
- input type is something else? use its __bytes__ method; if there isn't one, raise a TypeError
I think you should explicitly state that this is a new special method, and state which built-in types will grow a __bytes__ method (if any).
It's not new. I know bytes, str, and numbers /do not/ have __bytes__.
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea.
This is a rather large violation of the principle of least surprise, and radically different from the behaviour of Python 3 str. In Python 3, '%d' interpolation calls the __str__ method, so if you subclass, you can get the behaviour you want:
Did you read the bug reports I linked to? This behavior (which is a bug) has already been fixed for Python3.4. As a quick thought experiment, why does "%d" % True return "1"?
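The thought experiment can be run directly. The sketch below reuses the HexInt class from the quoted message and assumes an interpreter that includes the fix mentioned above (Python 3.4 or later); the variable names are only for illustration.

```python
class HexInt(int):
    def __str__(self):
        return hex(self)


plain = "%d" % True         # str(True) is 'True', yet %d produces a digit
sub = "%d" % HexInt(23)     # with the fix, the custom __str__ is not consulted
via_s = "%s" % HexInt(23)   # %s still goes through __str__
```

Here `plain` is "1" and `sub` is "23", while `via_s` is "0x17": the numeric codes format the numeric value, and only %s uses the object's own string conversion.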
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
+1 on not supporting b'%r' (i.e. I agree with the PEP).
Why not support b'%a'? That seems to be a strange thing to prohibit.
I'll admit to being somewhat on the fence about %a. It seems there are two possibilities with %a:

1) have it be ascii(repr(obj))

2) have it be str(obj).encode('ascii', 'strict')

(1) seems only useful for debugging, but even then not very -- if you switch from %s to %a you'll no longer see the bytes output (although you would get the name of the object, which could be handy);

(2) is (slightly) blurring the lines between text and encoded-ascii; I would rather see "%s" % text.encode('ascii', 'strict')

So we have two possibilities, both can be useful, I don't know which is most useful or even most logical. So I guess I'm still open to arguments. :)
Everything else, well done and thank you.
You're welcome! Thank you to everyone who participated. -- ~Ethan~
On 01/17/2014 06:03 PM, Chris Angelico wrote:
On Sat, Jan 18, 2014 at 12:51 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
It seems there are two possibilities with %a:
1) have it be ascii(repr(obj))
Wouldn't that be redundant? ascii() is already repr()-like.
Good point. -- ~Ethan~
On 18 Jan 2014 11:52, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/17/2014 05:27 PM, Steven D'Aprano wrote:
On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:
Overriding Principles =====================
In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by values contained in a Unicode representation [3]_.
I don't understand this paragraph. What does "values contained in a Unicode representation" mean?
Yeah, that is clunky. I'm trying to convey the idea that we don't want
errors based on content, i.e. which characters happens to be in a str.
[...]
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
Can you give some examples of what types support Py_buffer? Presumably bytes. Anything else?
Anybody? Otherwise I'll go spelunking in the code.
bytes, bytearray, memoryview, ctypes arrays, array.array, numpy.ndarray

It may actually be clearer to express this in terms of memoryview for the benefit of those that aren't familiar with the C API, as that is the closest equivalent Python level API (while there is an open issue regarding the C only nature of the buffer export API, nobody has volunteered to put together a PEP and implementation for a Python level follow up to the C level PEP 3118). The problem is that the original use cases involve C extensions anyway, so the relevant experts don't have any personal need for a Python level buffer exporter interface. Instead, it's in the "should be done for completeness, and would make some of our testing easier, but doesn't have anyone clamouring for it" bucket.
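Expressed in terms of memoryview, the check could be sketched as follows. Only standard-library exporters are shown, and the helper name `supports_buffer` is hypothetical.

```python
import array

# A few standard-library types that export the buffer protocol.
exporters = [
    b"abc",
    bytearray(b"abc"),
    memoryview(b"abc"),
    array.array("B", [97, 98, 99]),
]

# memoryview() gives access to the raw data; bytes() then copies it out.
raw = [bytes(memoryview(obj)) for obj in exporters]


def supports_buffer(obj):
    """Return True if obj can be wrapped in a memoryview."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True
```

Every entry in `raw` is b'abc', whereas `supports_buffer('abc')` and `supports_buffer(3)` are both False, which is why str and numbers fall through to the __bytes__ step.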
- input type is something else? use its __bytes__ method; if there isn't one, raise a TypeError
I think you should explicitly state that this is a new special method, and state which built-in types will grow a __bytes__ method (if any).
It's not new. I know bytes, str, and numbers /do not/ have __bytes__.
Right, it is already used by bytes to convert arbitrary objects to a binary representation. The difference with Py_buffer/memoryview is that they provide access to the raw data without necessarily copying anything. str and numbers don't implement it as there's no obvious default interpretation (the b'\x00' * n interpretation of integers is part of the bytes constructor and now a decision we mostly regret - it should have been a keyword argument or a separate class method)
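A short sketch of both points, using a hypothetical `Frame` class as an example of a type that genuinely has a binary representation:

```python
class Frame:
    """Hypothetical type with a genuine binary representation."""

    def __init__(self, kind, payload):
        self.kind = kind
        self.payload = payload

    def __bytes__(self):
        # One byte for the kind, one for the length, then the payload.
        return bytes([self.kind, len(self.payload)]) + self.payload


frame = bytes(Frame(1, b"hi"))   # bytes() consults __bytes__
zeros = bytes(3)                 # integer argument: three NUL bytes, not b"3"
has_hook = [hasattr(t, "__bytes__") for t in (str, int, float)]
```

The `zeros` result is the constructor behaviour described as regrettable above, and `has_hook` shows that str, int and float define no __bytes__ of their own.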
Unsupported codes -----------------
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported.
+1 on not supporting b'%r' (i.e. I agree with the PEP).
Why not support b'%a'? That seems to be a strange thing to prohibit.
I'll admit to being somewhat on the fence about %a.
It seems there are two possibilities with %a:
1) have it be ascii(repr(obj))
2) have it be str(obj).encode('ascii', 'strict')
This gets very close to crossing the line into implicit encoding of text again. Binary interpolation is being added back for the specific use case of working with ASCII compatible segments in binary formats, and it's at best arguable that supporting %a will help with that use case. However, without it, there may be a greater temptation to inappropriately define __bytes__ just to support binary interpolation, rather than because a type truly has an appropriate translation directly to bytes. By allowing %a, we avoid that temptation. This is also potentially useful specifically in the case of binary logging formats and as a quick way to request backslash escaping of non-ASCII characters in text. Call it +0.5 for allowing %a. I don't expect it to be used heavily, but I think it will head off a fair bit of potential misuse of __bytes__. Cheers, Nick.
On 01/18/2014 05:48 AM, Nick Coghlan wrote:
On 18 Jan 2014 11:52, "Ethan Furman" wrote:
I'll admit to being somewhat on the fence about %a.
It seems there are two possibilities with %a:
1) have it be ascii(repr(obj))
2) have it be str(obj).encode('ascii', 'strict')
This gets very close to crossing the line into implicit encoding of text again. Binary interpolation is being added back for the specific use case of working with ASCII compatible segments in binary formats, and it's at best arguable that supporting %a will help with that use case.
Agreed.
However, without it, there may be a greater temptation to inappropriately define __bytes__ just to support binary interpolation, rather than because a type truly has an appropriate translation directly to bytes.
True.
By allowing %a, we avoid that temptation. This is also potentially useful specifically in the case of binary logging formats and as a quick way to request backslash escaping of non-ASCII characters in text.
Call it +0.5 for allowing %a. I don't expect it to be used heavily, but I think it will head off a fair bit of potential misuse of __bytes__.
So, if %a is added it would act like:

---------
"%a" % some_obj
---------
tmp = str(some_obj)
res = b''
for ch in tmp:
    if ord(ch) < 256:
        res += bytes([ord(ch)])
    else:
        res += unicode_escape(ch)
---------

where 'unicode_escape' would yield something like "\u0440" ?

--
~Ethan~
Ethan Furman <ethan@stoneleaf.us> wrote:
So, if %a is added it would act like:
--------- "%a" % some_obj --------- tmp = str(some_obj) res = b'' for ch in tmp: if ord(ch) < 256: res += bytes([ord(ch)] else: res += unicode_escape(ch) ---------
where 'unicode_escape' would yield something like "\u0440" ?
My patch on the tracker already implements %a, it's simple. Just call PyObject_ASCII() (same as ascii()) then call PyUnicode_AsLatin1String(s) to convert it to bytes and stick it in. PyObject_ASCII does not return non-ASCII characters, no decode error is possible. We could call _PyUnicode_AsASCIIString(s, "strict") instead if we are afraid for non-ASCII bytes coming out of PyObject_ASCII. Neil
On 01/18/2014 05:21 PM, Neil Schemenauer wrote:
Ethan Furman <ethan@stoneleaf.us> wrote:
So, if %a is added it would act like:
--------- "%a" % some_obj --------- tmp = str(some_obj) res = b'' for ch in tmp: if ord(ch) < 256: res += bytes([ord(ch)] else: res += unicode_escape(ch) ---------
where 'unicode_escape' would yield something like "\u0440" ?
My patch on the tracker already implements %a, it's simple.
Before one implements a patch it is good to know the specifications.
Just call PyObject_ASCII() (same as ascii()) then call PyUnicode_AsLatin1String(s) to convert it to bytes and stick it in. PyObject_ASCII does not return non-ASCII characters, no decode error is possible. We could call _PyUnicode_AsASCIIString(s, "strict") instead if we are afraid for non-ASCII bytes coming out of PyObject_ASCII.
I appreciate that this is the behavior you want, but I'm not sure it's the behavior Nick was describing. -- ~Ethan~
On 19 January 2014 12:34, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/18/2014 05:21 PM, Neil Schemenauer wrote:
Ethan Furman <ethan@stoneleaf.us> wrote:
So, if %a is added it would act like:
--------- "%a" % some_obj --------- tmp = str(some_obj) res = b'' for ch in tmp: if ord(ch) < 256: res += bytes([ord(ch)] else: res += unicode_escape(ch) ---------
where 'unicode_escape' would yield something like "\u0440" ?
My patch on the tracker already implements %a, it's simple.
Before one implements a patch it is good to know the specifications.
A very sound engineering principle :)

Neil has the resulting semantics right for what I had in mind, but the faster path to bytes (rather than going through the ASCII builtin) is to do the C level equivalent of:

    repr(obj).encode("ascii", errors="backslashreplace")

That's essentially what the ascii() builtin does, but that operates entirely in the text domain, so (as Neil found) you still need a separate encode step at the end.

    >>> ascii("è").encode("ascii")
    b"'\\xe8'"
    >>> repr("è").encode("ascii", errors="backslashreplace")
    b"'\\xe8'"

b"%a" % "è" should produce the same result as the two examples above. (Code points higher up in the Unicode code space would produce \u and \U escapes as needed, which should already be handled properly by the backslashreplace error handler)

One nice thing about this definition is that in the specific case of text input, the transformation can always be reversed by decoding as ASCII and then applying ast.literal_eval():

    >>> import ast
    >>> ast.literal_eval(repr("è").encode("ascii", "backslashreplace").decode("ascii"))
    'è'

(Please don't use eval() to reverse a transformation like this, as doing so not only makes security engineers cry, it's also likely to make your code vulnerable to all kinds of interesting attacks)

As noted earlier in the thread, one key purpose of including this feature is to reduce the likelihood of people inappropriately adding __bytes__ implementations for %s compatibility that look like:

    def __bytes__(self):
        # This is unlikely to be a good idea!
        return repr(self).encode("ascii", errors="backslashreplace")

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
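The same definition can be checked against code points further up the Unicode range. This is a pure-Python approximation; the helper name `percent_a` and the sample strings are chosen here only for illustration.

```python
import ast


def percent_a(obj):
    """Pure-Python model of the %a semantics described above."""
    return repr(obj).encode("ascii", errors="backslashreplace")


# ASCII text, a Latin-1 character, a BMP character and an astral character.
samples = ["plain", "\u00e8", "\u0440", "\U0001f40d"]
encoded = [percent_a(s) for s in samples]

# For these inputs the result matches ascii() followed by an ASCII encode.
same_as_ascii = all(percent_a(s) == ascii(s).encode("ascii") for s in samples)

# Decoding as ASCII and applying ast.literal_eval() recovers the original text.
round_trip = [ast.literal_eval(e.decode("ascii")) for e in encoded]
```

For these samples the escapes come out as \xe8, \u0440 and \U0001f40d respectively, and `round_trip` equals `samples`.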
On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:
On 01/17/2014 05:27 PM, Steven D'Aprano wrote:
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea.
This is a rather large violation of the principle of least surprise, and radically different from the behaviour of Python 3 str. In Python 3, '%d' interpolation calls the __str__ method, so if you subclass, you can get the behaviour you want:
Did you read the bug reports I linked to? This behavior (which is a bug) has already been fixed for Python3.4.
No I didn't. This thread is huge, and it's only one of a number of huge threads about the same "bytes/unicode Python 2/3" stuff. I'm probably not the only person who missed the bug reports you linked to.

If these bug reports are relevant to the PEP, you ought to list them in the PEP, and if they aren't relevant, I shan't be reading them *wink*

In any case, whether I have succeeded in making the case against this aspect of the PEP or not, I think you should:

- explain what you mean by "properly handle" (give an example?);

- justify why b'%d' % obj should ignore any relevant overloaded methods in obj;

- if there are similar, existing, examples of this (to me) surprising behaviour, you should briefly mention them;

- note that there was some opposition to the suggestion;

- and explain why the contrary behaviour (i.e. allowing obj to overload b'%d') is not desirable.
As a quick thought experiment, why does "%d" % True return "1"?
I don't know. Perhaps it is a bug? -- Steven
On 01/19/2014 03:37 AM, Steven D'Aprano wrote:
On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:
On 01/17/2014 05:27 PM, Steven D'Aprano wrote:
Numeric Format Codes --------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea.
This is a rather large violation of the principle of least surprise, and radically different from the behaviour of Python 3 str. In Python 3, '%d' interpolation calls the __str__ method, so if you subclass, you can get the behaviour you want:
Did you read the bug reports I linked to? This behavior (which is a bug) has already been fixed for Python 3.4.
No I didn't. This thread is huge, and it's only one of a number of huge threads about the same "bytes/unicode Python 2/3" stuff. I'm probably not the only person who missed the bug reports you linked to.
Fair point.
If these bug reports are relevant to the PEP, you ought to list them in the PEP, and if they aren't relevant, I shan't be reading them *wink*
<mischievous grin> Well, it seems to me they are more relevant to your misunderstanding of how %d and friends should work rather than to the PEP itself. However, I suppose it's possible you're not the only one so affected, so I'll link them in. </mischievous grin>
In any case, whether I have succeeded in making the case against this aspect of the PEP or not
Not. This was a bug that was fixed long before the PEP came into existence.
As a quick thought experiment, why does "%d" % True return "1"?
I don't know. Perhaps it is a bug?
To summarize a rather long issue, %d and friends are /numeric/ codes; returning non-numeric text is inappropriate. Yes, I realize there are other unicode values that also mean numeric digits, but they do not mean (so far as I know) Decimal digits, or Hexadecimal digits, or Octal digits. (Obviously an ASCII slant going on there.)

Now that I've written that down, I think there are, in fact, other scripts that represent a base-10 number system with obviously different glyphs for the numbers.... Well, that means that this PEP just further strengthens the notion that format is for text (as then a custom numeric type could easily override the display even for :d, :h, etc.) and % is for bytes (where such glyphs are not natively representable anyway).

-- ~Ethan~
Ethan Furman writes:
Well, that means that this PEP just further strengthens the notion that format is for text (as then a custom numeric type could easily override the display even for :d, :h, etc.) and % is for bytes (where such glyphs are not natively representable anyway).
This argument is specious. Alternative numeric characters are just as representable as the ASCII digits are, and in the same way (by defining a bytes <-> str mapping, aka codec). The problem is not that they're non-representable, it's that they're non-ASCII, and the numeric format codes implicitly specify the ASCII numerals when in text as well as when in bytes.

There's no technical reason why these features couldn't use EBCDIC or even UTF-16 nowadays. It's purely a convention. But it's a very useful convention, so it's helpful if Python conforms to it.

(Note that "'{:d}'.format(True)" -> '1' works because True *is* an int and so can be d-formatted in principle. It's not an exceptional case. It's a different issue from what you're talking about here.)

The problem that EIBTI worries about is that in many places there is a local convention to use not pure ASCII, but a specific ASCII superset. This allows them to take advantage of the common convention of using ASCII for protocol keywords, and at the same time using "legacy" facilities for internal processing of text. This becomes a disadvantage if and when such programs need to communicate with internationalized applications.

These PEPs provide a crutch for such crippled software, allowing them to hobble into the House of Python 3. That's obvious, so please don't try to obfuscate it; just declare "consenting adults" and move on.
On 01/19/2014 06:56 PM, Stephen J. Turnbull wrote:
Ethan Furman writes:
Well, that means that this PEP just further strengthens the notion that format is for text (as then a custom numeric type could easily override the display even for :d, :h, etc.) and % is for bytes (where such glyphs are not natively representable anyway).
This argument is specious.
I don't think so. I think it's a good argument for the future of Python code. Mind you, I should probably have said % is primarily for bytes, or even more useful for bytes than for text. The idea being that true text fun stuff requires format, while bytes can only use % for easy formatting.
Alternative numeric characters [are] just as representable as the ASCII digits are, and in the same way (by defining a bytes <-> str mapping, aka codec). The problem is not that they're non-representable, it's that they're non-ASCII, and the numeric format codes implicitly specify the ASCII numerals when in text as well as when in bytes.
Certainly. And you can't change that either. Oh, wait, you can! Define your own!

    class LocalNum(int):
        "displays d, i, and u codes in local language"
        def __format__(self, fmt):
            # do the fancy stuff so the characters are not ASCII, but whatever
            # is local here

Then you could have your text /and/ your numbers be in your own language. But you can't get that using % unless you always call a custom function and use %s.
(Note that "'{:d}'.format(True)" -> '1' works because True *is* an int and so can be d-formatted in principle. It's not an exceptional case. It's a different issue from what you're talking about here.)
"'{:d}'.format(True)" is not exceptional, you're right. But "'%d' % True" is, and was singled-out in the unicode display code to print as '1' and not as 'True'. (Now all int subclasses behave this way (in 3.4 anyways).) And I think it's the same issue, or at least closely related. If you create a custom number type with the intention of displaying them in the local lingo, you have to use __format__ because % is hard coded to yield digits that map to ASCII.
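A minimal runnable sketch of the LocalNum idea discussed above. The choice of Arabic-Indic digits and the translation-table approach are assumptions made purely for illustration; nothing in the thread prescribes them. The %-interpolation result shown is the behaviour of Python 3.4 and later.

```python
# Assumption: Arabic-Indic digits (U+0660..U+0669) stand in for "local" numerals.
LOCAL_DIGITS = str.maketrans(
    "0123456789", "".join(chr(0x0660 + i) for i in range(10))
)


class LocalNum(int):
    "displays format()-style integer codes with local digits"

    def __format__(self, fmt):
        # Let int do the real formatting, then swap out the ASCII digits.
        return int.__format__(self, fmt).translate(LOCAL_DIGITS)


n = LocalNum(2014)
via_format = "{:d}".format(n)   # goes through __format__
via_percent = "%d" % n          # never consults __format__; ASCII digits
print(via_format, via_percent)
```

The point of the contrast is that only the format() route gives the subclass a hook; the % route always yields digits that map to ASCII.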
These PEPs provide a crutch for such crippled software, allowing them to hobble into the House of Python 3.
Very picturesque.
That's obvious, so please don't try to obfuscate it; just declare "consenting adults" and move on.
Lots of features can be abused. That doesn't mean we shouldn't talk about the intended use cases and encourage those. -- ~Ethan~
Ethan Furman writes:
This argument is specious.
I don't think so. I think it's a good argument for the future of Python code.
I agree that restricting bytes '%'-formatting to ASCII is a good idea, but you should base your arguments on a correct description of what's going on. It's not an issue of representability. It's an issue of "we should support this for ASCII because it's a useful, nearly universal convention, and we should not support ASCII supersets because that leads to mojibake."
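To make the mojibake concern concrete, here is a small sketch. The sample keyword, the sample text, and the pair of codecs are arbitrary choices for illustration: bytes in the ASCII range decode identically under both codecs, while bytes above 127 do not.

```python
keyword = b"Content-Length"              # pure ASCII protocol keyword
payload = "caf\u00e9".encode("utf-8")    # 'cafe' with an acute accent, as UTF-8

# ASCII bytes mean the same thing under any ASCII-compatible codec.
same = keyword.decode("utf-8") == keyword.decode("latin-1")

# Non-ASCII bytes are misread when the wrong ASCII superset is assumed.
garbled = payload.decode("latin-1")
print(same, garbled)
```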
Then you could have your text /and/ your numbers be in your own language.
My language uses numerals other than those in the ASCII repertoire in a rather stylized way. I can't use __format__ for that, because it depends on context, anyway. Most of the time the digits in the ASCII set are used (especially in tables and the like). I believe that's true for all languages nowadays.
Lots of features can be abused. That doesn't mean we shouldn't talk about the intended use cases and encourage those.
I only objected to claims that issues of "representability" and "what I can do with __format__" support the preferred use cases, not to descriptions of the preferred use cases.
On 01/19/2014 11:10 PM, Stephen J. Turnbull wrote:
Ethan Furman writes:
This argument is specious.
I don't think so. I think it's a good argument for the future of Python code.
I agree that restricting bytes '%'-formatting to ASCII is a good idea, but you should base your arguments on a correct description of what's going on. It's not an issue of representability. It's an issue of "we should support this for ASCII because it's a useful, nearly universal convention, and we should not support ASCII supersets because that leads to mojibake."
Then you could have your text /and/ your numbers be in your own language.
My language uses numerals other than those in the ASCII repertoire in a rather stylized way. I can't use __format__ for that, because it depends on context, anyway. Most of the time the digits in the ASCII set are used (especially in tables and the like). I believe that's true for all languages nowadays.
Lots of features can be abused. That doesn't mean we shouldn't talk about the intended use cases and encourage those.
I only objected to claims that issues of "representability" and "what I can do with __format__" support the preferred use cases, not to descriptions of the preferred use cases.
Thank you. I appreciate your time. -- ~Ethan~
On 01/19/2014 03:37 AM, Steven D'Aprano wrote:
On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:
On 01/17/2014 05:27 PM, Steven D'Aprano wrote:
Numeric Format Codes
--------------------
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea.
I went to add examples to this section of the PEP, and realized I was just describing what Python does anyway. So it doesn't need to be in the PEP. -- ~Ethan~
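As a rough illustration of "what Python does anyway", the sketch below shows the coercions as they behave for str in Python 3.5 and later. The Celsius class is hypothetical and exists only for this example.

```python
class Celsius(float):
    # Hypothetical float subclass with a custom display.
    def __str__(self):
        return "%.1f degrees" % float(self)


as_int = "%d" % 3.7              # d/i/u convert via int(), truncating
as_float = "%.2f" % Celsius(21)  # e/f/g use the float value, not __str__
as_text = "%s" % Celsius(21)     # %s is the code that honours __str__

try:
    "%x" % 3.7                   # o/x/X require a true integer (__index__)
    hex_error = None
except TypeError as exc:
    hex_error = exc
print(as_int, as_float, as_text, hex_error)
```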
Steven D'Aprano, 18.01.2014 02:27:
On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:
%s is restricted in what it will accept::
- input type supports Py_buffer? use it to collect the necessary bytes
Can you give some examples of what types support Py_buffer? Presumably bytes. Anything else?
Lots of things: bytes, bytearray, memoryview, array.array, NumPy arrays, just to name a few. Basically anything that wants itself to be representable as a chunk of memory with metadata. It's a very common thing in the Big Data department (although many people wouldn't know that they're actually heavy users of this protocol because they just use NumPy and/or Cython and don't look under the hood). Stefan
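A short sketch of a few of the standard-library types named above being interpolated with %s. It assumes Python 3.5 or later, where bytes %-formatting is available; the sample values are made up.

```python
import array

sources = [
    b"hi",
    bytearray(b"hi"),
    memoryview(b"hi"),
    array.array("B", [104, 105]),   # the byte values of "h" and "i"
]

# Each object exposes its memory through the buffer protocol, so %s can
# copy the raw bytes out of it.
results = [b"<%s>" % obj for obj in sources]

try:
    b"%s" % "hi"    # str offers neither a buffer nor __bytes__
    str_error = None
except TypeError as exc:
    str_error = exc
print(results, str_error)
```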
Steven D'Aprano <steve@pearwood.info> wrote:
To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G).
-1 on this idea.
This is a rather large violation of the principle of least surprise, and radically different from the behaviour of Python 3 str. In Python 3, '%d' interpolation calls the __str__ method, so if you subclass, you can get the behaviour you want:
    py> class HexInt(int):
    ...     def __str__(self):
    ...         return hex(self)
    ...
    py> "%d" % HexInt(23)
    '0x17'
which is exactly what we should expect from a subclass.
You're suggesting that bytes should ignore any custom display implemented by subclasses, and implicitly coerce them to the superclass int. What is the justification for this? You don't define or even describe what you consider "properly handle".
The proposed behavior (at least as I understand it and as I've implemented in my proposed patch) matches Python 2 str/unicode and Python 3 str behavior for these codes. If you want to allow subclasses to have control or to use duck-typing, you have to use str and __format__. I'm okay with the limitation, bytes formatting can be simple, limited and fast. Neil
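For comparison, a sketch of how the HexInt example behaves once the numeric codes coerce to the base type. It assumes Python 3.5 or later; the added __format__ method is an illustrative extension, not part of the original example.

```python
class HexInt(int):
    def __str__(self):
        return hex(self)

    def __format__(self, spec):
        # format() and str.format() consult this hook; % does not.
        return hex(self)


n = HexInt(23)
percent_text = "%d" % n         # numeric code: the subclass display is ignored
percent_bytes = b"%d" % n       # same rule for bytes
via_format = "{:d}".format(n)   # __format__ is honoured
via_str = "%s" % n              # %s still calls __str__
print(percent_text, percent_bytes, via_format, via_str)
```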
+1 on the technical spec from me. The rationale needs work, but you already know that :) For API consistency, I suggest explicitly noting that bytearray will also support the operation, generating a bytearray result. I also suggest introducing the phrase "ASCII compatible segments in binary formats" somewhere, as the intended use case for *all* the ASCII-assuming methods on the bytes and bytearray types, including this new one. Cheers, Nick.
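A brief sketch of the bytearray suggestion, assuming Python 3.5 or later, where the operation is available on both types. The header line is a made-up example of an ASCII segment in a binary protocol.

```python
# A made-up fragment of an ASCII-based wire protocol.
template = b"Content-Length: %d\r\n"

as_bytes = template % 42
as_bytearray = bytearray(template) % 42   # result type follows the template

print(type(as_bytes).__name__, as_bytes)
print(type(as_bytearray).__name__, as_bytearray)
```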
Nick Coghlan writes:
I also suggest introducing the phrase "ASCII compatible segments in binary formats" somewhere,
What is the use case for "ASCII *compatible* segments"? Can't you just say "ASCII segments"? I'm not sure exactly what PEP 461 says at this point, but most of the discussion prescribes .encode('ascii', errors='strict') for implicit interpolation of str. "ASCII compatible" is a term that people consistently interpret to include the bytes representation of their data. Although the actual rule isn't terribly complex (bytes 0-127 must always have ASCII coded character semantics[1]), AFAIK there are no use cases for that other than encoded text, i.e., interpolating str, and nobody wants that done leniently in Python 3.

Footnotes:

[1] Otherwise you need to analyze the content of data to determine whether "ASCII-compatible" operations are safe to perform. Of course that's possible but it was repeatedly rejected in favor of duck-typing.
On Fri, 17 Jan 2014 08:49:21 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
================================================================================ PEP: 461
There are formatting issues in the HTML rendering; I think the ReST code needs a bit of massaging: http://www.python.org/dev/peps/pep-0461/
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
s/'string' values/unicode strings/ Regards Antoine.
On 01/18/2014 03:40 AM, Antoine Pitrou wrote:
On Fri, 17 Jan 2014 08:49:21 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
================================================================================ PEP: 461
There are formatting issues in the HTML rendering; I think the ReST code needs a bit of massaging: http://www.python.org/dev/peps/pep-0461/
I'm not seeing the problems (could be I don't have enough experience to spot them).
.. note::
Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::
s/'string' values/unicode strings/
Fixed, thanks. -- ~Ethan~
participants (12)

- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- Ethan Furman
- Glenn Linderman
- Larry Hastings
- Mark Lawrence
- Neil Schemenauer
- Nick Coghlan
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano