RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Hi,

bytes % args and bytes.format(args) are requested by the Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.

The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only add bytes%args? Then, the following points must be decided to define the complete list of supported features (formatters):

* Format integer to hexadecimal? ``%x`` and ``%X``
* Format integer to octal? ``%o``
* Format integer to binary? ``{!b}``
* Alignment?
* Truncating? Truncate or raise an error?
* format keywords? ``b'{arg}'.format(arg=5)``
* ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
* Floating point number?
* ``%i``, ``%u`` and ``%d`` formats for integer numbers?
* Signed number? ``%+i`` and ``%-i``

HTML version of the PEP: http://www.python.org/dev/peps/pep-0460/

Inline copy:

PEP: 460
Title: Add bytes % args and bytes.format(args) to Python 3.5
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Jan-2014
Python-Version: 3.5

Abstract
========

Add ``bytes % args`` operator and ``bytes.format(args)`` method to Python 3.5.

Rationale
=========

``bytes % args`` and ``bytes.format(args)`` have been removed in Python 3. This operator and this method are requested by Mercurial and Twisted developers to ease porting their projects to Python 3.

Python 3 suggests formatting text first and then encoding to bytes. In some cases, it does not make sense because arguments are bytes strings. Typical usage is a network protocol which is binary, since data are sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP, POP, FTP are ASCII commands interspersed with binary data.

Using multiple ``bytes + bytes`` instructions is inefficient because it requires temporary buffers and copies which are slow and waste memory. Python 3.3 optimizes ``str2 += str1`` but not ``bytes2 += bytes1``.

``bytes % args`` and ``bytes.format(args)`` have been requested since 2008, even before the first release of Python 3.0 (see issue #3982).

``struct.pack()`` is incomplete. For example, a number cannot be formatted as decimal and it does not support padding bytes strings.

Mercurial 2.8 still supports Python 2.4.

Needed and excluded features
============================

Needed features:

* Bytes strings: bytes, bytearray and memoryview types
* Format integer numbers as decimal
* Padding with spaces and null bytes
* "%s" should use the buffer protocol, not str()

The feature set is minimal to keep the implementation as simple as possible and to limit its cost. ``str % args`` and ``str.format(args)`` are already complex and difficult to maintain; the code is heavily optimized.

Excluded features:

* No implicit conversion from Unicode to bytes (ex: encode to ASCII or to Latin1)
* Locale support (``{!n}`` format for numbers). Locales are related to text and usually to an encoding.
* ``repr()``, ``ascii()``: ``%r``, ``{!r}``, ``%a`` and ``{!a}`` formats. ``repr()`` and ``ascii()`` are used to debug, the output is displayed in a terminal or a graphical widget. They are more related to text.
* Attribute access: ``{obj.attr}``
* Indexing: ``{dict[key]}``
* Features of struct.pack(). For example, format a number as 32 bit unsigned integer in network endian. ``struct.pack()`` can be used to prepare arguments; the implementation should be kept simple.
* Features of int.to_bytes().
* Features of ctypes.
* New format protocol like a new ``__bformat__()`` method. Since the list of supported types is short, there is no need to add a new protocol. Other types must be explicitly cast.
* Alternate format for integer.
  For example, ``'{:#x}'.format(0x123)`` to get ``0x123``. It is more related to debug, and the prefix can easily be written in the format string (ex: ``0x%x``).
* Relation with format() and the __format__() protocol. bytes.format() and str.format() are unrelated.

Unknown:

* Format integer to hexadecimal? ``%x`` and ``%X``
* Format integer to octal? ``%o``
* Format integer to binary? ``{!b}``
* Alignment?
* Truncating? Truncate or raise an error?
* format keywords? ``b'{arg}'.format(arg=5)``
* ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
* Floating point number?
* ``%i``, ``%u`` and ``%d`` formats for integer numbers?
* Signed number? ``%+i`` and ``%-i``

bytes % args
============

Formatters:

* ``"%c"``: one byte
* ``"%s"``: integer or bytes strings
* ``"%20s"`` pads to 20 bytes with spaces (``b' '``)
* ``"%020s"`` pads to 20 bytes with zeros (``b'0'``)
* ``"%\020s"`` pads to 20 bytes with null bytes (``b'\0'``)

bytes.format(args)
==================

Formatters:

* ``"{!c}"``: one byte
* ``"{!s}"``: integer or bytes strings
* ``"{!.20s}"`` pads to 20 bytes with spaces (``b' '``)
* ``"{!.020s}"`` pads to 20 bytes with zeros (``b'0'``)
* ``"{!\020s}"`` pads to 20 bytes with null bytes (``b'\0'``)

Examples
========

* ``b'a%sc%s' % (b'b', 4)`` gives ``b'abc4'``
* ``b'a{}c{}'.format(b'b', 4)`` gives ``b'abc4'``
* ``b'%c' % 88`` gives ``b'X'``
* ``b'%%' % ()`` gives ``b'%'``

Criticisms
==========

* The development cost and maintenance cost.
* In Python 3.3, encoding to ASCII or Latin-1 is as fast as memcpy.
* Developers must work around the lack of bytes%args and bytes.format(args) anyway to support Python 3.0-3.4.
* bytes.join() is consistently faster than format to join bytes strings.
* Formatting functions can be implemented in a third party module

References
==========

* `Issue #3982: support .format for bytes <http://bugs.python.org/issue3982>`_
* `Mercurial project <http://mercurial.selenic.com/>`_
* `Twisted project <http://twistedmatrix.com/trac/>`_
* `Documentation of Python 2 formatting (str % args) <http://docs.python.org/2/library/stdtypes.html#string-formatting>`_
* `Documentation of Python 2 formatting (str.format) <http://docs.python.org/2/library/string.html#formatstrings>`_

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
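
For readers following the Rationale and Criticisms sections, here is a rough sketch of the two workarounds the PEP text alludes to when ``bytes % args`` is unavailable: formatting text and then encoding, or staying in bytes and joining pieces. The function names and the ``Content-Length`` header line are illustrative assumptions, not part of the PEP.

```python
def header_via_encode(name: str, length: int) -> bytes:
    # Workaround 1: format as text first, then encode to bytes.
    return ("%s: %d\r\n" % (name, length)).encode("ascii")


def header_via_join(name: bytes, length: int) -> bytes:
    # Workaround 2: stay in bytes and join the pieces; the integer
    # has to be converted to ASCII digits by hand.
    return b"".join([name, b": ", str(length).encode("ascii"), b"\r\n"])


def frame(payload: bytes) -> bytes:
    # The payload is arbitrary binary data, so it cannot go through
    # text formatting; it is appended after the ASCII header.
    return header_via_join(b"Content-Length", len(payload)) + b"\r\n" + payload
```

Both helpers produce the same header; only the second composes naturally with arguments that are already bytes, which is the case the PEP is concerned with.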
Hi, On Mon, 6 Jan 2014 14:24:50 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:
The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only adding bytes%args?
I think we must either implement both or none of them.
Then, the following points must be decided to define the complete list of supported features (formatters):
* Format integer to hexadecimal? ``%x`` and ``%X`` * Format integer to octal? ``%o`` * Format integer to binary? ``{!b}`` * Alignment? * Truncating? Truncate or raise an error?
Not desirable IMHO. bytes formatting should serve mainly for templating situations (i.e. catenate and insert bytestrings into one another). We cannot start giving text-like semantics to bytes objects without confusing non-experts.
* format keywords? ``b'{arg}'.format(arg=5)`` * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
Yes, bytes formatting must support the same calling conventions as str formatting. BTW, there's a subtlety here: ``%s`` currently means "insert the result of calling __str__", but bytes formatting should *not* call __str__.
* Floating point number? * ``%i``, ``%u`` and ``%d`` formats for integer numbers? * Signed number? ``%+i`` and ``%-i``
No, IMHO. Regards Antoine.
On Tue, Jan 7, 2014 at 12:44 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
BTW, there's a subtlety here: ``%s`` currently means "insert the result of calling __str__", but bytes formatting should *not* call __str__.
Since it derives from the C printf notation, it means "insert string here". The fact that __str__ will be called is secondary to that. I would say it's not a problem for bytes formatting to call __bytes__, or in some other way convert to bytes without calling __str__. Will it be confusing to have bytes and str supporting distinctly different format operations? Might it be better to instead create a separate and very different method on a bytes, just to emphasize the difference? ChrisA
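
To make the point about ``%s`` concrete, here is a minimal sketch of a conversion step that never calls ``__str__``: it accepts bytes-like objects, formats integers as decimal ASCII digits (as in the PEP's ``b'abc4'`` example), and otherwise falls back to ``__bytes__``. The names ``as_bytes`` and ``Packet`` are hypothetical and only illustrate one possible rule; the thread has not settled on any of this.

```python
def as_bytes(value):
    """Convert a value for insertion into a bytes template, never via __str__."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, (bytearray, memoryview)):
        # Buffer-protocol objects are copied as-is.
        return bytes(value)
    if isinstance(value, int):
        # Decimal ASCII digits, matching the PEP example b'a%sc%s' % (b'b', 4).
        return format(value, "d").encode("ascii")
    if hasattr(type(value), "__bytes__"):
        return bytes(value)
    raise TypeError("cannot interpolate %s into bytes" % type(value).__name__)


class Packet:
    """Hypothetical type that defines __bytes__ and forbids __str__."""

    def __bytes__(self):
        return b"\x01\x02"

    def __str__(self):
        raise AssertionError("__str__ must not be called")
```

With these definitions, ``b"".join([b"a", as_bytes(b"b"), b"c", as_bytes(4)])`` yields ``b'abc4'``, while passing a ``str`` raises ``TypeError`` instead of being encoded implicitly.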
On Tue, 7 Jan 2014 00:54:17 +1100 Chris Angelico <rosuav@gmail.com> wrote:
Since it derives from the C printf notation, it means "insert string here". The fact that __str__ will be called is secondary to that. I would say it's not a problem for bytes formatting to call __bytes__, or in some other way convert to bytes without calling __str__.
Will it be confusing to have bytes and str supporting distinctly different format operations? Might it be better to instead create a separate and very different method on a bytes, just to emphasize the difference?
The people who want bytes formatting, AFAICT, want something that is reasonably 2.x-compatible. That means using the same method / operator and calling conventions. Regards Antoine.
On Mon, Jan 6, 2014 at 8:59 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The people who want bytes formatting, AFAICT, want something that is reasonably 2.x-compatible. That means using the same method / operator and calling conventions.
Right, but that also doesn't mean that a library from the Cheeseshop couldn't be provided which works around any Python 2/3 differences. But my suspicion is anyone requesting this feature (e.g. Mercurial) want it implemented in C for performance and so some pure Python library to help with this won't get any traction.
On 6 Jan 2014 22:15, "Brett Cannon" <brett@python.org> wrote:
Right, but that also doesn't mean that a library from the Cheeseshop couldn't be provided which works around any Python 2/3 differences. But my suspicion is anyone requesting this feature (e.g. Mercurial) want it implemented in C for performance and so some pure Python library to help with this won't get any traction.
Right, but it seems to me that a new helper module that could be made backwards compatible at least as far as 2.6 (if not further) would be more useful for that than a builtin change that won't be available until 2015. I think we have enough experience with Python 3 now to say yes, there are still some significant gaps in the support it offers for wire protocol development. We have been hoping others would volunteer to fill that gap, but it's getting to the point where we need to start thinking about handling it ourselves by providing a hybrid Python/C helper module specifically for wire protocol programming. An encodedstr type wouldn't implicitly interoperate with the builtins (until we finally fix the sequence operand coercion bug in CPython) but could at least handle formatting operations like this. Cheers, Nick.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On Mon, Jan 6, 2014 at 9:45 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, but it seems to me that a new helper module that could be made backwards compatible at least as far as 2.6 (if not further) would be more useful for that than a builtin change that won't be available until 2015. I think we have enough experience with Python 3 now to say yes, there are still some significant gaps in the support it offers for wire protocol development.
True, or at least we should be very clear as to how we expect people to do binary packing in Python 3 (Victor's PEP says struct doesn't work, so should that be fixed, etc.). That will help figure out where the holes are currently.
We have been hoping others would volunteer to fill that gap, but it's getting to the point where we need to start thinking about handling it ourselves by providing a hybrid Python/C helper module specifically for wire protocol programming.
Probably. And it can work around any shortcomings we fix in Python 3.5.
An encodedstr type wouldn't implicitly interoperate with the builtins (until we finally fix the sequence operand coercion bug in CPython) but could at least handle formatting operations like this.
You really want that type, don't you? =) -Brett
On 6 Jan 2014 22:56, "Brett Cannon" <brett@python.org> wrote:
You really want that type, don't you? =)
I still don't think the 2.x bytestring is inherently evil, it's just the wrong type to use as the core text type because of the problems it has with silently creating mojibake and also with multi-byte codecs and slicing. The current python-ideas thread is close to convincing me even a stripped down version isn't a good idea, though :P Cheers, Nick.
Is this really a good idea? PEP 460 proposes rather different semantics for bytes.format and the bytes % operator from the str versions. I think this is going to be both confusing and a continuous target for "further improvement" until the two implementations converge. Nick Coghlan writes:
I still don't think the 2.x bytestring is inherently evil, it's just the wrong type to use as the core text type because of the problems it has with silently creating mojibake and also with multi-byte codecs and slicing. The current python-ideas thread is close to convincing me even a stripped down version isn't a good idea, though :P
Lack of it is obviously a major pain point for many developers, but -- it is inherently evil. It's a structured data type passed around as an unstructured blob of memory, with no way for one part of the program to determine what (if anything) another part of the program thinks it's doing. It's the Python equivalent to the pointer type aliasing that gcc likes to whine about.

Given that most wire protocols that benefit from this kind of thing are based on ASCII-coded commands and parameters, I think there's a better alternative to either adding 2.x bytestrings as a separate type or to PEP 460. This is to add a (minimal) structure we could call "ASCII-compatible byte array" to the current set of Unicode representations. The detailed proposal is on -ideas (where I call it "7-bit representation", but that has already caused misunderstanding.)

This representation would treat non-ASCII bytes as the current representations do bytes encoded as surrogates. This representation would be produced only by a special "ascii-compatible" codec (which implies the surrogateescape-like behavior). It has the following advantages for bytestring-type processing:

- double-encoding/decoding is not possible
- uninterpreted bytes are marked as such -- they can be compared for equality, but other character manipulations are no-ops.
- representation is efficient
- output via the 'ascii-compatible' codec is just memcpy
- input via the 'ascii-compatible' codec is reasonably efficient (in the posted proposal detection of non-ASCII bytes is required, so it cannot be just memcpy)
- str operations are all available; only on I/O is any additional overhead imposed compared to str

There's one other possible advantage that I haven't thought through yet: compatibility with 2.x literals (eg, "inputstring.find('To:')" instead of "inputbytes.find(b'To:')").
It probably does impose overhead compared to bytes, especially with the restricted functionality Victor proposes for .format() on bytes, but as Victor points out so does any full-featured string-style processing vs. low-level operations like .join(). I suppose it would be acceptable, except possibly the extra copying for I/O. The main disadvantage is additional complexity in the implementation of the str type. I don't think it imposes much runtime overhead, however, since the checks for different representations when operating on str must be done anyway. Operations involving "ascii-compatible" and other representations at the same time should be rare, except for the combinations of "ascii-compatible" and 8-bit representations -- which just involve copying bytes as between 8-bit and 8-bit, plus a bit of logic to set the type correctly. Steve
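
As a point of comparison, the existing ``surrogateescape`` error handler already gives a taste of the behavior described above: non-ASCII bytes survive a decode/encode round trip as lone surrogates, and ordinary str operations work on the ASCII parts. This sketch uses only that existing handler; it is not the proposed "ascii-compatible" codec or representation, and the helper names and sample header are made up for illustration.

```python
def decode_wire(raw: bytes) -> str:
    # Non-ASCII bytes become lone surrogates (byte 0xE9 -> U+DCE9).
    return raw.decode("ascii", errors="surrogateescape")


def encode_wire(text: str) -> bytes:
    # Escaped surrogates are turned back into the original bytes.
    return text.encode("ascii", errors="surrogateescape")


raw = b"To: caf\xe9@example.org\r\n"
text = decode_wire(raw)

# str operations with plain text literals work on the ASCII portion.
found_at = text.find("To:")

# The uninterpreted byte is still identifiable and round-trips unchanged.
round_tripped = encode_wire(text)
```

Unlike the proposal, nothing here prevents the escaped surrogates from being manipulated like ordinary characters, and the decoded text uses a regular str representation rather than a dedicated one.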
On Tue, Jan 07, 2014 at 09:26:20PM +0900, Stephen J. Turnbull wrote:
Is this really a good idea? PEP 460 proposes rather different semantics for bytes.format and the bytes % operator from the str versions. I think this is going to be both confusing and a continuous target for "further improvement" until the two implementations converge.
Reading about the proposed differences reminded me of how in older python2 versions unicode() took keyword arguments but str.decode() only took positional arguments. I squashed a lot of trivial bugs in people's code where that difference wasn't anticipated. In later python2 versions both of those came to understand how to take their arguments as keywords which saved me from further unnecessary pain. -Toshio
On Tue, 7 Jan 2014 00:45:58 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Right, but it seems to me that a new helper module that could be made backwards compatible at least as far as 2.6 (if not further) would be more useful for that than a builtin change that won't be available until 2015.
More useful in the short term, less useful in the long term.
An encodedstr type wouldn't implicitly interoperate with the builtins (until we finally fix the sequence operand coercion bug in CPython) but could at least handle formatting operations like this.
That's a crude hack. Also it doesn't address the situation where you want to interpolate bytestrings without them having any textual significance. Regards Antoine.
On Mon, Jan 6, 2014 at 8:44 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hi,
On Mon, 6 Jan 2014 14:24:50 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:
The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only adding bytes%args?
I think we must either implement both or none of them.
Or bytes.format() only. But I do agree that only implementing the % operator is the wrong answer. -Brett
On 2014-01-06, at 14:44 , Antoine Pitrou <solipsis@pitrou.net> wrote:
Then, the following points must be decided to define the complete list of supported features (formatters):
* Format integer to hexadecimal? ``%x`` and ``%X`` * Format integer to octal? ``%o`` * Format integer to binary? ``{!b}`` * Alignment? * Truncating? Truncate or raise an error?
Not desirable IMHO. bytes formatting should serve mainly for templating situations (i.e. catenate and insert bytestrings into one another). We cannot start giving text-like semantics to bytes objects without confusing non-experts.
But having at least some of struct's formatting options available on bytes.format or bytes % would be useful.
On 01/06/2014 09:50 AM, Xavier Morel wrote:
But having at least some of struct's formatting options available on bytes.format or bytes % would be useful.
Perhaps, but the PEP's stated goal is to make porting between 2.x and 3.5 easier. Adding struct formatting to 3.5 wouldn't help. Eric.
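
For context on the struct discussion, the PEP's excluded-features list says ``struct.pack()`` can be used to prepare arguments, while its Rationale notes that struct cannot format a number as decimal. A small sketch of both cases follows; the function names and the two framing styles are assumptions chosen for illustration only.

```python
import struct


def length_prefixed(payload: bytes) -> bytes:
    # Binary framing: struct.pack() prepares a 32-bit unsigned integer
    # in network (big-endian) byte order, then the payload is appended.
    return struct.pack("!I", len(payload)) + payload


def decimal_prefixed(payload: bytes) -> bytes:
    # Text-style framing: struct has no "decimal digits" format, so the
    # length is converted to ASCII digits separately.
    return str(len(payload)).encode("ascii") + b":" + payload
```

The first form is what struct is designed for; the second is the kind of ASCII-with-binary templating that the bytes formatting proposal targets.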
I've just posted about PEP 460 and this discussion on the mercurial-devel mailing list. Tim Delaney
On 06.01.2014 14:24, Victor Stinner wrote:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch. cheers, Georg
On 7 January 2014 09:40, Georg Brandl <g.brandl@gmx.net> wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
Will the relevant projects actually support only 2.X and 3.4/5+? If they expect to or have to support 3.2 or 3.3, then this change isn't actually going to help them much. If they will only support versions of Python 3 containing this change, then it may well be worth considering the impact of delaying it till 3.5. Paul.
2014/1/7 Paul Moore <p.f.moore@gmail.com>:
Will the relevant projects actually support only 2.X and 3.4/5+? If they expect to or have to support 3.2 or 3.3, then this change isn't actually going to help them much. If they will only support versions of Python 3 containing this change, then it may well be worth considering the impact of delaying it till 3.5.
Twisted and Mercurial don't support Python 3. (I heard that Twisted Core supports Python 3, but I don't know if it's true nor the Python 3 version.) Victor
Given the low adoption rates for Python 3 it would not surprise me if people who are hampered by the lack of this change are willing to wait until a Python version is released that has it. On Jan 7, 2014, at 5:13 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
2014/1/7 Paul Moore <p.f.moore@gmail.com>:
Will the relevant projects actually support only 2.X and 3.4/5+? If they expect to or have to support 3.2 or 3.3, then this change isn't actually going to help them much. If they will only support versions of Python 3 containing this change, then it may well be worth considering the impact of delaying it till 3.5.
Twisted and Mercurial don't support Python 3.
(I heard that Twisted Core supports Python 3, but I don't know if it's true nor the Python 3 version.)
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On 7 Jan 2014 18:18, "Donald Stufft" <donald@stufft.io> wrote:
Given the low adoption rates for Python 3 it would not surprise me if people who are hampered by the lack of this change are willing to wait until a Python version is released that has it.
Once the code exists (regardless of the exact spelling), it also becomes much easier to extract as an extension module on PyPI for wire protocol formatting. That would allow folks to choose between just supporting 3.5+ and using the builtin formatting operations, or switching to the cross version compatible formatting module (if one was created). So I like the idea of restoring this capability for 3.5, but don't see a reason to consider rushing it into 3.4. Cheers, Nick.
On Jan 07, 2014, at 05:16 AM, Donald Stufft wrote:
Given the low adoption rates for Python 3 it would not surprise me if people who are hampered by the lack of this change are willing to wait until a Python version is released that has it.
If that means waiting until 3.5, then I disagree. The Python interpreter is the lowest rung of the food chain, so there's a natural delay in having required support percolate up. Imposing another 18 month delay would be unfortunate. (Obviously, if technical matters prevent it, that's another thing.) -Barry
On Tue, 7 Jan 2014 05:16:18 -0500 Donald Stufft <donald@stufft.io> wrote:
Given the low adoption rates for Python 3
It would be nice not to repeat that mantra, since there are no reliable usage figures available. Regards Antoine.
On Jan 07, 2014, at 11:13 AM, Victor Stinner wrote:
Twisted and Mercurial don't support Python 3.
(I heard that Twisted Core supports Python 3, but I don't know if it's true nor the Python 3 version.)
Parts of Twisted do run on Python 3 (and are even available in Ubuntu), but if PEP 460 helps speed up the transition of the rest of the suite, I'm all for trying to squeeze it into 3.4. -Barry
Am 07.01.2014 10:59, schrieb Paul Moore:
On 7 January 2014 09:40, Georg Brandl <g.brandl@gmx.net> wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
Will the relevant projects actually support only 2.X and 3.4/5+? If they expect to or have to support 3.2 or 3.3, then this change isn't actually going to help them much. If they will only support versions of Python 3 containing this change, then it may well be worth considering the impact of delaying it till 3.5.
Yes, exactly. Another, and probably better, proposal would be to make 3.5 the "ultimate" viable porting target: we now know pretty well what the major remaining roadblocks (real and perceived) are for our developers and users. The proposal would be to focus entirely on addressing these roadblocks in the 3.5 version, and no other new features -- the release cycle needn't be 18 months for this one. This is similar to the moratorium for 3.2, but that one came too early for 3.x porting to really profit. In short, I am increasingly concerned that although we are going a pretty good way (and Nick's FAQ list makes that much clearer than anything else I've read), it is not perceived as such, and could be better. We have brought Python 3 upon the community, and as such we need to make it very, very clear that we are working with them, not against them. A minor release dedicated to that end should be a very direct representation of that. I know about the "release everything to PyPI" strategy, but it just doesn't have the same impact. It would be very cool to have multiple projects working together with us for this, and at the release of 3.5 final, present (say) a Mercurial that works on 2.5 and 3.5. Mostly pipe-dreams though... Georg
On Tue, 07 Jan 2014 11:33:55 +0100 Georg Brandl <g.brandl@gmx.net> wrote:
The proposal would be to focus entirely on addressing these roadblocks in the 3.5 version, and no other new features -- the release cycle needn't be 18 months for this one. This is similar to the moratorium for 3.2, but that one came too early for 3.x porting to really profit.
The moratorium was for alternate Python implementations IIRC, not for porting third-party libraries.
It would be very cool to have multiple projects working together with us for this, and at the release of 3.5 final, present (say) a Mercurial that works on 2.5 and 3.5.
You seem to be forgetting that we are only one part of the equation here. Unless you want to tackle Mercurial and Twisted porting yourself? Good luck with that. Regards Antoine.
Am 07.01.2014 12:16, schrieb Antoine Pitrou:
On Tue, 07 Jan 2014 11:33:55 +0100 Georg Brandl <g.brandl@gmx.net> wrote:
The proposal would be to focus entirely on addressing these roadblocks in the 3.5 version, and no other new features -- the release cycle needn't be 18 months for this one. This is similar to the moratorium for 3.2, but that one came too early for 3.x porting to really profit.
The moratorium was for alternate Python implementations IIRC, not for porting third-party libraries.
Yes, but this would be a similar moratorium with another purpose.
It would be very cool to have multiple projects working together with us for this, and at the release of 3.5 final, present (say) a Mercurial that works on 2.5 and 3.5.
You seem to be forgetting that we are only one part of the equation here. Unless you want to tackle Mercurial and Twisted porting yourself? Good luck with that.
No no, I did not forget :) that's why I wrote "working together with them". It would need to be coordinated with the external projects, but from what I've seen there are willing people. Georg
On Tue, 07 Jan 2014 10:40:15 +0100 Georg Brandl <g.brandl@gmx.net> wrote:
Am 06.01.2014 14:24, schrieb Victor Stinner:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
From what I've seen of the unicode formatting code, a lot would have to be rewritten or refactored. It is a non-trivial task, definitely inappropriate for 3.4. Regards Antoine.
Antoine Pitrou <solipsis@pitrou.net> wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
From what I've seen of the unicode formatting code, a lot would have to be rewritten or refactored. It is a non-trivial task, definitely inappropriate for 3.4.
I do not know the stringlib well enough, so I have a silly question: Would it be possible to re-use the 2.x stringlib just for the bytes type, name it byteslib and disable features as appropriate? Stefan Krah
On 01/07/2014 06:24 AM, Stefan Krah wrote:
Antoine Pitrou <solipsis@pitrou.net> wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
From what I've seen of the unicode formatting code, a lot would have to be rewritten or refactored. It is a non-trivial task, definitely inappropriate for 3.4.
I do not know the stringlib well enough, so I have a silly question:
Would it be possible to re-use the 2.x stringlib just for the bytes type, name it byteslib and disable features as appropriate?
I do know it pretty well. I think reusing stringlib from either 2.x or 3.x pre-PEP-393 version would be the best way to go about this. Unfortunately, reusing (or sharing) the PEP-393 version currently in 3.4 is probably not realistic. Eric.
On Jan 07, 2014, at 10:40 AM, Georg Brandl wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
I think we should be willing to entertain breaking feature freeze for getting this in Python 3.4. It's a serious enough problem, and Python 3.4 will be fairly widely distributed. For example, it will be a supported version in the next Debian release and in Ubuntu 14.04 LTS, and *possibly* the default Python 3 version. However, I think we'd need to see how disruptive the code changes are first, and get good review of any proposed patches. Larry and Guido would have to be on board with the exemption as well. If adopted for Python 3.4, PEP 460 should be modest in its goals, but I think I'd still like to see the following excluded and unknown features added: * Attribute access: {obj.attr} * Indexing: {dict[key]} * format keywords? b'{arg}'.format(arg=5) * str % dict ? b'%(arg)s' % {'arg': 5} These are just lookup mechanisms for finding the wanted interpolation value and don't have encoding or conversion effects. Cheers, -Barry
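The four lookup mechanisms listed above already exist for str, so their behaviour can be sketched there. This is only an illustration on text strings: the `Header` class and all values are made up for the example, and nothing here shows what PEP 460 would specify for bytes.

```python
class Header:
    """Hypothetical object, used only to demonstrate attribute lookup."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


h = Header("Content-Length", 5)
fields = {"arg": 5}

by_attr = "{h.name}: {h.value}".format(h=h)   # attribute access: {obj.attr}
by_index = "{d[arg]}".format(d=fields)        # indexing: {dict[key]}
by_keyword = "{arg}".format(arg=5)            # format keyword
by_mapping = "%(arg)s" % fields               # str % dict
```

In each case the lookup only selects which value gets interpolated; how that value is rendered is a separate step.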
On Tue, Jan 7, 2014 at 2:46 PM, Barry Warsaw <barry@python.org> wrote:
I think we should be willing to entertain breaking feature freeze for getting this in Python 3.4.
Maybe you could revert 3.4 to alpha status and give it a cycle or two there to get this done before returning to beta status. Skip
On 07/01/2014 22:11, Skip Montanaro wrote:
On Tue, Jan 7, 2014 at 2:46 PM, Barry Warsaw <barry@python.org> wrote:
I think we should be willing to entertain breaking feature freeze for getting this in Python 3.4.
Maybe you could revert 3.4 to alpha status and give it a cycle or two there to get this done before returning to beta status.
Skip
When I first saw the suggestion from Georg I had visions of men in white coats dragging him off :) Having given the idea more thought, I think there's an opportunity here and now to make a very profound long term impact for Python 3. Skip's idea seems to me a clean way to do this. Short term pain, long term gain? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
On Tue, 7 Jan 2014 15:46:50 -0500 Barry Warsaw <barry@python.org> wrote:
If adopted for Python 3.4, PEP 460 should be modest in its goals, but I think I'd still like to see the following excluded and unknown features added:
* Attribute access: {obj.attr} * Indexing: {dict[key]} * format keywords? b'{arg}'.format(arg=5) * str % dict ? b'%(arg)s' % {'arg': 5}
I don't think integer values should be supported. Regards Antoine.
On Jan 07, 2014, at 11:13 PM, Antoine Pitrou wrote:
On Tue, 7 Jan 2014 15:46:50 -0500 Barry Warsaw <barry@python.org> wrote:
If adopted for Python 3.4, PEP 460 should be modest in its goals, but I think I'd still like to see the following excluded and unknown features added:
* Attribute access: {obj.attr} * Indexing: {dict[key]} * format keywords? b'{arg}'.format(arg=5) * str % dict ? b'%(arg)s' % {'arg': 5}
I don't think integer values should be supported.
Sorry, the point I was making was about the interpolation and lookup features, not the specific values. -Barry
On Tue, Jan 7, 2014, at 12:46 PM, Barry Warsaw wrote:
On Jan 07, 2014, at 10:40 AM, Georg Brandl wrote:
Very nice, thanks. If I was to make a blasphemous suggestion I would even target it for Python 3.4. (No, seriously, this is a big issue - see the recent discussion by Armin - and the big names involved show that it is a major holdup of 3.x uptake.) It would of course depend a lot on how much code from unicode formatting can be retained or adapted as opposed to a rewrite from scratch.
I think we should be willing to entertain breaking feature freeze for getting this in Python 3.4. It's a serious enough problem, and Python 3.4 will be fairly widely distributed. For example, it will be a supported version in the next Debian release and in Ubuntu 14.04 LTS, and *possibly* the default Python 3 version. However, I think we'd need to see how disruptive the code changes are first, and get good review of any proposed patches. Larry and Guido would have to be on board with the exemption as well.
I agree. This is a very important, much-requested feature for low-level networking code.
If adopted for Python 3.4, PEP 460 should be modest in its goals, but I think I'd still like to see the following excluded and unknown features added:
* Attribute access: {obj.attr} * Indexing: {dict[key]} * format keywords? b'{arg}'.format(arg=5) * str % dict ? b'%(arg)s' % {'arg': 5}
Yes, I don't think we need to support very much of the formatting language to cover 99.8% of formatting cases for bytes.
Benjamin Peterson writes:
I agree. This is a very important, much-requested feature for low-level networking code.
I hear it's much-requested, but is there any description of typical use cases? The ones I've seen on this list and on -ideas are typically stream-oriented, and seem like they would be perfectly well-served in terms of code readability and algorithmic accuracy by reading with .decode('ascii', errors='surrogateescape') and writing with .encode() and the same parameters (or as latin1).
Yes, I don't think we need to support very much of the formatting language to cover 99.8% of formatting cases for bytes.
And the other 0.2% will be continuous excuses for RFEs and gratuitous bugs in rarely used format specs and ports from str processing to bytes processing.
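A minimal sketch of the decode/encode approach described in this message, assuming a made-up header line that contains one non-ASCII byte. The `surrogateescape` error handler maps each undecodable byte to a lone surrogate on the way in and restores it on the way out, so ordinary str formatting can be used in between.

```python
raw = b"X-Header: caf\xe9\r\n"   # hypothetical wire data with one non-ASCII byte (0xE9)

# Reading: bytes >= 0x80 become lone surrogates instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# Ordinary str formatting works on the decoded text.
line = "%s%s" % ("Received: ", text)

# Writing: the same handler turns the surrogates back into the original bytes.
out = line.encode("ascii", errors="surrogateescape")
```

The original byte survives the round trip unchanged, although the intermediate str is not meaningful text at that position.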
On Wed, 08 Jan 2014 13:51:36 +0900 "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Benjamin Peterson writes:
I agree. This is a very important, much-requested feature for low-level networking code.
I hear it's much-requested, but is there any description of typical use cases? The ones I've seen on this list and on -ideas are typically stream-oriented, and seem like they would be perfectly well-served in terms of code readability and algorithmic accuracy by reading with .decode('ascii', errors='surrogateescape') and writing with .encode() and the same parameters (or as latin1).
It's a matter of convenience. Sometimes you're just interpolating bytes data together and it's a bit suboptimal to have to do a decode()-encode() dance around that. That said, the whole issue is slightly overblown as well: network programming in 3.x is perfectly reasonable, as the existence of Tornado and Tulip shows. Regards Antoine.
On Jan 08, 2014, at 01:51 PM, Stephen J. Turnbull wrote:
Benjamin Peterson writes:
I agree. This is a very important, much-requested feature for low-level networking code.
I hear it's much-requested, but is there any description of typical use cases?
The two unported libraries that are preventing me from switching Mailman 3 to Python 3 are restish and storm. For storm, there's a viable alternative in SQLAlchemy though I haven't looked at how difficult it will be to port the model layer (even though we once did use SA). restish is tougher. I've investigated flask, pecan, wsme, and a few others that already have Python 3 support and none of them provide an API that I consider as nice a fit as restish for our standalone WSGI-based REST admin server. That's not to denigrate those other projects, it's just that I think restish hit the sweet spot, and porting Mailman 3 to some other framework so far has proven unworkable (I've tried with each of them). restish is plumbing so I think it's a good test case for Nick's observations of a wire-protocol layer library, and it's obvious that it Just Works in Python 2 but does not work at all in Python 3. There have been at least 3 attempts to port restish to Python 3 and all of them get stuck in various places where you actually *can't* decide whether some data structure should be a bytes or str. Make one choice and you get stuck over here, make the other choice and you get stuck over there. I've got two abandoned branches on github with (rather old) porting attempts, and I know other developers have some branches as well. Having given up on trying to switch to a different framework, I'm starting over again with restish (really, it's wonderful :). I plan on keeping more detailed notes this time specifically so that I can help contribute to this discussion. If anybody wants to pitch in, both for the specific purpose of porting the library, and for the more general insights it could provide for this thread, please get in touch. Cheers, -Barry
Most popular formatting codes in Mercurial sources:
2519 %s
 493 %d
 102 %r
  48 %Y
  47 %M
  41 %H
  39 %S
  38 %m
  33 %i
  29 %b
  23 %ld
  19 %ln
  12 %.3f
  10 %a
  10 %.1f
   9 %(val)r
   9 %p
   9 %.2f
   8 %I
   6 %n
   5 %(val)s
   5 %.0f
   5 %02x
   4 %f
   4 %c
   4 %12s
   3 %(user)s
   3 %(id)s
   3 %h
   3 %(bzdir)s
   3 %0.2f
   3 %02d
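How these figures were produced is not stated. One plausible way to get a similar tally is a regular-expression scan, sketched below. The pattern and the sample source are assumptions made for illustration only. A purely textual count like this cannot tell `%`-formatting codes apart from `strftime` directives such as `%Y` or `%H`.

```python
import re
from collections import Counter

# Hypothetical pattern: '%', optional "(key)", optional flags, width,
# precision, then a single conversion letter.
FORMAT_CODE = re.compile(r"%(?:\([^)]*\))?[-+ #0]*\d*(?:\.\d+)?[a-zA-Z]")


def count_format_codes(source: str) -> Counter:
    """Count every substring of `source` that looks like a format code."""
    return Counter(FORMAT_CODE.findall(source))


# Made-up two-line sample: one % formatting call and one strftime call.
sample = (
    'ui.write("%s: %d files" % (name, count))\n'
    'stamp = time.strftime("%Y-%m-%d %H:%M:%S")\n'
)
counts = count_format_codes(sample)
```

In this sample `%d` is counted twice, once as an integer format and once as the strftime day-of-month directive.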
+1 I have always been delighted that it is possible to manipulate binary data in Python using string operations. It's not just immoral non-Unicode text processing. A poor man's ASN.1 generator is an example of a very non-text thing that might be convenient to write with a few %s fill-in-the-blanks. Isn't it true that if you have bytes > 127 or surrogate escapes then encoding to latin1 is no longer as fast as memcpy? On Tue, Jan 7, 2014 at 8:22 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Most popular formatting codes in Mercurial sources:
2519 %s 493 %d 102 %r 48 %Y 47 %M 41 %H 39 %S 38 %m 33 %i 29 %b 23 %ld 19 %ln 12 %.3f 10 %a 10 %.1f 9 %(val)r 9 %p 9 %.2f 8 %I 6 %n 5 %(val)s 5 %.0f 5 %02x 4 %f 4 %c 4 %12s 3 %(user)s 3 %(id)s 3 %h 3 %(bzdir)s 3 %0.2f 3 %02d
Am 07.01.14 15:08, schrieb Daniel Holth:
Isn't it true that if you have bytes > 127 or surrogate escapes then encoding to latin1 is no longer as fast as memcpy?
You mean "decoding from latin1" (i.e. bytes to string)? No, the opposite is true. It couldn't use memcpy before, but does now (see _PyUnicode_FromUCS1). Regards, Martin
Daniel Holth writes:
Isn't it true that if you have bytes > 127 or surrogate escapes then encoding to latin1 is no longer as fast as memcpy?
Be careful. As phrased, the question makes no sense. You don't "have bytes" when you are encoding, you have characters. If you mean "what happens when my str contains characters in the range 128-255?", the answer is encoding a str in 8-bit representation to latin1 is effectively memcpy. If you read in latin1, it's memcpy all the way (unless you combine it with a non-latin1 string, in which case you're in the cases below). If you mean "what happens when my str contains characters in the range > 255", you have to truncate 16-bit units to 8 bit units; no memcpy. Surrogates require >= 16 bits; no memcpy.
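The three cases can be told apart from Python code, as in the sketch below. It does not measure speed or show whether memcpy is used internally; it only shows which strings fit in latin1 at all. The helper `encodes_to_latin1` is a name invented for this example.

```python
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")   # every byte maps to U+0000..U+00FF
round_trip = as_text.encode("latin1")  # and back again, losslessly


def encodes_to_latin1(s: str) -> bool:
    """Return True if every character of `s` fits in latin1."""
    try:
        s.encode("latin1")
    except UnicodeEncodeError:
        return False
    return True


wide = "\u20ac"  # euro sign, code point > 255
escaped = b"\xff".decode("ascii", errors="surrogateescape")  # lone surrogate U+DCFF
```

Characters in the range 128-255 encode without error, while both the wide character and the escaped surrogate are rejected unless an error handler is supplied.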
On Tue, Jan 7, 2014 at 10:36 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Daniel Holth writes:
Isn't it true that if you have bytes > 127 or surrogate escapes then encoding to latin1 is no longer as fast as memcpy?
Be careful. As phrased, the question makes no sense. You don't "have bytes" when you are encoding, you have characters.
If you mean "what happens when my str contains characters in the range 128-255?", the answer is encoding a str in 8-bit representation to latin1 is effectively memcpy. If you read in latin1, it's memcpy all the way (unless you combine it with a non-latin1 string, in which case you're in the cases below).
If you mean "what happens when my str contains characters in the range > 255", you have to truncate 16-bit units to 8 bit units; no memcpy.
Surrogates require >= 16 bits; no memcpy.
That is neat.
On 01/07/2014 02:22 PM, Serhiy Storchaka wrote:
Most popular formatting codes in Mercurial sources:
2519 %s 493 %d 102 %r 48 %Y 47 %M 41 %H 39 %S 38 %m 33 %i 29 %b [...]
Are you sure you're not including str[fp]time formats in the count?
Victor Stinner, 06.01.2014 14:24:
``struct.pack()`` is incomplete. For example, a number cannot be formatted as decimal and it does not support padding bytes string.
Then what about extending the struct module in a way that makes it cover more use cases like these? Stefan
2014/1/7 Stefan Behnel <stefan_ml@behnel.de>:
Victor Stinner, 06.01.2014 14:24:
``struct.pack()`` is incomplete. For example, a number cannot be formatted as decimal and it does not support padding bytes string.
Then what about extending the struct module in a way that makes it cover more use cases like these?
The idea of the PEP is to simplify the porting work of Twisted and Mercurial developers. So the same code should work on Python 2 and Python 3. Extending struct features would not help. This is like adding a new type or function in a third-party module: it also requires modifying the source code for Python 2. And struct.pack() does not even support "%s"; the current format for bytes strings requires specifying the length of the string in the format. Juraj Sukop asked me privately to support floating point numbers in PEP 460 for his PDF generator. Would you really like to add many features to the struct module? Padding, formatting integers as decimal (maybe also binary, octal and hexadecimal), formatting floating point numbers as decimal, etc.? Victor
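A short sketch of the struct behaviour referred to here, using only documented format codes; the values are arbitrary. The `s` code needs its length written into the format string and pads short input with null bytes, and integers are packed as fixed-width binary rather than as decimal digits.

```python
import struct

fixed = struct.pack("5s", b"ab")        # length is part of the format; short input is null-padded
clipped = struct.pack("2s", b"abcdef")  # longer input is silently truncated to fit
network = struct.pack(">I", 42)         # 32-bit unsigned big-endian integer, not decimal text
decimal = str(42).encode("ascii")       # decimal digits need a detour through str
```

None of these calls produces the equivalent of a `%s` or `%d` substitution inside a larger template.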
Victor Stinner, 07.01.2014 19:14:
2014/1/7 Stefan Behnel:
Victor Stinner, 06.01.2014 14:24:
``struct.pack()`` is incomplete. For example, a number cannot be formatted as decimal and it does not support padding bytes string.
Then what about extending the struct module in a way that makes it cover more use cases like these?
The idea of the PEP is to simplify the porting work of Twisted and Mercurial developers. So the same code should work on Python 2 and Python 3.
Is it really a requirement that existing Py2 code must work unchanged in Py3? Why can't someone write a third-party library that does what these projects need, and that works in both Py2 and Py3, so that these projects can be modified to use that library and thus get on with their porting to Py3? Or rather one library that does what some projects need and another one that does what other projects need, because it's quite likely that the requirements are not really as largely identical as it seems when seen through the old and milky Py2 glasses. One idea of designing a Py3 was to simplify the language. Getting all Py2 "features" back in doesn't help on that path. If something can easily be done in an external module, I think it should be done there. Stefan
On 06/01/14 13:24, Victor Stinner wrote:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and [snip]
I'm opposed to adding methods to bytes for this, as I think it goes against the reason for the separation of str and bytes in the first place. str objects are pieces of text, a list of unicode characters. In other words they have meaning independent of their context. bytes are just a sequence of 8bit clumps. The meaning of bytes depends on the encoding, but the proposed methods will have no encoding, but presume meaning. What does b'%s' % 7 do? u'%s' % 7 calls 7 .__str__() which returns a (unicode) string. By implication b'%s' % 7 would call 7 .__str__() and ... And then what? Use the "default" encoding? ASCII? Explicit is better than implicit. I am not opposed to adding new functionality, as long as it is not overloading the % operator or format() method. binascii.format() perhaps? Cheers, Mark.
Hi, 2014/1/8 Mark Shannon <mark@hotpy.org>:
I'm opposed to adding methods to bytes for this, as I think it goes against the reason for the separation of str and bytes in the first place.
Well, sometimes practicality beats purity. Many developers complained that Python 3 is too strict. The motivation of the PEP is to ease the transition from Python 2 to Python 3 and to make it possible to write a single code base for the two versions.
bytes are just a sequence of 8bit clumps. The meaning of bytes depends on the encoding, but the proposed methods will have no encoding, but presume meaning.
Many protocols mix ASCII text with binary bytes. For example, an HTTP server writes headers and then copies the content of a binary file (e.g. PNG picture, gzipped HTML page, whatever) *in the same stream*. There are many similar examples. Just another one: PDF mixes ASCII text with binary.
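A minimal sketch of the HTTP case, written with operations that Python 3 bytes already support (concatenation and `join`). The function name `build_response` and the sample body are made up for the example; this is the kind of code that the proposed formatting operations are meant to shorten.

```python
def build_response(body: bytes, content_type: bytes) -> bytes:
    """Assemble ASCII headers followed by an arbitrary binary body."""
    header_lines = [
        b"HTTP/1.1 200 OK",
        b"Content-Type: " + content_type,
        # The length is a number, so it takes a detour through str.
        b"Content-Length: " + str(len(body)).encode("ascii"),
        b"",
        b"",  # the two empty items yield the blank line that ends the headers
    ]
    return b"\r\n".join(header_lines) + body


response = build_response(b"\x89PNG", b"image/png")
```

The headers are ASCII text, but the body is copied through untouched, whatever bytes it contains.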
What does b'%s' % 7 do?
See Examples of the PEP: b'a%sc%s' % (b'b', 4) gives b'abc4' (so b'%s' % 7 gives b'7')
u'%s' % 7 calls 7 .__str__() which returns a (unicode) string. By implication b'%s' % 7 would call 7 .__str__() and ...
Why do you think so? bytes and str will have two separate implementations, but might share some functions. CPython already has a "stringlib" which shares as much code as possible between bytes and str. For example, the "fastsearch" code is shared.
And then what? Use the "default" encoding? ASCII?
Bytes have no encoding. There are just bytes :-) IMO the typical use case will be b'%s: %s' % (b'Header', binary_data)
I am not opposed to adding new functionality, as long as it is not overloading the % operator or format() method.
Ok, I will record your opposition in the PEP.
binascii.format() perhaps?
Please read the Rationale of the PEP again, binascii.format() doesn't solve the described use case. Victor
On Wed, 8 Jan 2014 11:02:19 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:
What does b'%s' % 7 do?
See Examples of the PEP:
b'a%sc%s' % (b'b', 4) gives b'abc4'
[...]
And then what? Use the "default" encoding? ASCII?
Bytes have no encoding. There are just bytes :-)
Therefore you shouldn't accept integers. It does not make sense to format 4 as b'4'.
IMO the typical use case will be b'%s: %s' % (b'Header', binary_data)
Agreed. Regards Antoine.
On 01/08/2014 02:28 AM, Antoine Pitrou wrote:
On Wed, 8 Jan 2014 11:02:19 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:
What does b'%s' % 7 do?
See Examples of the PEP:
b'a%sc%s' % (b'b', 4) gives b'abc4'
[...]
And then what? Use the "default" encoding? ASCII?
Bytes have no encoding. There are just bytes :-)
Therefore you shouldn't accept integers. It does not make sense to format 4 as b'4'.
Agreed. I would have thought that it would result in b'\x04'. -- ~Ethan~
2014/1/8 Ethan Furman <ethan@stoneleaf.us>:
Therefore you shouldn't accept integers. It does not make sense to format 4 as b'4'.
Agreed. I would have thought that it would result in b'\x04'.
The PEP proposes b'%c' % 4 => b'\x04'. Antoine gave me a good argument against supporting b'%s' % int: how would int subclasses be handled? int has no __bytes__() nor __bformat__() method. bytes(int) returns a string of null bytes. It's maybe simpler to only support the %s format with bytes-like objects (bytes, bytearray, memoryview). Victor
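The ambiguity discussed in this exchange shows up with existing Python 3 operations: there are several reasonable ways to turn the integer 4 into bytes, and they give different results. The variable names are chosen only for this illustration.

```python
n = 4

as_zero_fill = bytes(n)                   # n null bytes, as noted in the message above
as_single_byte = bytes([n])               # one byte with value 4
as_decimal_text = str(n).encode("ascii")  # the ASCII digit '4'
as_fixed_width = n.to_bytes(2, "big")     # two-byte big-endian integer
```

Each spelling is explicit about which interpretation is intended, which is the choice a `%s` applied to an int would have to make implicitly.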
On 06.01.2014 14:24, Victor Stinner wrote:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only adding bytes%args?
+1 on doing all of this. I'd simply copy over the Python 2 PyString code and start working from there. Re-adding these features makes life a lot easier in situations where you have to work on data which is encoded text using multiple (sometimes even unknown) encodings in a single data chunk. Think MIME messages, mbox files, diffs, etc. In such situations you often know the encoding of the part you're working on (in most cases ASCII), but not necessarily the encodings of other parts of the chunks. You could work around this by decoding from Latin-1, then using Unicode methods and encoding back to Latin-1, but the risk of letting Mojibake enter your application in uncontrolled ways is high. -- Marc-Andre Lemburg
Hi, 2014/1/8 M.-A. Lemburg <mal@egenix.com>:
I'd simply copy over the Python 2 PyString code and start working from there.
It's not possible to reuse directly all Python 2 code because some helpers have been modified to work on Unicode. The PEP 460 adds also more work to other implementations of Python. IMO some formatting commands must not be implemented. For example, alignment is used to display something on screen, not in network protocols or binary file formats. It's also why the issue #3982 was stuck, we must define exactly the feature set of the new methods (bytes % args, bytes.format). Victor
On Wed, Jan 8, 2014 at 9:12 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
IMO some formatting commands must not be implemented. For example, alignment is used to display something on screen, not in network protocols or binary file formats.
Must not, or need not? I can understand that those sorts of features would be less valuable, but they do make sense. ChrisA
On 08.01.2014 11:12, Victor Stinner wrote:
> Hi,
>
> 2014/1/8 M.-A. Lemburg <mal@egenix.com>:
>> I'd simply copy over the Python 2 PyString code and start working from there.
>
> It's not possible to directly reuse all of the Python 2 code, because some helpers have been modified to work on Unicode. PEP 460 also adds more work for other implementations of Python.
>
> IMO some formatting commands must not be implemented. For example, alignment is used to display something on screen, not in network protocols or binary file formats. It's also why issue #3982 was stuck: we must define exactly the feature set of the new methods (bytes % args, bytes.format).
I'd use practicality beats purity here. As I mentioned in my reply, such formatting methods would indeed be used on data that is text. It's just that this text would be embedded inside an otherwise binary blob. You could do the alignment in Unicode first, then encode it and format it into the binary blob, but really: why bother with that extra round-trip?

The main purpose of the readdition would be to simplify porting applications to Python 3, while keeping them compatible with Python 2 as well. If you need to do the Unicode round-trip just to align a string in some fixed-size field, you might as well convert the whole operation to a function which deals with all this based on whether Python 2 or 3 is running, and you'd lose the intended simplification of the readdition.

PS: The PEP mentions having to code for Python 3.0-3.4 as well, which don't support the new methods. I think it's perfectly fine to have newly ported code require Python 2.7/3.5+. After all, the porting effort will take some time as well.

-- Marc-Andre Lemburg
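The "extra round-trip" described in the message above might look like the following sketch. `pad_field` and the record layout are hypothetical names chosen for illustration; the alignment is done on text, encoded, and then spliced into a binary record.

```python
def pad_field(value, width, encoding="ascii"):
    """Hypothetical helper: left-align text in a field of exactly `width` bytes."""
    padded = "{:<{width}}".format(value, width=width)  # align as text
    encoded = padded.encode(encoding)                   # then encode
    if len(encoded) != width:
        raise ValueError("value does not fit in %d bytes" % width)
    return encoded


# Splice the aligned field into an otherwise binary record.
record = b"".join([b"\x01\x02", pad_field("NAME", 8), b"\x00\xff"])

# For data that is already bytes, bytes.ljust() pads without the round-trip.
same_field = b"NAME".ljust(8)
```

When the value is already a bytes object, `bytes.ljust()` and `bytes.rjust()` cover simple padding; the round-trip only becomes necessary once a format string is involved.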
On Wed, Jan 8, 2014 at 3:40 AM, M.-A. Lemburg <mal@egenix.com> wrote:
> PS: The PEP mentions having to code for Python 3.0-3.4 as well, which don't support the new methods. I think it's perfectly fine to have newly ported code require Python 2.7/3.5+. After all, the porting effort will take some time as well.

tl;dr We must get the relevant library projects involved in this discussion. I prefer Nick's solution to the problem at hand.

<disclaimer>
I've mostly stayed out of this discussion because I neither have many unicode-related use-cases nor a deep understanding of all the issues. However, my investment in the community is such that I've been following these discussions and hope to add what I can in what few places I chime in. :)
</disclaimer>

Requiring 3.5 may be tricky though. How soon will 3.5 show up in OS distros or be their system Python? Getting 3.5 on their system may not be a simple option for some (perhaps too few to matter?) and may be seen as too onerous to others.

This effort is meant to ease porting to Python 3 and not as just a carrot like most other new features. It boils down to 3.5 being *the* target for porting from 2.7.

Otherwise we'd be better off adding a new type to 3.5 for the wire-protocol use cases and providing a 2.7/3.x backport on the cheeseshop that would facilitate porting such code bases to 3.5. My understanding is that is basically what Nick has proposed (sorry, Nick, if I've misunderstood). The latter approach makes more sense to me.

However, it seems like this whole discussion is motivated by a particular group of library projects. Regardless of what we discuss or the solutions on which we resolve, we'd be making a mistake if we did not do our utmost to ensure those projects are directly involved in these discussions.

-eric
On Wed, 8 Jan 2014 11:16:49 -0700 Eric Snow <ericsnowcurrently@gmail.com> wrote:
> It boils down to 3.5 being *the* target for porting from 2.7.

No. Please let's stop being self-deprecating. 3.3 is fine as a porting target, as the many high-profile libraries which have already been ported can attest.

> Otherwise we'd be better off adding a new type to 3.5 for the wire-protocol use cases
I'm completely opposed to a new type. Regards Antoine.
Hi Victor, On Mon, 6 Jan 2014 14:24:50 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:
> Hi,
>
> bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
There is a good use case at: https://mail.python.org/pipermail/python-ideas/2014-January/024803.html Regards Antoine.
On Mon, Jan 6, 2014 at 6:24 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
> Abstract
> ========
>
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to Python 3.5.
>
> Rationale
> =========
>
> ``bytes % args`` and ``bytes.format(args)`` have been removed in Python 3. This operator and this method are requested by Mercurial and Twisted developers to ease porting their projects to Python 3.
>
> Python 3 suggests formatting text first and then encoding to bytes. In some cases, it does not make sense because arguments are bytes strings. Typical usage is a network protocol which is binary, since data are sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP, POP, FTP are ASCII commands interspersed with binary data.
>
> Using multiple ``bytes + bytes`` instructions is inefficient because it requires temporary buffers and copies which are slow and waste memory. Python 3.3 optimizes ``str2 += str1`` but not ``bytes2 += bytes1``.
>
> ``bytes % args`` and ``bytes.format(args)`` have been requested since 2008, even before the first release of Python 3.0 (see issue #3982).
>
> ``struct.pack()`` is incomplete. For example, a number cannot be formatted as decimal and it does not support padding bytes strings.
>
> Mercurial 2.8 still supports Python 2.4.

As an alternative, we could provide an import hook via some channel (cheeseshop? recipe?) that converts just b'' formatting into some Python 3 equivalent (when run under Python 3). The argument against such import hooks is usually that they have an adverse impact on the output of tracebacks. However, I'd expect most b'' formatting to happen on a single line and that the replacement source would stay on that single line. Such an import hook would lessen the desire for bytes formatting.

As I mentioned elsewhere, Nick's counter-proposal of a separate wire-protocol-friendly type makes more sense to me than adding formatting to Python 3's bytes type. As others have opined, formatting a bytes object is out of place. The need is limited in scope and audience, but apparently real. Adding that capability directly to bytes in 3.5 should be a last resort to which we appeal only when we exhaust our other options.

-eric
On Wed, 8 Jan 2014 11:59:51 -0700 Eric Snow <ericsnowcurrently@gmail.com> wrote:
> As others have opined, formatting a bytes object is out of place.
However, interpolating a bytes object isn't out of place, and it is what a minimal "formatting" primitive could do. Regards Antoine.
Antoine Pitrou writes:
> However, interpolating a bytes object isn't out of place, and it is what a minimal "formatting" primitive could do.

Something like this?

    # VERY incomplete pseudo-code
    class str:
        # new method
        # fmtstring has syntax of .format method's spec, maybe adding a 'B'
        # for "insert Blob of bytes" spec
        def format_for_wire(fmtstring, args, encoding='utf-8', errors='strict'):
            result = b''
            # gotta go to a meeting, exercise for reader :-(
            parts = zip_specs_and_args(fmtstring, args)
            for spec, arg in parts:
                if spec == 'B' and isinstance(arg, bytes):
                    # blobs of bytes are inserted untouched
                    result += arg
                else:
                    # everything else is formatted as text, then encoded
                    partial = format(arg, spec)
                    result += partial.encode(encoding=encoding, errors=errors)
            return result

Maybe format_to_bytes is a more accurate name. I have no idea how to do this for %-formatting though. :-(

And I have the sneaking suspicion that it *can't* be this easy. :-( Can it? :-)
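One possible way to fill in the "exercise for reader" part of the pseudo-code above is to let `string.Formatter().parse()` split the format string. This is only a sketch under stated assumptions: `format_to_bytes` and the `'B'` spec are the hypothetical names from the message, arguments are consumed strictly in order (field names and `!r`-style conversions are ignored), and nothing here is an existing str method.

```python
from string import Formatter


def format_to_bytes(fmtstring, *args, encoding="utf-8", errors="strict"):
    """Hypothetical helper: format text fields, splice bytes arguments verbatim."""
    out = bytearray()
    remaining = iter(args)
    # parse() yields (literal_text, field_name, format_spec, conversion).
    for literal, field, spec, _conversion in Formatter().parse(fmtstring):
        out += literal.encode(encoding, errors)
        if field is None:
            continue  # trailing literal text, no replacement field
        arg = next(remaining)
        if spec == "B":
            # Assumed "blob" spec: the argument must already be bytes.
            if not isinstance(arg, (bytes, bytearray)):
                raise TypeError("'B' spec requires a bytes or bytearray argument")
            out += arg
        else:
            # Everything else is formatted as text, then encoded.
            out += format(arg, spec).encode(encoding, errors)
    return bytes(out)


message = format_to_bytes("Content-Length: {:d}\r\n\r\n{:B}", 3, b"\xff\x00\x01")
```

The sketch sidesteps %-formatting entirely, which the message above also leaves open.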
Victor Stinner, 06.01.2014 14:24:
> Abstract
> ========
>
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to Python 3.5.
Here is a counterproposal. Let someone who needs this feature write a library that does byte string formatting. That properly handles it, a full featured tool set. Write it in Cython if you need raw speed, that will also help in making it run in both Python 2 and Python 3, or in providing easy integration with buffers like the array module, various byte containers, NumPy, etc. I'm confident that this will show that the current Py2 code that (legitimately) does byte string formatting can actually be improved, simplified or sped up, at least in some corners. I'm sure Py2 byte string formatting wasn't perfect for this use case either, it just happened to be there, so everyone used it and worked around its particular quirks for the particular use case at hand. (Think of "%s" % some_unicode_value, for example.) Instead of waiting for 3.5, a third party library allows users to get started porting their code earlier, and to make it work unchanged with Python versions before 3.5. Stefan
On Wed, Jan 8, 2014 at 2:17 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Victor Stinner, 06.01.2014 14:24:
>> Abstract
>> ========
>>
>> Add ``bytes % args`` operator and ``bytes.format(args)`` method to Python 3.5.
>
> Here is a counterproposal. Let someone who needs this feature write a library that does byte string formatting. That properly handles it, a full featured tool set. Write it in Cython if you need raw speed, that will also help in making it run in both Python 2 and Python 3, or in providing easy integration with buffers like the array module, various byte containers, NumPy, etc.
>
> I'm confident that this will show that the current Py2 code that (legitimately) does byte string formatting can actually be improved, simplified or sped up, at least in some corners. I'm sure Py2 byte string formatting wasn't perfect for this use case either, it just happened to be there, so everyone used it and worked around its particular quirks for the particular use case at hand. (Think of "%s" % some_unicode_value, for example.)
>
> Instead of waiting for 3.5, a third party library allows users to get started porting their code earlier, and to make it work unchanged with Python versions before 3.5.

Maybe we can enumerate some of the stated drawbacks of b''.format():

Convenient string processing tools for bytes will make people ignore Unicode or fail to notice it or do it wrong? (As opposed to the alternative causing them to learn how to process and produce Unicode correctly?)

Similar APIs on bytes and str will prevent implicit "assert isinstance(x, str)" checks?

More-prevalent bytes will propagate across the program causing bugs? A-la open(b'filename').name vs open('filename').name ?

It will take a long time.

Hopeful benefits may include easier porting and greater Py3 adoption, less encoding dances and/or decoding non-Unicode into Unicode just to make things work, hopefully fewer surrogate-encoded bytes and therefore fewer encoding-bugs-distant-from-source-of-invalid-text, ...
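The `open(b'filename').name` remark above refers to the way a bytes argument propagates into later results. A small sketch, using a throwaway temporary directory and a made-up file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")

    # Opened with a str path: .name is a str.
    with open(path, "w") as f:
        f.write("x")
        name_from_str = f.name

    # Opened with the same path as bytes: .name is bytes.
    with open(os.fsencode(path)) as f:
        name_from_bytes = f.name

print(type(name_from_str).__name__)    # str
print(type(name_from_bytes).__name__)  # bytes
```

Code that later does something like `name.endswith(".txt")` works with one of these values and raises TypeError with the other, far away from the place where the bytes path was introduced.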
Hi,

Another remark about the PEP: it should define bytearray % args and bytearray.format(args) as well.

Regards

Antoine.
Hi,

With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).

Regards

Antoine.
On 9 Jan 2014 06:43, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
> Hi,
>
> With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

+1 I was initially dubious about the idea, but the proposed semantics look good to me. We should probably include format_map for consistency with the str API.

> However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).
So we'd define the *format* string as mutable to get a mutable result out of the formatting operations? This seems a little weird to me. It also seems weird for a format method on a mutable type to *not* perform in-place mutation. On the other hand, I don't see another obvious way to control the output type. Cheers, Nick.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On Fri, 10 Jan 2014 05:26:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
> We should probably include format_map for consistency with the str API.

Yes, you're right.

>> However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).
>
> So we'd define the *format* string as mutable to get a mutable result out of the formatting operations? This seems a little weird to me.
>
> It also seems weird for a format method on a mutable type to *not* perform in-place mutation.

It's consistent with bytearray.join's behaviour:

    >>> x = bytearray()
    >>> x.join([b"abc"])
    bytearray(b'abc')
    >>> x
    bytearray(b'')
Regards Antoine.
On 10 Jan 2014 03:32, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
> On Fri, 10 Jan 2014 05:26:04 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
>> We should probably include format_map for consistency with the str API.
>
> Yes, you're right.
>
>>> However, I also added bytearray into the mix, as bytearray objects should generally support the same operations as bytes (and they can be useful *especially* for network programming).
>>
>> So we'd define the *format* string as mutable to get a mutable result out of the formatting operations? This seems a little weird to me.
>>
>> It also seems weird for a format method on a mutable type to *not* perform in-place mutation.
>
> It's consistent with bytearray.join's behaviour:
>
>     >>> x = bytearray()
>     >>> x.join([b"abc"])
>     bytearray(b'abc')
>     >>> x
>     bytearray(b'')
Yeah, I guess I'm OK with us being consistent on that one. It's still weird, but also clearly useful :) Will the new binary format ever call __format__? I assume not, but it's probably best to make that absolutely explicit in the PEP. Cheers, Nick.
On Fri, 10 Jan 2014 11:32:05 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
>> It's consistent with bytearray.join's behaviour:
>>
>>     >>> x = bytearray()
>>     >>> x.join([b"abc"])
>>     bytearray(b'abc')
>>     >>> x
>>     bytearray(b'')
>
> Yeah, I guess I'm OK with us being consistent on that one. It's still weird, but also clearly useful :)
>
> Will the new binary format ever call __format__? I assume not, but it's probably best to make that absolutely explicit in the PEP.
Not indeed. I'll add that to the PEP, thanks. cheers Antoine.
On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
> With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.

From the PEP:
=============

> Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you may want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque "textual" values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!

As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.

> All other features present in formatting of str objects (either through the percent operator or the str.format() method) are unsupported. Those features imply treating the recipient of the operator or method as text, which goes counter to the text / bytes separation (for example, accepting %d as a format code would imply that the bytes object really is an ASCII-compatible text string).
No, it implies that portion of the byte stream is ASCII compatible. And we have several examples: PDF, HTML, DBF, just about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of. -1 on the PEP as it stands now. -- ~Ethan~
On Fri, 10 Jan 2014 16:23:53 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
> As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.

Indeed I refuse for %-formatting to allow the mixing of bytes and str objects, in the same way that it is forbidden to concatenate "a" and b"b" together, or to write b"".join(["abc"]).

Python 3 was made *precisely* because the implicit conversion between ASCII unicode and bytes is deemed harmful. It's completely counter-productive and misleading for our users to start muddying the message by introducing exceptions to that rule.

Regards

Antoine.
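The two forbidden operations named in the message above can be checked directly. In this sketch `raises_type_error` is a hypothetical helper written only for the demonstration; both expressions raise TypeError on Python 3, and the mix only works once the str is encoded explicitly.

```python
def raises_type_error(func):
    """Hypothetical helper: report whether calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# Mixing str and bytes is rejected in both cases ...
concat_rejected = raises_type_error(lambda: "a" + b"b")
join_rejected = raises_type_error(lambda: b"".join(["abc"]))

# ... and only an explicit encoding step makes the combination legal.
explicit = "a".encode("ascii") + b"b"
```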
On 1/10/2014 8:12 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 16:23:53 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/08/2014 02:42 PM, Antoine Pitrou wrote:
With Victor's consent, I overhauled PEP 460 and made the feature set more restricted and consistent with the bytes/str separation.
From the PEP: =============
Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where you may want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque "textual" values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!
As it stands now, the PEP talks about ASCII, about how it makes sense sometimes to work directly with bytes objects, and then refuses to allow % to embed ASCII text in the byte stream.
Indeed I refuse for %-formatting to allow the mixing of bytes and str objects, in the same way that it is forbidden to concatenate "a" and b"b" together, or to write b"".join(["abc"]).
I think:

    'a' + b'b'

is different from:

    b'Content-Length: %d' % 42

The former we want to prevent, but I see nothing wrong with the latter. So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .

Eric.
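For illustration, here is a minimal sketch of the two str/bytes mixing operations Antoine cites, showing that Python 3 rejects both with a TypeError, followed by the explicit-encoding spelling of Eric's Content-Length case. The helper name raises_type_error is made up for this sketch; the interpolating form b'Content-Length: %d' % 42 is deliberately not executed, since whether it works is exactly what the thread is debating.

```python
def raises_type_error(operation):
    """Hypothetical helper: return True if calling operation() raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


# Both forms of implicit str/bytes mixing are refused by Python 3.
concat_rejected = raises_type_error(lambda: "a" + b"b")
join_rejected = raises_type_error(lambda: b"".join(["abc"]))

# The explicit spelling: format the number as text, then encode it as ASCII.
header = b"Content-Length: " + str(42).encode("ascii")

print(concat_rejected, join_rejected, header)
```

The last assignment is the workaround that opponents of the restricted PEP consider too verbose for such a common operation.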
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough". (I don't care much personally, I think the issue is quite overblown anyway) Regards Antoine.
On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
Heh, and here I thought it was stubborn opposition in the name of purity. ;)
(I don't care much personally, I think the issue is quite overblown anyway)
Is it safe to assume you don't use Python for the use-cases under discussion? Specifically, mixed ASCII, binary, and encoded-text byte streams? -- ~Ethan~
On Fri, 10 Jan 2014 18:28:41 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
Is it safe to assume you don't use Python for the use-cases under discussion?
You know, I've done quite a bit of network programming. I've also done an experimental port of Twisted to Python 3. I know what a network protocol with ill-defined encodings looks like. Regards Antoine.
On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 18:28:41 -0800 Ethan Furman wrote:
Is it safe to assume you don't use Python for the use-cases under discussion?
You know, I've done quite a bit of network programming.
No, I didn't, that's why I asked.
I've also done an experimental port of Twisted to Python 3. I know what a network protocol with ill-defined encodings looks like.
Can you give a code sample of what you think, for example, the PDF generation code should look like? (If you already have, I apologize -- I missed it in all the ruckus.) -- ~Ethan~
On 01/10/2014 06:39 PM, Antoine Pitrou wrote:
I know what a network protocol with ill-defined encodings looks like.
For the record, I've been (and I suspect Eric and some others have also been) talking about well-defined encodings. For the DBF files that I work with, there is binary, ASCII, and a third encoding that is specified in the file header. -- ~Ethan~
On 11 January 2014 12:28, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
Heh, and here I thought it was stubborn opposition in the name of purity. ;)
No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".

Yes, we know we changed the text model and knocked wire protocols off their favoured perch, and we're (thoroughly) aware of the fact that wire protocol developers don't like the fact that the default model now strongly favours the vastly more common case of application development.

However, until Benno volunteered to start experimenting with implementing an asciistr type yesterday, there have been *zero* meaningful attempts at trying to solve the issues with wire protocol manipulation outside the Python 3 core - instead there has just been a litany of whining that Python 3 is different from Python 2, and a complete and total refusal to attempt to understand *why* we changed the text model.

The answer *should* be obvious: the POSIX based text model in Python 2 makes web frameworks easier to write at the expense of making web applications *harder* to write, and the same is true for every other domain where the wire protocol and file format handling is isolated to widely used frameworks and support libraries, with the application code itself operating mostly on text and structured data.

With the Python 3 text model, we decided that was a terrible trade-off, so the core text model now *strongly* favours application code. This means that it is now *boundary* code that may need additional helper types, because the core types aren't necessarily going to cover all those use cases any more. In particular, the bytes type is, and always will be, designed for pure binary manipulation, while the str type is designed for text manipulation. The weird kinda-text-kinda-binary 8-bit builtin type is gone, and *deliberately* so.
I've been saying for years that people should experiment with creating a Python 3 extension type that behaves more like the Python 2 str type. For the standard library, we've never hit a case where the explicit encoding and decoding was so complicated that creating such a type seemed simpler, so *we're* not going to do it. After discussing it with me at LCA, Benno Rice offered to try out the idea, just to determine whether or not it was actually possible. If there are any CPython bugs that mean the idea *doesn't* currently work (such as interoperability issues in the core types), then I'm certainly happy for us to fix *those*. But we're never ever going to change the core text model back to the broken POSIX one, or even take steps in that direction. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I don't know what the fuss is about. This isn't about breaking the text model. It's about a convenient way to turn text into bytes in a default, lenient way. Not the other way round.

Here's my proposal:

    b'foo%sbar' % (a,)

would implicitly apply the equivalent of the following function to every object in the tuple:

    def coerce_ascii(o):
        if has_bytes_interface(o):
            return o
        return o.encode('ascii', 'strict')

There's no need for special %d or %f formatting. If more fanciful formatting is required, e.g. exponents or precision, then by all means, do it in the str domain:

    b'foo%sbar' % ("%.15f" % (42.2,))

Basically, let's just support simple bytes interpolation that will support coercing into bytes by means of strict ascii. It's a one way convenience, explicitly requested, and for consenting adults.

-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Nick Coghlan
Sent: 11 January 2014 08:43
To: Ethan Furman
Cc: python-dev@python.org
Subject: Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".
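The proposal above can be approximated in pure Python today. What follows is only a sketch under stated assumptions: has_bytes_interface is not an existing function, so it is modelled here as "supports the buffer protocol" by probing with memoryview, and interpolate is a made-up stand-in for the proposed b'...' % args behaviour that handles nothing but bare %s placeholders (no %% escaping, no widths).

```python
def has_bytes_interface(obj):
    """Assumed meaning: obj supports the buffer protocol (bytes, bytearray, memoryview...)."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


def coerce_ascii(obj):
    """Pass bytes-like objects through; encode str as strict ASCII.

    Anything else (for example an int) fails with AttributeError here,
    because it has no encode() method.
    """
    if has_bytes_interface(obj):
        return bytes(obj)
    return obj.encode("ascii", "strict")


def interpolate(template, args):
    """Hypothetical stand-in for the proposed bytes % tuple: fill each b'%s' in order."""
    parts = template.split(b"%s")
    if len(parts) - 1 != len(args):
        raise TypeError("placeholder count does not match argument count")
    out = bytearray(parts[0])
    for arg, tail in zip(args, parts[1:]):
        out += coerce_ascii(arg)
        out += tail
    return bytes(out)


plain = interpolate(b"foo%sbar", ("abc",))
binary = interpolate(b"foo%sbar", (b"\xff\x00",))
number = interpolate(b"foo%sbar", ("%.15f" % (42.2,),))
print(plain, binary, number)
```

A str argument containing non-ASCII characters raises UnicodeEncodeError under this sketch, which is the "strict ascii" behaviour the proposal asks for.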
On 2014-01-11, 10:56 GMT, you wrote:
I don't know what the fuss is about.
I just cannot resist:

When you are calm while everybody else is in the state of panic, you haven’t understood the problem. -- one of many collections of Murphy’s Laws

Matěj
Am 11.01.2014 09:43, schrieb Nick Coghlan:
On 11 January 2014 12:28, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
Heh, and here I thought it was stubborn opposition in the name of purity. ;)
No, it's "the POSIX text model is completely broken and we're not letting people bring it back by stealth because they want to stuff their esoteric use case back into the builtin data types instead of writing their own dedicated type now that the builtin types don't handle it any more".
Yes, we know we changed the text model and knocked wire protocols off their favoured perch, and we're (thoroughly) aware of the fact that wire protocol developers don't like the fact that the default model now strongly favours the vastly more common case of application development.
However, until Benno volunteered to start experimenting with implementing an asciistr type yesterday, there have been *zero* meaningful attempts at trying to solve the issues with wire protocol manipulation outside the Python 3 core
Can we please also include pseudo-binary file formats? It's not "just" wire protocols. Georg
On 01/11/2014 12:43 AM, Nick Coghlan wrote:
In particular, the bytes type is, and always will be, designed for pure binary manipulation [...]
I apologize for being blunt, but this is a lie. Let's take a look at the methods defined by bytes:

    >>> dir(b'')
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
     '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
     '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__',
     '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__',
     '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__',
     '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
     'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs',
     'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
     'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
     'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust',
     'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
     'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods? Let's take a look at the repr of bytes:
    >>> bytes([48, 49, 50, 51])
    b'0123'
Wow, that sure doesn't look like binary data!

Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes). If the aim was indeed for pure binary manipulation, we failed. We left in bunches of methods which can *only* be interpreted as supporting ASCII manipulation.

Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class screaming "I want to be ASCII! I want to be ASCII!" or add back the missing functionality.

-- ~Ethan~
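The point about ASCII-aware behaviour can be checked directly. The sketch below, with variable names chosen only for illustration, shows the repr rendering printable ASCII, a few of the listed methods treating byte values as ASCII characters, and upper() leaving a non-ASCII byte untouched.

```python
# Four integers in the ASCII digit range display as text in the repr.
data = bytes([48, 49, 50, 51])
shown = repr(data)

# isdigit() and zfill() interpret byte values as ASCII characters.
all_digits = data.isdigit()
padded = b"42".zfill(5)

# upper() only maps ASCII a-z; the byte 0xE9 is passed through unchanged.
mixed = b"caf\xe9 au lait".upper()

print(shown, all_digits, padded, mixed)
```

None of this decodes the data: the methods operate byte by byte and simply ignore values outside the ASCII range, which is the behaviour both sides of the thread are interpreting differently.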
On 12 Jan 2014 03:29, "Ethan Furman" <ethan@stoneleaf.us> wrote:
On 01/11/2014 12:43 AM, Nick Coghlan wrote:
In particular, the bytes type is, and always will be, designed for pure binary manipulation [...]
I apologize for being blunt, but this is a lie.
Lets take a look at the methods defined by bytes:
dir(b'')
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods?
Do you think I don't know that? However, those are all *in-place* modifications. Yes, they assume ASCII compatible formats, but they're a far cry from encouraging combination of data from potentially different sources. I'm also on record as considering this a design decision I regret, precisely because it has resulted in experienced Python 2 developers failing to understand that the Python 3 text model is *different* and they may need to create a new type.
Let's take a look at the repr of bytes:
bytes([48, 49, 50, 51])
b'0123'
Wow, that sure doesn't look like binary data!
Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes). If the aim was indeed for pure binary manipulation, we failed. We left in bunches of methods which can *only* be interpreted as supporting ASCII manipulation.
No, no, no. We made some concessions in the design of the bytes type to *ease* development and debugging of ASCII compatible protocols *where we believed we could do so without compromising the underlying text model changes*. Many experienced Python 2 developers are now suffering one of the worst cases of paradigm lock I have ever seen as they keep trying to make the Python 3 text model the same as the Python 2 one instead of actually learning how Python 3 works and recognising that they may actually need to create a new type for their use case and then potentially seek core dev assistance if that type reveals new interoperability bugs in the core types (or encounters old ones).
Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class screaming "I want to be ASCII! I want to be ASCII!" or add back the missing functionality.
No, we don't - we treat the core bytes type as PEP 460 does, by adding a *new* feature proposed by a couple of people writing native Python 3 libraries like asyncio that makes binary formats easier to deal with without carrying forward even *more* broken assumptions from the Python 2 text model. (Remember, I'm in favour of Antoine's updated PEP, because it's a real spec for a new feature, rather than yet another proposal to bolt on even more text specific formatting features from someone that has never bothered to understand the reasons for the differences between the two versions). People that want a full hybrid type back can then pursue the custom extension type approach. Cheers, Nick.
Am 11.01.2014 03:04, schrieb Antoine Pitrou:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
I agree.
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
(I don't care much personally, I think the issue is quite overblown anyway)
So you wouldn't mind another overhaul of the PEP including a bit more functionality again? :) I really think that practicality beats purity here. (I'm not advocating free mixing bytes and str, mind you!) Georg
On Sat, 11 Jan 2014 08:26:57 +0100 Georg Brandl <g.brandl@gmx.net> wrote:
Am 11.01.2014 03:04, schrieb Antoine Pitrou:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
I agree.
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
(I don't care much personally, I think the issue is quite overblown anyway)
So you wouldn't mind another overhaul of the PEP including a bit more functionality again? :) I really think that practicality beats purity here. (I'm not advocating free mixing bytes and str, mind you!)
The PEP already proposes a certain amount of practicality. I personally *would* mind adding %d and friends to it. But of course someone can fork the PEP or write another one. Regards Antoine.
For not caring much, your own stubbornness is quite notable throughout this discussion. Stones and glass houses. :)

That said: Twisted and Mercurial aren't the only ones who are hurt by this, at all. I'm aware of at least two other projects who are actively hindered in their support or migration to Python 3 by the bytes type not having some basic functionality that "strings" had in 2.0.

The purity crowd in here has brought up that it was an important and serious decision to split Text from Bytes in Py3, and I actually agree with that. However, it is missing some very real and very concrete use-cases -- there are multiple situations where there are byte streams which have a known text-subset which they really, really do need to operate on.

There's been a number of examples given: PDF, HTTP, network streams that switch inline from text-ish to binary and back again. But we can focus that down to a very narrow and not at all uncommon situation in the latter.

Look at the HTTP Content-Length header. HTTP headers are fuzzy. My understanding is, per the RFCs, their body can be arbitrary octets to the exclusion of line feeds and DELs -- my understanding may be a bit off here, and please feel free to correct me -- but the relevant specifications are a bit fuzzy to begin with. To my understanding of the spec, the header field name is essentially an ASCII text field (sans separator), and the body is... anything, or nearly anything. This is HTTP, which is surely one of the most used protocols in the world. The need to be able to assemble and disassemble such streams is a real, valid use-case.

But looking at it, now look to the Content-Length header I mentioned. It seems those who are declaring a purity priority in bytes/string separation think it reasonable to do things like:

    headers.append((b"Content-Length", ("%d" % len(content)).encode("ascii")))

Or something.
In the middle of processing a stream, you need to convert this number into a string then encode it into bytes to just represent the number as the extremely common, widely-accessible 7-bit ascii subset of its numerical value. This isn't some rare, grandiose or fiendish undertaking, or trying to merge Strings and Bytes back together: this is the simple practical recognition that representing a number as its ascii-numerical value is actually not at all uncommon.

This position seems utterly astonishing in its ridiculousness to me. The recognition that the number "123" may be represented as b"123" surprises me as a controversial thing, considering how often I see it in real life.

There is a LOT of code out there which needs a little bit of a middle ground between bytes and strings; it doesn't mean you are giving way and allowing strings and bytes to merge and giving up on the Edict of Separation. But there are real world use-cases where you simply need to be able to do many basic "String" like operations on byte-streams.

The removal of the ability to use interpolation to construct such byte strings was a major regression in python 3 and is a big hurdle for more than a few projects to upgrade.

I mean, it's not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--

    Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"stephen hansen".title()
    b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?

I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.

Personally, I won't be converting my day job's codebase to Python 3 anytime soon (where 'soon' is defined as 'within five years, assuming a best-case scenario that a number of third-party issues are resolved'). But! I'm aware of and involved with other projects, and this has bitten two of them specifically. I'm sure there are others who are not aware of this list or don't feel comfortable talking on it (as it is, I encouraged one of the projects' coders to speak up, but they thought the question was a lost one due to previous responses on the original issue ticket and gave up).

On Fri, Jan 10, 2014 at 6:04 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 10 Jan 2014 20:53:09 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
So, I'm -1 on the PEP. It doesn't address the cases laid out in issue 3982. See for example http://bugs.python.org/issue3982#msg180432 .
Then we might as well not do anything, since any attempt to advance things is met by stubborn opposition in the name of "not far enough".
(I don't care much personally, I think the issue is quite overblown anyway)
Regards
Antoine.
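As a rough illustration of the Content-Length case described in the message above, here is a small sketch of assembling an HTTP-style message whose header block is ASCII and whose body is opaque binary, using only operations that work without bytes interpolation. The function names content_length_header and build_message, and the sample header values, are assumptions made for this sketch and do not come from any library mentioned in the thread.

```python
def content_length_header(content):
    """Return the header pair with the body length rendered as ASCII digits."""
    # The number goes through str formatting first, then an explicit encode.
    return (b"Content-Length", ("%d" % len(content)).encode("ascii"))


def build_message(headers, body):
    """Join (name, value) byte pairs into a header block and append the raw body."""
    lines = [name + b": " + value for name, value in headers]
    return b"\r\n".join(lines) + b"\r\n\r\n" + body


body = b"\x00\x01binary payload"
headers = [
    (b"Content-Type", b"application/octet-stream"),
    content_length_header(body),
]
message = build_message(headers, body)
print(message)
```

The explicit format-then-encode step in content_length_header is the part the message objects to; everything else is plain bytes concatenation and join.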
On 1/11/2014 1:44 AM, Stephen Hansen wrote:
There's been a number of examples given: PDF, HTTP, network streams that switch inline from text-ish to binary and back-again.. But, we can focus that down to a very narrow and not at all uncommon situation in the latter.
PDF has been mentioned a few times. ReportLab recently decided to convert to Python 3, and fairly quickly (from my perspective, it took them a _long_ time to decide to port, but once they decided to, then it seemed quick) produced an alpha version that passes many of their tests. I've not tried it yet, although it interests me, as I have some Python 2 code written only because ReportLab didn't support Python 3, and I wanted to generate some PDF files. I'll be glad to get rid of the Python 2 code, once they are released.

But I guess they figured out a solution that wasn't onerous. I'd have to go re-read the threads to be sure, but it seems they are running one code base for both... not sure of the details of what techniques they used, or if they ever used the % operator :)

But I'm wondering, since they did what they did so quickly, if the "mixed bytes and str" use case is mostly, in fact, a mind-set issue... yes, likely some code has to change, but maybe the changes really aren't all that significant.

I wouldn't want to drag them into this discussion, I'd rather they get the port complete, but it would be interesting to know what they did, and how they did it, and what problems they had, etc. If anyone here knows that code a bit, perhaps the diffs could be examined in their repository to figure out what they did, and how much it impacted their code. I do know they switched XML parsers along the way, as well as dealing with string handling differences.
Am 11.01.2014 10:44, schrieb Stephen Hansen:
I mean, its not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
b"stephen hansen".title() b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?
I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
This. Exactly. Thanks for putting it so nicely, Stephen. Georg
Am 11.01.2014 14:49, schrieb Georg Brandl:
Am 11.01.2014 10:44, schrieb Stephen Hansen:
I mean, its not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
b"stephen hansen".title() b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?
I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
This. Exactly. Thanks for putting it so nicely, Stephen.
To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated "asciistr" type right now. But it has the functionality, and it's way too late to remove it. Georg
On 11.01.2014 14:54, Georg Brandl wrote:
Am 11.01.2014 14:49, schrieb Georg Brandl:
Am 11.01.2014 10:44, schrieb Stephen Hansen:
I mean, its not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
b"stephen hansen".title() b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?
I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
This. Exactly. Thanks for putting it so nicely, Stephen.
To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated "asciistr" type right now. But it has the functionality, and it's way too late to remove it.
I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen.

I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes. We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.

If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways of which many will probably miss a few edge cases.

bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.

BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life? Most HTTP packets fall into this category, many email messages as well.

And let's not forget that we don't live in a perfect world. Broken encodings are everywhere around you - just have a look at your spam folder for a decent chunk of example data :-)

-- Marc-Andre Lemburg
On 12 January 2014 01:15, M.-A. Lemburg <mal@egenix.com> wrote:
On 11.01.2014 14:54, Georg Brandl wrote:
Am 11.01.2014 14:49, schrieb Georg Brandl:
Am 11.01.2014 10:44, schrieb Stephen Hansen:
I mean, its not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
    >>> b"stephen hansen".title()
    b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?
I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
This. Exactly. Thanks for putting it so nicely, Stephen.
To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated "asciistr" type right now. But it has the functionality, and it's way too late to remove it.
I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen.
I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.
We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset. We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an... While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.
If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways of which many will probably miss a few edge cases.
bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea. Anyone that is pushing for this should be checking out Benno's first draft experimental prototype for asciistr and be working on getting it passing the test suite I created: https://github.com/jeamland/asciicompat The "Wah, you broke it and now I have completely forgotten how to create custom types, so I'm just going to piss and moan until somebody else fixes it" infantilism of the past five years in this regard has frankly pissed me off. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 12 Jan 2014 01:34:26 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.
+1 to what Nick says. Forcing some constructs to be explicit leads people to know about the issue and understand it, rather than sweep it under the carpet as Python 2 encouraged them to do. Yes, if you're dealing with a file format or network protocol, you'd better know in which charset its textual information is being expressed. It's a very sane question to ask yourself! Regards Antoine.
On 11.01.2014 16:34, Nick Coghlan wrote:
On 12 January 2014 01:15, M.-A. Lemburg <mal@egenix.com> wrote:
On 11.01.2014 14:54, Georg Brandl wrote:
Am 11.01.2014 14:49, schrieb Georg Brandl:
Am 11.01.2014 10:44, schrieb Stephen Hansen:
I mean, its not like the "bytes" type lacks knowledge of the subset of bytes that happen to be 7-bit ascii-compatible and can't perform text-ish operations on them--
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b"stephen hansen".title()
b'Stephen Hansen'
How is this not a practical recognition that yes, while bytes are byte streams and not text, a huge subset of bytes are text-y, and as long as we maintain the barrier between higher characters and implicit conversion therein, we're fine?
I don't see the difference here. There is a very real, practical need to interpolate bytes. This very real, practical need includes the very real recognition that converting 12345 to b'12345' is not something weird, unusual, and subject to the thorny issues of Encodings. It is not violating the doctrine of separation of powers between Text and Bytes.
This. Exactly. Thanks for putting it so nicely, Stephen.
To elaborate: if the bytes type didn't have all this ASCII-aware functionality already, I think we would have (and be using) a dedicated "asciistr" type right now. But it has the functionality, and it's way too late to remove it.
I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen.
I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.
We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.
We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX: http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an...
While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.
FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)
Seriously, Unicode has always caused heated discussions and I don't expect this to change in the next 5-10 years.
The point is: there is no 100% perfect solution either way and when you acknowledge this, things don't look black and white anymore, but instead full of colors :-)
Python 3 forces people to actually use Unicode; in Python 2 they could easily avoid it. It's good to educate people on how it's used and the issues you can run into, but let's not forget that people are trying to get work done and we all love readable code.
PEP 460 just adds two more methods to the bytes object which come in handy when formatting binary data; I don't think it has potential to muddy the Python 3 text model, given that the bytes object already exposes a dozen other ASCII text methods :-)
If you give programmers the choice they will - most of the time - do the right thing. If you don't give them the tools, they'll work around the missing features in a gazillion different ways of which many will probably miss a few edge cases.
bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea.
Anyone that is pushing for this should be checking out Benno's first draft experimental prototype for asciistr and be working on getting it passing the test suite I created: https://github.com/jeamland/asciicompat
The "Wah, you broke it and now I have completely forgotten how to create custom types, so I'm just going to piss and moan until somebody else fixes it" infantilism of the past five years in this regard has frankly pissed me off.
Ah, you see: we're entering heated discussions again :-)
asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2).
At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would, then you'd write
    ...
    headers += asciistr('Length: %i bytes\n' % 123)
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
With PEP 460, you could write the above as:
    ...
    headers += b'Length: %i bytes\n' % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
IMO, that's more readable.
Both variants essentially do the same thing: they implicitly coerce ASCII text strings to bytes, so conceptually, there's little difference.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 11 2014)
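The two spellings compared above can be tried side by side. This is only a sketch: it assumes an interpreter in which bytes supports the % operator (Python 3.5 or later), and build_packet is a hypothetical helper name, not something from the thread.

```python
def build_packet(body):
    """Return an ASCII header block followed by the raw body (hypothetical helper)."""
    # Explicit spelling: format as text, then encode the text part.
    explicit = ("Length: %i bytes\n" % len(body)).encode("ascii") + b"\n\n"
    # Interpolating spelling: format directly into bytes (needs Python 3.5+).
    interpolated = b"Length: %i bytes\n" % len(body) + b"\n\n"
    # Both spellings produce the same header bytes.
    assert explicit == interpolated
    return interpolated + body


packet = build_packet(b"\x00\xff\x10")
```

Either way the header ends up as ASCII bytes; the difference under discussion is where the conversion from number to ASCII digits happens and how visible it is in the source.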
tl;dr: At the end I'm volunteering to look at real code that is having porting problems. On Sat, 11 Jan 2014 17:33:17 +0100, "M.-A. Lemburg" <mal@egenix.com> wrote:
asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2).
At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would, then you'd write
    ...
    headers += asciistr('Length: %i bytes\n' % 123)
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
With PEP 460, you could write the above as:
    ...
    headers += b'Length: %i bytes\n' % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
IMO, that's more readable.
Both variants essentially do the same thing: they implicitly coerce ASCII text strings to bytes, so conceptually, there's little difference.
And if we are explicit:
    headers = u'Length: %i bytes\n' % 123
    headers += u'\n\n'
    body = b'...'
    socket.send(headers.encode('ascii') + body)
(I included the 'u' prefix only because we are talking about shared-codebase python2/python3 code.)
That looks pretty readable to me, and it is explicit about what parts are text and what parts are binary.
But of course we'd never do exactly that in any but the simplest of protocols and scripts. Instead we'd write a library that had one or more objects that modeled our wire/file protocol. For the text parts, the API would accept input as text strings. For the binary parts, it would accept input as bytes. Then, when reading or writing the data stream, we perform the appropriate conversions on the appropriate parts. Our library does a more complex analog of 'socket.send(headers.encode('ascii') + body)', one that understands the various parts and glues them together, encoding the text parts to the appropriate encoding (often-but-not-always ascii) as it does so.
And yes, I have written code that does this in Python3.
What I haven't done is written that code to run in both Python3 and Python2. I *think* the only missing thing I would need to back-port it is the surrogateescape error handler, but I haven't tried it. And I could probably conditionalize the code to use latin1 on python2 instead and get away with it.
And please note that email is probably the messiest of messy binary wire protocols. Not only do you have bytes and text mixed in the same data stream, with internal markers (in the text parts) that specify how to interpret the binary, including what encodings each part of that binary data is in for cases where that matters, you *also* have to deal with the possibility of there being *invalid* binary data mixed in with the ostensibly text parts, that you nevertheless are expected to both preserve and parse around.
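A minimal sketch of the kind of model object described above, where text parts are accepted as str, binary parts as bytes, and the conversion happens only at serialization time. WireMessage and its attributes are hypothetical names chosen for illustration; a real protocol library would be considerably more involved.

```python
class WireMessage:
    """Hypothetical model of a header-plus-body message (illustration only)."""

    def __init__(self, headers, body, header_encoding="ascii"):
        # Text parts: header names and values are str.
        self.headers = dict(headers)
        # Binary part: the body must already be bytes.
        if not isinstance(body, bytes):
            raise TypeError("body must be bytes")
        self.body = body
        self.header_encoding = header_encoding

    def serialize(self):
        # Build the text part as text, then encode it exactly once.
        lines = ["%s: %s\r\n" % (name, value) for name, value in self.headers.items()]
        head = "".join(lines) + "\r\n"
        return head.encode(self.header_encoding) + self.body


message = WireMessage({"Length": "3"}, b"\x00\xff\x01")
wire_bytes = message.serialize()
```

The design choice here is that callers never mix str and bytes themselves; the one place that knows the header encoding is serialize().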
When I started adding back binary support to the email package, I was really annoyed by the lack of certain string features in the bytes type. But in the end, it turned out to be really simple to instead think of the text-with-invalid-bytes parts as *text*-with-invalid-bytes (surrogateescaped bytes). Now, if I was designing from the ground up I'd store the stuff that was really binary as bytes in the model object instead of storing it as surrogateescaped text, but that problem is a consequence of how we got from there to here (python2-email to python3-email-that-didn't-handle-8bit-data to python3-email-that-works) rather than a problem with the python3 core data model.
So it seems like I'm with Nick and Antoine and company here. The byte-interpolation proposed by Antoine seems reasonable, but I don't see the *need* for the other stuff. I think that programs will be cleaner if the text parts of the protocol are handled *as text*.
On the other hand, Ethan's point that bytes *does* have text methods is true. However, other than the perfectly-sensible-for-bytes split, strip, and ends/startswith, I don't think I actually use any of them.
But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3?
That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems.
Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available.
--David
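The text-with-invalid-bytes idea can be sketched with the standard surrogateescape error handler. The header line below is made-up sample data; the point is only that a byte that is not valid ASCII survives a decode, a text-level edit, and a re-encode unchanged.

```python
# Made-up header line containing one byte (0xE9) that is not valid ASCII.
raw = b"Subject: caf\xe9 menu\r\n"

# Decode as ASCII; the invalid byte becomes a lone surrogate (U+DCE9).
text = raw.decode("ascii", errors="surrogateescape")

# Ordinary str methods now work on the header.
name, _, value = text.partition(":")
rebuilt_text = name.upper() + ":" + value

# Encoding with the same handler restores the original byte.
rebuilt = rebuilt_text.encode("ascii", errors="surrogateescape")
```

Encoding rebuilt_text with a strict handler would fail instead, which is how the escaped bytes announce themselves if they leak into code that expects clean text.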
On 12 January 2014 04:38, R. David Murray <rdmurray@bitdance.com> wrote:
But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3?
That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems.
Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available.
And, as has been the case for a long time, the PSF stands ready to help with funding credible grant proposals for Python 3 porting efforts. I believe some of the core devs (including David?) do freelance and contract work, so that's an option definitely worth considering if a project would like to support Python 3 but is having difficulty getting there with purely volunteer effort. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 12 Jan 2014 17:51:41 +1000, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 12 January 2014 04:38, R. David Murray <rdmurray@bitdance.com> wrote:
But! Our goal should be to help people convert to Python3. So how can we find out what the specific problems are that real-world programs are facing, look at the *actual code*, and help that project figure out the best way to make that code work in both python2 and python3?
That seems like the best way to find out what needs to be added to python3 or pypi: help port the actual code of the developers who are running into problems.
Yes, I'm volunteering to help with this, though of course I can't promise exactly how much time I'll have available.
And, as has been the case for a long time, the PSF stands ready to help with funding credible grant proposals for Python 3 porting efforts. I believe some of the core devs (including David?) do freelance and contract work, so that's an option definitely worth considering if a project would like to support Python 3 but is having difficulty getting there with purely volunteer effort.
Yes, I do contract programming, as part of Murray and Walker, Inc (web site coming soon but not there yet). And yes I currently have time available in my schedule. --David
On Sat, Jan 11, 2014 at 05:33:17PM +0100, M.-A. Lemburg wrote:
FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)
/incredulous I would really love to see you justify that claim. How do you use the Python 2 string type to make processing Unicode text easier? -- Steven
On 12 January 2014 02:33, M.-A. Lemburg <mal@egenix.com> wrote:
On 11.01.2014 16:34, Nick Coghlan wrote:
While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.
FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)
Right, I tried to capture that in http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an... by pointing out that there are two *very* different kinds of code to consider when discussing text modelling. Application code lives in a nice clean world of structured data, text data and binary data, with clean conversion functions for switching between them. Boundary code, by contrast, has to deal with the messy task of translating between them all.
The Python 2 text model is a convenient model for boundary code, because it implicitly allows switching between binary and text interpretations of a data stream, and that's often useful due to the way protocols and file formats are designed. However, that kind of implicit switching is thoroughly inappropriate for *application* code. So Python 3 switches the core text model to one where implicitly switching between the binary domain and the text domain is considered a *bad* thing, and we object strongly to any proposals which suggest blurring the boundaries again, since that is going back to a boundary code model rather than an application code one.
I've been saying for years that we may need a third type, but it has been nigh on impossible to get boundary code developers to say anything more useful than "I preferred the Python 2 model, that was more convenient for me". Yes, we know it was (we do maintain both of them, after all, and did the update for the standard library's own boundary code), but application developers are vastly more common, so boundary code developers lost out on that one and we need to come up with solutions that *respect* the Python 3 text model, rather than trying to change it back to the Python 2 one.
Seriously, Unicode has always caused heated discussions and I don't expect this to change in the next 5-10 years.
The point is: there is no 100% perfect solution either way and when you acknowledge this, things don't look black and white anymore, but instead full of colors :-)
It would be nice if more boundary code developers actually did that rather than coming out with accusatory hyperbole and pining for the halcyon days of Python 2 where the text model favoured their use case over that of normal application developers.
Python 3 forces people to actually use Unicode; in Python 2 they could easily avoid it. It's good to educate people on how it's used and the issues you can run into, but let's not forget that people are trying to get work done and we all love readable code.
PEP 460 just adds two more methods to the bytes object which come in handy when formatting binary data; I don't think it has potential to muddy the Python 3 text model, given that the bytes object already exposes a dozen other ASCII text methods :-)
I dropped my objections to PEP 460 once Antoine fixed it to respect the boundaries between binary and text data. It's now a pure binary interpolation proposal, and one I think is a fine idea - there's no implicit encoding or decoding involved, it's just a tool for manipulating binary data. That leaves the implicit encoding and decoding to the third party asciistr type, as it should be.
asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2).
Not quite - the idea of asciistr is that it is designed to be a *hybrid* type, like str was in Python 2. If it interacts with binary objects, it will give a binary result, if it interacts with text objects, it will give a text result. This makes it potentially suitable for use for constants in hybrid binary/text APIs like urllib.parse, allowing them to be implemented using a shared code path once again. The initial experimental implementation only works with 7 bit ASCII, but the UTF-8 caching in the PEP 393 implementation opens up the possibility of offering a non-strict mode in the future, as does the option of allowing arbitrary 8-bit data and disallowing interoperation with text strings in that case.
At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would,
Right, I suspect we have some overbroad PyUnicode_Check() calls in CPython that will need to be addressed before this substitution works seamlessly - that's one of the reasons I've been asking people to experiment with the idea since at least 2010 and let us know what doesn't work (nobody did though, until Benno agreed to try it out because it sounded like an interesting puzzle - I guess everyone else just found it easier to accuse us of being clueless idiots rather than considering trying to meet us halfway).
then you'd write
... headers += asciistr('Length: %i bytes\n' % 123)
If you're going to wait until *after* the formatting to do the conversion, you may as well just use encode explicitly:
    headers += ('Length: %i bytes\n' % 123).encode('ascii')
The advantage of asciistr is that it allows you to abstract away the format strings for the headers in a way explicit encoding doesn't allow:
    FMT_LENGTH = asciistr('Length: %i bytes\n')
    headers += FMT_LENGTH % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
You could do it inline as well:
    headers += asciistr('Length: %i bytes\n') % 123
But again, that doesn't offer a lot over simply explicitly encoding that fragment as ASCII.
With PEP 460, you could write the above as:
    ...
    headers += b'Length: %i bytes\n' % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
IMO, that's more readable.
At the cost of introducing an implicit encoding step again - it interpolates numbers into arbitrary binary sequences as ASCII text. That is thoroughly inappropriate in Python 3 - serialising semantically significant structured data (like numbers) as ASCII must always be opt in, either through environmental configuration (which has its own problems due to some undesirable default behaviour on POSIX systems - users will "opt in" to ASCII by mistake, not because they actually intended to), by passing it as an encoding argument, or by using a third party type like asciistr that is explicitly documented as only working with ASCII compatible data (whereas, with a couple of minor exceptions inherited from Python 2, the core bytes type is designed to work *correctly* with arbitrary binary data, and just has some *convenience* operations that assume ASCII data).
Both variants essentially do the same thing: they implicitly coerce ASCII text strings to bytes, so conceptually, there's little difference.
There's all the difference in the world: asciistr is a separate third party type that is deliberately designed to only work correctly with ASCII compatible binary data. If you use it for data that *isn't* ASCII compatible, then the resulting data corruption is due to using the wrong type, rather than being an implicit behaviour of a builtin Python type. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
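The hybrid behaviour described above (a binary result when combined with binary objects, a text result when combined with text objects) can be illustrated with a toy pure-Python class. AsciiHybrid is a hypothetical name and this is not the asciicompat implementation; it only handles concatenation and assumes strict 7-bit ASCII content.

```python
class AsciiHybrid(str):
    """Toy hybrid constant (hypothetical; not the asciicompat API)."""

    def __new__(cls, value):
        # Reject anything that is not 7-bit ASCII up front.
        value.encode("ascii")
        return super().__new__(cls, value)

    def __add__(self, other):
        if isinstance(other, bytes):
            # Binary operand: produce a binary result.
            return self.encode("ascii") + other
        # Text operand: behave like a plain str.
        return str.__add__(self, other)

    def __radd__(self, other):
        if isinstance(other, bytes):
            return other + self.encode("ascii")
        return other + str(self)


GET = AsciiHybrid("GET ")
binary_request = GET + b"/index"
text_request = GET + "/index"
```

Because the class is a separate, explicitly ASCII-only type, any misuse with non-ASCII data fails at construction time rather than being an implicit behaviour of a builtin type.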
On 01/11/2014 07:34 AM, Nick Coghlan wrote:
On 12 January 2014 01:15, M.-A. Lemburg wrote:
We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works . . .
We are not proposing a change to the unicode string type in any way.
We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX . . .
We are not asking for that.
bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
Because people that aren't happy with the current bytes type persistently refuse to experiment with writing their own extension type to figure out what the API should look like. Jamming speculative API design into the core text model without experimenting in a third party extension first is a straight up stupid idea.
True, if this were a new API; but it isn't, it's the Py2 str API that was stripped out. The one big difference is that if the result of %s (or %d or any other % code) is not in the 0-127 range, it errors out. -- ~Ethan~
On 2014-01-11, 18:09 GMT, you wrote:
We are NOT going back to the confusing incoherent mess that is the Python 2 model of bolting Unicode onto the side of POSIX . . .
We are not asking for that.
Yes, you do. Maybe not you personally, but the number of people on this list (for F...k sake, this list is for DEVELOPERS of the language, not some bloody users!) for whom the current suggestion is just a way to avoid Unicode and to keep alive all those broken scripts which barf at me all the time is quite non-zero, I am afraid. Best, Matěj
On Jan 11, 2014, at 10:34 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Yes, it bloody well does. The number of people who have told me that using Python 3 is what allowed them to finally understand how Unicode works vastly exceeds the number of wire protocol and file format devs that have complained about working with binary formats being significantly less tolerant of the "it's really like ASCII text" mindset.
FWIW as one of the people who it took Python3 to finally figure out how to actually use unicode, it was the absence of encode on bytes and decode on str that actually did it. Giving bytes a format method would not have affected that either way I don’t believe. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On 01/11/2014 12:45 PM, Donald Stufft wrote:
FWIW as one of the people who it took Python3 to finally figure out how to actually use unicode, it was the absence of encode on bytes and decode on str that actually did it. Giving bytes a format method would not have affected that either way I don’t believe.
My biggest hurdle was realizing that ASCII was an encoding. -- ~Ethan~
M.-A. Lemburg writes:
I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.
We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393).
BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life ?
Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes? Why do these "text chunks" need to be bytes in the first place? That's why we ask for use cases. AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back.
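The 'latin1' approach mentioned above relies on that codec mapping every byte value to exactly one code point, so a decode followed by an encode is lossless for arbitrary data. A short sketch with made-up sample data:

```python
# Every possible byte value survives a latin-1 round trip.
payload = bytes(range(256))
round_tripped = payload.decode("latin-1").encode("latin-1")

# Made-up header line in an unknown but ASCII-compatible encoding.
line = b"X-Note: caf\xe9\r\n"
text = line.decode("latin-1")

# str methods work on the ASCII parts; strip only the line ending explicitly.
is_note = text.startswith("X-Note:")
stripped = text.rstrip("\r\n").encode("latin-1")
```

The caveat is that the non-ASCII characters in text are only placeholders for bytes; they should not be displayed or compared as if latin-1 were the real encoding.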
On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:
We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393).
BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life ?
Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes?
Why do these "text chunks" need to be bytes in the first place? That's why we ask for use cases. AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back.
The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid. -- Terry Jan Reedy
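One visible effect of the 3.3+ internal model (PEP 393) is that the storage used per character depends on the widest character in the string. The sketch below assumes CPython 3.3 or later; sys.getsizeof reports implementation-specific numbers, so only the relative comparison is meaningful.

```python
import sys

# 1000 ASCII characters: stored compactly (one byte per character on CPython 3.3+).
ascii_text = "a" * 1000

# 1000 characters outside Latin-1 (EURO SIGN): needs wider storage per character.
wide_text = "\u20ac" * 1000

ascii_size = sys.getsizeof(ascii_text)
wide_size = sys.getsizeof(wide_text)
print(ascii_size, wide_size)
```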
On Sat, Jan 11, 2014 at 4:28 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:
We already *have* a type in Python 3.3 that provides text manipulations on arrays of 8-bit objects: str (per PEP 393).
BTW: I don't know why so many people keep asking for use cases. Isn't it obvious that text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk is part of life ?
Isn't it equally obvious that if you create or read all such ASCII-compatible chunks as (encoding='ascii', errors='surrogateescape') that you *don't need* string APIs for bytes?
Why do these "text chunks" need to be bytes in the first place? That's why we ask for use cases. AFAICS, reading and writing ASCII-compatible text data as 'latin1' is just as fast as bytes I/O. So it's not I/O efficiency, and (since in this model we don't do any en/decoding on bytes/str), it's not redundant en/decoding of bytes to str and back.
The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid.
-1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.
Daniel Holth writes:
-1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.
What do you mean "by default"? It was quite explicit in the code I posted, and it's the only reasonable thing to do with "text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk". If you leave it as bytes, it will barf as soon as you try to mix it with text even if it is pure ASCII!
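The approach described above can be sketched as follows. This is only an illustration under stated assumptions: the sample header line and the variable names are made up, and it is not the code that was posted earlier in the thread. It shows an ASCII-compatible chunk with one byte of unknown encoding surviving a round trip through str, and a strict encode failing on the same data.

```python
# Assumed sample: ASCII protocol text containing one byte (0xE9)
# whose encoding is unknown.
raw = b"Subject: caf\xe9\r\n"

# Decode explicitly with surrogateescape; 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("ascii", errors="surrogateescape")

# Ordinary str operations now work, including mixing with other text.
header, _, value = text.partition(": ")
reply = "X-Original-{}: {}".format(header, value)

# Encoding with the same error handler restores the original byte.
out = reply.encode("ascii", errors="surrogateescape")

# A strict encode of the same string fails, which is where the
# escaped byte would surface if the handler were forgotten.
try:
    reply.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The strict-encode failure at the end is the kind of late error the "-1" above is worried about: it appears where the data is written out, not where the odd byte came in.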
On 01/12/2014 12:39 PM, Stephen J. Turnbull wrote:
Daniel Holth writes:
-1 on adding more surrogateescapes by default. It's a pain to track down where the encoding errors came from.
What do you mean "by default"? It was quite explicit in the code I posted, and it's the only reasonable thing to do with "text data without known (but ASCII compatible) encoding or multiple different encodings in a single data chunk". If you leave it as bytes, it will barf as soon as you try to mix it with text even if it is pure ASCII!
Which is why some (including myself) are asking to be able to stay in bytes land and do any necessary interpolation there. No resulting unicode, no barfing. ;) -- ~Ethan~
Why not just use six.byte_format(fmt, *args)? It works on both Python2 and Python3 and accepts the numerical format specifiers, plus '%b' for inserting bytes and '%a' for converting text to ascii. Admittedly it doesn't exist yet, but it could and it would save a lot of arguing :) (Apologies to anyone who doesn't appreciate my mischievous sense of humour) Cheers, Mark.
On 01/12/2014 01:59 PM, Mark Shannon wrote:
Why not just use six.byte_format(fmt, *args)? It works on both Python2 and Python3 and accepts the numerical format specifiers, plus '%b' for inserting bytes and '%a' for converting text to ascii.
Sounds like the second best option!
Admittedly it doesn't exist yet, but it could and it would save a lot of arguing :)
:) -- ~Ethan~
On Sat, Jan 11, 2014 at 04:28:34PM -0500, Terry Reedy wrote:
The problem with some criticisms of using 'unicode in Python 3' is that there really is no such thing. Unicode in 3.0 to 3.2 used the old internal model inherited from 2.x. Unicode in 3.3+ uses a different internal model that is a game changer with respect to certain issues of space and time efficiency (and cross-platform correctness and portability). So at least some of the valid criticisms based on the old model are out of date and no longer valid.
While there are definitely performance savings (particularly of memory) regarding the FSR in Python 3.3, for the use-case we're talking about, Python 3.1 and 3.2 (and for that matter, 2.2 through 2.7) Unicode strings should be perfectly adequate. The textual data being used is ASCII, and the binary blobs are encoded to Latin-1, so everything is a subset of Unicode, namely U+0000 to U+00FF. That means there are no astral characters, and no behavioural differences between wide and narrow builds (apart from memory use). -- Steven
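A small check of the Latin-1 claim, as a sketch rather than anything from the original post: every possible byte value decodes to the code point with the same number, so the result stays within U+0000 to U+00FF and encodes back unchanged. The variable names are illustrative.

```python
# Every byte value 0x00..0xFF, standing in for an arbitrary binary blob.
blob = bytes(range(256))

# Latin-1 maps byte N to code point U+00NN, so decoding never fails.
text = blob.decode("latin-1")

# One character per byte, none above U+00FF (no astral characters).
length = len(text)
highest = max(ord(ch) for ch in text)

# Encoding back yields exactly the original bytes.
round_trip = text.encode("latin-1")
```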
On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:
I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen.
I completely agree with Stephen, that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.
Later in your post, you talk about the masses of broken encodings found everywhere (not just in your spam folder). How do the C lib standard string APIs help programmers to avoid broken encodings?
We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
On the contrary, it helps a lot. To the extent that people keep that clean bytes/text separation, it helps avoid bugs. It prevents problems like this Python 2 nonsense:

    s = "Straße"
    assert len(s) == 6  # fails
    assert s[5] == 'e'  # fails

Most problematic, printing s may (depending on your terminal settings) actually look like "Straße". Not only is having a clean bytes/text separation the pedantic thing to do, it's also the right thing to do nearly always (notwithstanding a few exceptions, allegedly).
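For comparison, here is a hedged sketch of the same example under Python 3, where the text and its encoded bytes are separate objects. The choice of UTF-8 for encoding and cp1252 for the mismatched decode is an assumption made for illustration; the original message does not name the encodings involved.

```python
s = "Straße"

# As text, the string has six characters and ends in 'e'.
assert len(s) == 6
assert s[5] == "e"

# As UTF-8 bytes it is seven bytes long, because 'ß' takes two bytes.
encoded = s.encode("utf-8")
assert len(encoded) == 7

# Decoding those bytes with the wrong codec (cp1252 assumed here)
# produces mojibake rather than the original text.
mojibake = encoded.decode("cp1252")
```

The two assertions that fail in the Python 2 snippet pass here because the length and the indexing are applied to characters, not to the encoded bytes.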
If you give programmers the choice they will - most of the time - do the right thing.
Unicode has been available in Python since version 2.2, more than a decade ago. And yet here we are, five point releases later (2.7), and the majority of text processing code is still using bytes. I'm not just pointing the finger at others. My 2.x only code almost always uses byte strings for text processing, and not always because it was old code I wrote before I knew better. The coders I work with do the same, only you can remove the word "almost". The code I see posted on comp.lang.python and Reddit and the tutor mailing list invariably uses byte strings. The beginners on the tutor list at least have an excuse that they are beginners. A quarter of a century after Unicode was first published, nearly 28 years since IBM first introduced the concept of "code pages" to PC users, and we still have programmers writing ASCII only string-handling code that, if it works with extended character sets, only works by accident. The majority of programmers still have *no idea* of even the most basic parts of Unicode. They've had the right tools for a decade, and ignored them. Python 3 forces the issue, and my code is better for it.
bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
I personally think it was a mistake to keep text operations like upper() and lower() on bytes. I think it will compound the mistake to add even more text operations. -- Steven
On 2014-01-06 13:24, Victor Stinner wrote:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
The PEP is a draft with open questions. First, I'm not sure that both bytes%args and bytes.format(args) are needed. The implementation of .format() is more complex, so why not only adding bytes%args? Then, the following points must be decided to define the complete list of supported features (formatters):
* Format integer to hexadecimal? ``%x`` and ``%X``
* Format integer to octal? ``%o``
* Format integer to binary? ``{!b}``
* Alignment?
* Truncating? Truncate or raise an error?
* format keywords? ``b'{arg}'.format(arg=5)``
* ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
* Floating point number?
* ``%i``, ``%u`` and ``%d`` formats for integer numbers?
* Signed number? ``%+i`` and ``%-i``
I'm thinking that the "i" format could be used for signed integers and the "u" for unsigned integers. The width would be the number of bytes. You would also need to have a way of specifying the endianness. For example:
    >>> b'{:<2i}'.format(256)
    b'\x01\x00'
    >>> b'{:>2i}'.format(256)
    b'\x00\x01'
Perhaps the width should default to 1 in the cases of "i" and "u":
    >>> b'{:i}'.format(-1)
    b'\xFF'
    >>> b'{:u}'.format(255)
    b'\xFF'
    >>> b'{:i}'.format(255)
    ValueError: ...
Interestingly, I've just been checking what exception is raised for some format types, and I got this:
    >>> '{:c}'.format(-1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: %c arg not in range(0x110000)
Should the exception be OverflowError (probably yes), and should the message say "%c"?
On Thu, 09 Jan 2014 03:54:13 +0000 MRAB <python@mrabarnett.plus.com> wrote:
I'm thinking that the "i" format could be used for signed integers and the "u" for unsigned integers. The width would be the number of bytes. You would also need to have a way of specifying the endianness.
For example:
    >>> b'{:<2i}'.format(256)
    b'\x01\x00'
    >>> b'{:>2i}'.format(256)
    b'\x00\x01'
The goal is not to add an alternative to the struct module. If you need binary packing/unpacking, just use struct. Regards Antoine.
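As a rough illustration of the "just use struct" suggestion, the fixed-width integer packing discussed above can already be written with the existing struct.pack() and int.to_bytes() APIs. This is a sketch, not part of the proposal; the variable names are illustrative. Note that in struct, "<" means little-endian and ">" means big-endian, so the byte order shown here follows struct's convention rather than the outputs in the quoted proposal.

```python
import struct

# 256 as a 2-byte signed integer in each byte order.
little = struct.pack("<h", 256)   # least significant byte first
big = struct.pack(">h", 256)      # most significant byte first

# One-byte signed and unsigned values via int.to_bytes().
minus_one = (-1).to_bytes(1, "big", signed=True)
max_unsigned = (255).to_bytes(1, "big")

# 255 does not fit in one signed byte, so to_bytes() raises
# OverflowError instead of silently truncating.
try:
    (255).to_bytes(1, "big", signed=True)
    overflowed = False
except OverflowError:
    overflowed = True
```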
On 06/01/2014 13:24, Victor Stinner wrote:
Hi,
bytes % args and bytes.format(args) are requested by Mercurial and Twisted projects. The issue #3982 was stuck because nobody proposed a complete definition of the "new" features. Here is a try as a PEP.
Apologies if this has already been said, but Terry Reedy attached a proof of concept to issue 3982 which might be worth taking a look at if you haven't yet done so. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
participants (35)
- "Martin v. Löwis"
- Antoine Pitrou
- Barry Warsaw
- Benjamin Peterson
- Brett Cannon
- Chris Angelico
- Daniel Holth
- Donald Stufft
- Eric Snow
- Eric V. Smith
- Ethan Furman
- Georg Brandl
- Glenn Linderman
- Hrvoje Niksic
- Kristján Valur Jónsson
- M.-A. Lemburg
- Mark Lawrence
- Mark Shannon
- matej@ceplovi.cz
- MRAB
- Nick Coghlan
- Paul Moore
- R. David Murray
- Serhiy Storchaka
- Skip Montanaro
- Stefan Behnel
- Stefan Krah
- Stephen Hansen
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Delaney
- Toshio Kuratomi
- Victor Stinner
- Xavier Morel