Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3? For example:

>>> import struct
>>> # In go bytes for four floats:
>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> # And out comes a speciously human-readable representation of those bytes
>>> my_packed_bytes
b'Why, Guido? Why?'
>>>
>>> # But it's just an illusion; it's truly bytes underneath!
>>> a_reasonable_representation = bytes((0x57, 0x68, 0x79, 0x2c, 0x20, 0x47, 0x75, 0x69, 0x64, 0x6f, 0x3f, 0x20, 0x57, 0x68, 0x79, 0x3f))
>>> my_packed_bytes == a_reasonable_representation
True
>>>
>>> this_also_seems_reasonable = b'\x57\x68\x79\x2c\x20\x47\x75\x69\x64\x6f\x3f\x20\x57\x68\x79\x3f'
>>> my_packed_bytes == this_also_seems_reasonable
True

I understand bytes literals were brought in to Python 3 to aid the transition from Python 2 to Python 3 [1], but this did not imply that `repr()` on a bytes object ought to display bytes mapping to ASCII characters as ASCII characters. I have not yet found a PEP describing why this decision was made. I am now seeking to put forth a PEP to change the printable representation of bytes to be simple, consistent, and easy to understand.

The current behavior, printing elements of bytes that map to printable ASCII characters as those characters, seems to violate multiple tenets of the Zen of Python [2]:

* "Explicit is better than implicit." This display happens without the user's explicit request.
* "Simple is better than complex." The printable representation of bytes is complex, surprising, and unintuitive: elements of bytes shall be displayed as their hexadecimal value, unless such a value maps to a printable ASCII character, in which case the character shall be displayed instead of the hexadecimal value. The underlying values of each element, however, are always integers. The printable representation of a single element of a bytes object will always be an integer representation. The simple thing is to show the hex value for every byte, unconditionally.
* "Special cases aren't special enough to break the rules." Implicit decoding of bytes to ASCII characters comes in handy only some of the time.
* "In the face of ambiguity, refuse the temptation to guess." Python is guessing that I want to see some bytes as ASCII characters. In the example above, though, what I gave Python was bytes from four floating point numbers.
* "There should be one-- and preferably only one --obvious way to do it." `bytes.decode('ascii', errors='backslashreplace')` already provides users the means to display ASCII characters among bytes, as a real string.

To be fair, there are two tenets of the Zen of Python that support the implicit display of ASCII characters in bytes:

* "Readability counts."
* "Although practicality beats purity."

In counterargument, though, I would say that the extra readability and practicality are only boosted in special cases (which are not special enough).

Much ado was (and continues to be) raised over Python 3 enforcing the distinction between (Unicode) strings and bytes. A lot of this resentment comes from Python programmers who do not yet appreciate the difference between bytes and text†, or from those who remain apathetic and prefer Python 2's it-works-'til-it-doesn't strings. This implicit displaying of ASCII characters in bytes ends up conflating the two data types even deeper in novice programmers' minds. In the example above, `my_packed_bytes` looks like a string. It reads like a string. But it is not a string. The ASCII characters are a lie, as evidenced when trying to access elements of a bytes instance:

>>> b'Why, Guido? Why?'[0]
87
>>> # Oh, perhaps you were expecting b'W'?

I find this behavior harmful to Python 3 advocacy, and novices and those accustomed to Python 2 find this yet another deterrent in the way of Python 3 adoption.

I would like to gauge the feasibility of a PEP to change the printable representation of bytes in CPython 3 to display all elements by their hexadecimal values, and only by their hexadecimal values.

Thanks,
Chris L.

† I write this as someone who, himself, didn't appreciate nor understand the difference between bytes, strings, and Unicode. I have Ned Batchelder [3] to thank and his illuminating "Pragmatic Unicode" presentation [4] for getting me on the right path.

[1]: http://legacy.python.org/dev/peps/pep-3112/#rationale
[2]: http://legacy.python.org/dev/peps/pep-0020/
[3]: http://nedbatchelder.com/
[4]: http://nedbatchelder.com/text/unipain.html
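
For anyone trying the opening example on a current interpreter, the following is a minimal sketch of the explicit views the message refers to. It assumes Python 3.5 or later (where `'backslashreplace'` also works for decoding) and pins the byte order with `'<'`, which the original `'ffff'` format leaves to the platform; the variable names are illustrative only.

```python
import struct

# '<4f' pins little-endian order; the original 'ffff' uses the native order,
# which produces the same bytes on little-endian machines.
floats = (3.544294848931151e-12, 1.853266900760489e+25,
          1.6215185358725202e-19, 0.9742483496665955)
packed = struct.pack('<4f', *floats)

# Indexing yields an int; slicing yields a bytes object of length 1.
first_element = packed[0]
first_slice = packed[0:1]

# Every element is an integer in range(256), whatever repr() displays.
as_integers = list(packed)

# An explicit mixed view as a real str: ASCII characters where possible,
# backslash escapes for everything else.
mixed_view = b'\xffWhy'.decode('ascii', errors='backslashreplace')

print(repr(packed))
print(first_element, first_slice)
print(as_integers)
print(mixed_view)
```

The point of the sketch is only that the integer view and the character view are both one call away; which of them `repr()` should pick is the question the thread goes on to argue.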

On 10 September 2014 17:04, Chris Lasher <chris.lasher@gmail.com> wrote:
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3?
Primarily because it's incredibly useful for debugging ASCII based binary formats (which covers many network protocols and file formats). Early (pre-release) versions of Python 3.0 didn't have this behaviour, and getting the raw integer dumps instead turned out to be *really* annoying in practice, so we decided the easier debugging justified the increased risk of creating incorrect mental models for users (especially those migrating from Python 2). The recently updated docs for binary sequences hopefully do a better job of explaining this "binary data with ASCII compatible segments" favouritism in their description of the bytes and bytearray methods: https://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operatio... (until a couple of months ago, those methods weren't documented separately, which I agree must have been incredibly confusing). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
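
To make the debugging argument concrete, here is a small sketch using a made-up HTTP/1.1 request (the path and host are placeholders, not taken from the thread). It simply prints the same bytes both ways so the two displays can be compared; `bytes.hex()` is assumed to be available (Python 3.5 or later).

```python
# A hypothetical ASCII-based protocol message held as bytes.
request = b'GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n'

# The default repr keeps the protocol keywords legible.
default_view = repr(request)

# The same data as hex digits only.
hex_view = request.hex()

print(default_view)
print(hex_view)
```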

On 10.09.2014 09:04, Chris Lasher wrote:
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3?
This wasn't forced. It's a simple consequence of turning the Python 2 8-bit string type into the Python 3 bytes type while keeping breakage to a pain level which doesn't have Python users skip Python 3 entirely ;-) Seriously, it doesn't help being purist over concepts that are used in a very pragmatic way in every day (programmer's) life. Even when being binary data, most such binary strings do contain encoded text characters and being able to quickly identify those as such helps in debugging, working with the data and writing it down in form of literals. A definite -1 from me on making repr(b"Hello World") harder to read than necessary. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2014)
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2014-09-19: PyCon UK 2014, Coventry, UK ... 9 days to go 2014-09-27: PyDDF Sprint 2014 ... 17 days to go 2014-09-30: Python Meeting Duesseldorf ... 20 days to go eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 10 September 2014 17:36, M.-A. Lemburg <mal@egenix.com> wrote:
On 10.09.2014 09:04, Chris Lasher wrote:
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3?
This wasn't forced. It's a simple consequence of turning the Python 2 8-bit string type into the Python 3 bytes type while keeping breakage to a pain level which doesn't have Python users skip Python 3 entirely ;-)
I believe you may be forgetting the pre-release period where there wasn't an immutable bytes type at all. It wasn't until PEP 3137 [1] was implemented that we got to the status quo for Python 3. Cheers, Nick. P.S. I haven't forgotten my promise to try to put together a recipe for a cleaner wrapper around "memoryview(data).cast('c')", but it may be a while before I get back to the idea. [1] http://www.python.org/dev/peps/pep-3137/ -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 10.09.2014 09:43, Nick Coghlan wrote:
On 10 September 2014 17:36, M.-A. Lemburg <mal@egenix.com> wrote:
On 10.09.2014 09:04, Chris Lasher wrote:
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3?
This wasn't forced. It's a simple consequence of turning the Python 2 8-bit string type into the Python 3 bytes type while keeping breakage to a pain level which doesn't have Python users skip Python 3 entirely ;-)
I believe you may be forgetting the pre-release period where there wasn't an immutable bytes type at all. It wasn't until PEP 3137 [1] was implemented that we got to the status quo for Python 3.
Oh, I do know. That was a path which was luckily quickly abandoned as default bytes type :-) Note that we now have PyByteArray C APIs in Python 3 for bytearray objects. PyBytes C APIs are (mostly) the Python 2 PyString C APIs - unlike what's listed in the PEP.
-- Marc-Andre Lemburg

On 10 September 2014 08:04, Chris Lasher <chris.lasher@gmail.com> wrote:
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3?
I'd argue this is symptomatic of something that got mentioned in the lengthy discussions around PEP 461: namely, that Python's bytestrings are really still very stringy. For example, they retain their 'upper' method, which is so totally bizarre in the context of bytes that it causes me to mentally segfault every time I see it:
>>> a = b'hi there'
>>> a.upper()
b'HI THERE'
As Nick mentioned, this is fundamentally because of protocols like HTTP/1.1, which are a weird hybrid of text-based and binary that is only simple if you assume ASCII everywhere. (Of course, HTTP does not assume ASCII everywhere, but that's because it's wildly underspecified). I doubt you'll get far with this proposal on this list, which is a shame because I think you have a point. There is an impedance mismatch between the Python community saying "Bytes are not text" and the fact that, wow, they really do look like they are sometimes! For what it's worth, Nick has made this comment:
Primarily because it's incredibly useful for debugging ASCII based binary formats (which covers many network protocols and file formats).
This is true, but it goes both ways: it makes it a lot *harder* to debug pure-binary network formats (like HTTP/2). I basically have to have an ASCII codepage in front of me to debug using the printed representation of a bytestring because I keep getting characters thrown into my nice hex output. Sadly, you can't please everyone.

On 10 September 2014 17:42, Cory Benfield <cory@lukasa.co.uk> wrote:
For what it's worth, Nick has made this comment:
Primarily because it's incredibly useful for debugging ASCII based binary formats (which covers many network protocols and file formats).
This is true, but it goes both ways: it makes it a lot *harder* to debug pure-binary network formats (like HTTP/2). I basically have to have an ASCII codepage in front of me to debug using the printed representation of a bytestring because I keep getting characters thrown into my nice hex output. Sadly, you can't please everyone.
memoryview.cast can be a potentially useful tool for that :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
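
As a rough illustration of what `memoryview(data).cast('c')` does, the sketch below compares plain iteration over a bytes object with iteration over the cast view. The sample data is made up, and how much this helps when reading binary frames is a judgment call; it changes what each element is, not how a whole bytes object is displayed.

```python
data = b'\x00\x01Hi'

# Casting to format 'c' makes each element a length-1 bytes object
# instead of an int.
chars = memoryview(data).cast('c')

as_ints = list(data)
as_single_bytes = list(chars)

print(as_ints)
print(as_single_bytes)
```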

On 10 September 2014 08:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
memoryview.cast can be a potentially useful tool for that :)
Sure, and so can binascii.hexlify (which is what I normally use). I'm not saying I can't debug my binary data, that would be a pretty weird complaint. I'm saying that I don't get to do debugging with a simple print statement when using the bytes type to do actual binary work, while those who are doing sort-of binary work do. I'm not actually personally bothered here: I do plenty of HTTP/1.1 work as well where this 'feature' is useful, and I'm experienced enough that I know what I'm getting into and I can work around it. My point is more that this adds further confusion to the notion that 'bytes are not text', when much of the language seems to go out of its way to pretend that they are in fact ASCII-encoded text. This is probably going to get wildly off-topic, so if you'd like to continue this chat Nick we should either take it off-list or to a new thread. =)
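
A minimal sketch of the `binascii.hexlify` approach mentioned above, next to the default repr. The nine bytes are invented for illustration and are not claimed to be a valid frame of any protocol; they were picked so that two of them happen to fall in the printable ASCII range.

```python
import binascii

# Hypothetical binary header; 0x28 and 0x25 are printable ASCII by accident.
frame = bytes([0x00, 0x00, 0x28, 0x01, 0x25, 0x00, 0x00, 0x00, 0x03])

# Default repr: '(' and '%' appear among the hex escapes.
default_view = repr(frame)

# hexlify returns a bytes object made of hex digits, two per input byte.
hex_view = binascii.hexlify(frame)

print(default_view)
print(hex_view)
```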

I agree with Chris Lasher's basic point, that the representation of bytes confusingly contradicts the idea that bytes are bytes. But it is not going to change.

On 9/10/2014 3:56 AM, Cory Benfield wrote:
On 10 September 2014 08:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
memoryview.cast can be a potentially useful tool for that :)
Sure, and so can binascii.hexlify (which is what I normally use).
See http://bugs.python.org/issue9951 to add bytes.hex or .tohex as more or less the inverse of bytes.fromhex or even have hex(bytes) work. This change *is* possible and I think we should pick one of the suggestions for 3.5.
-- Terry Jan Reedy
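
For reference, the round trip being asked for looks roughly like this on an interpreter that provides `bytes.hex()` (Python 3.5 and later); the sample value is arbitrary.

```python
# bytes.fromhex parses pairs of hex digits into a bytes object.
raw = bytes.fromhex('48656c6c6f')

# bytes.hex goes the other way and returns a str of hex digits.
digits = raw.hex()

print(raw)
print(digits)
```

Note that the result of `hex()` here is a plain string of digits, not a `b'\x..'` style literal.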

On Sep 10, 2014, at 08:42 AM, Cory Benfield wrote:
I doubt you'll get far with this proposal on this list, which is a shame because I think you have a point. There is an impedance mismatch between the Python community saying "Bytes are not text" and the fact that, wow, they really do look like they are sometimes!
That's the nature of wire protocols - they're like quantum particles, exhibiting both bytes-like and string-like behavior. You can't look too closely, and they have spooky action at a distance too. For the email protocols at least, you also have mind-crushing singularities. -Barry

On 10 September 2014 15:48, Barry Warsaw <barry@python.org> wrote:
That's the nature of wire protocols - they're like quantum particles, exhibiting both bytes-like and string-like behavior. You can't look too closely, and they have spooky action at a distance too. For the email protocols at least, you also have mind-crushing singularities.
Well, it's the nature of *many* wire protocols. Binary protocols are increasing in popularity at the moment, because it turns out that "kinda-text-like" wire protocols are a nightmare to parse correctly. Thus, the Python decision is great for SMTP and HTTP/1.1, and infuriating for things like HTTP/2.

Barry Warsaw writes:
On Sep 10, 2014, at 08:42 AM, Cory Benfield wrote:
I doubt you'll get far with this proposal on this list, which is a shame because I think you have a point. There is an impedance mismatch between the Python community saying "Bytes are not text" and the fact that, wow, they really do look like they are sometimes!
So does 0xDEADBEEF, but actually that's *not* text, it's a 32-bit pointer, conveniently invalid on most 32-bit architectures and very obvious when it shows up in a backtrace. Do you see an impedance mismatch in the C community because of that?

In fact, *all* bytes "look like text", because *you can't see them until they're converted to text by repr()*! This is the key to the putative "impedance mismatch" -- it's perceived as such when people don't distinguish the map from the territory.

The issue that sometimes it's easier to read hex than ASCII mixed with other stuff (hex escapes or Latin-1) is true enough, though. But it's not about an impedance mismatch, it's a question of what does *this* developer consider to be the convenient repr for *that* task. I just don't see hex-based use cases coming close to being as important as the convenience for those cases where the structure being imposed on some bytes is partly derived from English.

The current default repr is, I believe, the right default repr. That doesn't mean that it would be a terrible idea to provide other reprs in the stdlib (although it is after all a one-liner!)
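
One possible reading of the "one-liner" remark, as a sketch: a helper that renders every byte as two hex digits. The name `hex_dump` and the space-separated layout are assumptions made for illustration, not something proposed in the thread.

```python
def hex_dump(data):
    """Return the two-digit hex value of each byte, separated by spaces."""
    return ' '.join('%02x' % value for value in data)


sample = b'Hi\n\xff'
print(hex_dump(sample))
```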
That's the nature of wire protocols - they're like quantum particles, exhibiting both bytes-like and string-like behavior.
I find the analogy picturesque but unconvincing. Wire protocols are punctuated *by design* with European (mostly English) words, acronyms, and abbreviations, because (a) it's convenient for syntax to be mnemonic, (b) because the arbitrary standard for network streams is octets, and you can't fit much more than an English character into an octet, and (c) historically, English-speakers got there first (and had economic hegemony on their side, too).
You can't look too closely, and they have spooky action at a distance too. For the email protocols at least, you also have mind-crushing singularities.
Doom, gloom, DMARC, and boom! But I guess you were referring to From-stuffing, not From-munging.<wink/>

On 10 September 2014 17:59, Stephen J. Turnbull <stephen@xemacs.org> wrote:
So does 0xDEADBEEF, but actually that's *not* text, it's a 32-bit pointer, conveniently invalid on most 32-bit architectures and very obvious when it shows up in a backtrace. Do you see an impedance mismatch in the C community because of that?
In fact, *all* bytes "look like text", because *you can't see them until they're converted to text by repr()*! This is the key to the putative "impedance mismatch" -- it's perceived as such when people don't distinguish the map from the territory.
I apologise, I was insufficiently clear. I mean that interaction with the bytes type in Python has a lot of textual aspects to it. This is a *deliberate* decision (or at least the documentation makes it seem deliberate), and I can understand the rationale, but it's hard to be surprised that it leads developers astray. Also, while I'm being picky, 0xDEADBEEF is not a 32-bit pointer, it's a 32-bit something. Its type is undefined in that expression. It has a standard usage as a guard word, but still, let's not jump to conclusions here! I accept your core point, however, which I consider to be this:
The issue that sometimes it's easier to read hex than ASCII mixed with other stuff (hex escapes or Latin-1) is true enough, though. But it's not about an impedance mismatch, it's a question of what does *this* developer consider to be the convenient repr for *that* task.
This is definitely true, which I believe I've already admitted in this thread. I do happen to believe that having it be hex would provide a better pedagogical position ("you know this isn't text because it looks like gibberish!"), but that ship sailed a long time ago.

I originally wrote this late last night but realized today that I only sent this reply to Terry Reedy, not to python-ideas. (Apologies, Terry – I didn't mean to single you out with my rant!) I'm reposting it in full, below. Some of these ideas have already been raised by others and counter-arguments already posed. I still feel I have not seen some of these points directly addressed, namely, the unreasonableness of seeing bytes from floating point numbers as ASCII characters, and the sanity of the API I counter-propose.

Message now appears below:

On Wed, Sep 10, 2014 at 1:11 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I agree with Chris Lasher's basic point, that the representation of bytes confusingly contradicts the idea that bytes are bytes. But it is not going to change.
Unless printable representation of bytes objects appears as part of the language specification for Python 3, it's an implementation detail, thus, it is a candidate for change, especially if the BDFL wills it so. Consider me optimistic that we can change it, or I would have just posted yet another "Python 3 gets it all wrong" blog post to the web instead of writing this pre-proposal. :-)
On 9/10/2014 3:56 AM, Cory Benfield wrote:
On 10 September 2014 08:45, Nick Coghlan <ncoghlan@gmail.com> wrote:
memoryview.cast can be a potentially useful tool for that :)
Sure, and so can binascii.hexlify (which is what I normally use).
See http://bugs.python.org/issue9951 to add bytes.hex or .tohex as more or less the inverse of bytes.fromhex or even have hex(bytes) work. This change *is* possible and I think we should pick one of the suggestions for 3.5.
Here's the API Issue 9951 is proposing:

>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'Hello, World!'
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.tohex()
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'Hello, World!'
b'Hello, World!'
>>> b'Hello, World!'.tohex()
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'

I'll tell you what: here's the API of my counter-proposal:

>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.asciify()
b'Hello, World!'
>>> b'Hello, World!'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'Hello, World!'.asciify()
b'Hello, World!'

Here's the prose description of my counter-proposal: add a method to the bytes object called `.asciify`, that returns a printable representation of the bytes, where bytes mapping to printable ASCII characters are displayed as ASCII characters, and the remainder are given as hex codes. That is, .asciify() should round-trip a bytes literal. This frees up repr() to do what universally makes sense on a series of bytes: state the bytes!

Marc-Andre Lemburg said:
A definite -1 from me on making repr(b"Hello World") harder to read than necessary.
Okay, but a definite -1e6 from me on making my Python interpreter do this:

>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> my_packed_bytes
b'Why, Guido? Why?'

I do understand the utility of peering into ASCII text, but like Cory Benfield stated earlier:
I'm saying that I don't get to do debugging with a simple print statement when using the bytes type to do actual binary work, while those who are doing sort-of binary work do.
Does the inconvenience of having to explicitly call the .asciify() method on a bytes object justify the current behavior for repr() on a bytes object? The privilege of being lazy is obstructing the right to see what we've actually got in the bytes object, and is jeopardizing the very argument that "bytes are not strings".
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
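
The `.asciify()` counter-proposal described in this message can be approximated today with a free function. What follows is only a sketch under stated assumptions: `asciify` is a hypothetical helper, not an existing bytes method, and escaping the quote and backslash characters as hex is a choice made here so that the "round-trip a bytes literal" property holds for any input.

```python
import ast


def asciify(data):
    """Hypothetical stand-in for the proposed bytes.asciify() method.

    Printable ASCII is shown as characters; every other value, plus the
    single quote and the backslash, is shown as a \\xNN escape so that the
    result is always a valid bytes literal.
    """
    parts = []
    for value in data:
        if 0x20 <= value <= 0x7e and value not in (0x27, 0x5c):
            parts.append(chr(value))
        else:
            parts.append('\\x%02x' % value)
    return "b'" + ''.join(parts) + "'"


greeting = bytes.fromhex('48656c6c6f2c20576f726c6421')
text_form = asciify(greeting)

# Round trip: evaluating the text as a literal gives back the same bytes.
round_tripped = ast.literal_eval(text_form)

print(text_form)
print(round_tripped == greeting)
```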

On Sep 10, 2014, at 11:35, Chris Lasher <chris.lasher@gmail.com> wrote:
I'll tell you what: here's the API of my counter-proposal:
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.asciify()
b'Hello, World!'
>>> b'Hello, World!'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'Hello, World!'.asciify()
b'Hello, World!'
It strikes me that we should have both asciify and hexlify (or whatever we call them) so people can be explicit when debugging; the question then becomes which one repr calls. At which point it really is just a question of which group of developers (those working on HTTP/2.0 or those working on HTTP/1.1, for example) get to be "lazy" instead of explicit in their debugging.

The argument in favor of "asciify" is that the hex representation is more purist. The argument in favor of "hexlify" is that it makes Python 3.6 do the same thing as 3.0-3.5, and in fact 1.0-2.7 as well; people have had a few decades to get used to being lazy with mostly-ASCII protocols, while people have had a few decades to get used to being explicit with pure-binary protocols.

But maybe there's another potential concern that can help decide. A lot of novices using bytes get confused when they see b'\x05Hello' and ask questions about how to deal with that 8-character string. (You can see them all over StackOverflow, for example.) Of course the same people also ask how to get the b out of their string, etc.; obviously they need to be taught the difference between a bytes and its repr no matter what. Would switching to hexlify as a default help those people by forcing them to confront their confusion early, or slow them down by not letting them write a lot of simple code and learn other important stuff before getting to that confusion? I think the answer to that might be as compelling as the answer to which group of experienced developers (where the groups often overlap) deserves to be allowed to be lazy. But I don't have the answer...
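
The "which one does repr call" question can be experimented with without touching the built-in type. The sketch below uses a hypothetical subclass, `HexReprBytes`, whose only change is an all-hex `__repr__`; it is an illustration of the trade-off, not a proposal from the thread.

```python
class HexReprBytes(bytes):
    """Hypothetical bytes subclass whose repr() shows every byte as hex."""

    def __repr__(self):
        return "b'" + ''.join('\\x%02x' % value for value in self) + "'"


payload = b'\x05Hello'

default_view = repr(payload)               # mixed escapes and characters
hex_view = repr(HexReprBytes(payload))     # hex escapes only

# Either way the object holds six integers; only the display differs.
print(default_view, hex_view, len(payload))
```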
Here's the prose description of my counter-proposal: add a method to the bytes object called `.asciify`, that returns a printable representation of the bytes, where bytes mapping to printable ASCII characters are displayed as ASCII characters, and the remainder are given as hex codes. That is, .asciify() should round-trip a bytes literal. This frees up repr() to do what universally makes sense on a series of bytes: state the bytes!
Marc-Andre Lemburg said:
A definite -1 from me on making repr(b"Hello World") harder to read than necessary.
Okay, but a definite -1e6 from me on making my Python interpreter do this:
my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955) my_packed_bytes b'Why, Guido? Why?'
I do understand the utility of peering in to ASCII text, but like Cory Benfield stated earlier:
I'm saying that I don't get to do debugging with a simple print statement when using the bytes type to do actual binary work, while those who are doing sort-of binary work do.
Does the inconvenience of having to explicitly call the .asciify() method on a bytes object justify the current behavior for repr() on a bytes object? The privilege of being lazy is obstructing the right to see what we've actually got in the bytes object, and is jeopardizing the very argument that "bytes are not strings".
On Wed, Sep 10, 2014 at 10:51 AM, Cory Benfield <cory@lukasa.co.uk> wrote:
On 10 September 2014 17:59, Stephen J. Turnbull <stephen@xemacs.org> wrote:
So does 0xDEADBEEF, but actually that's *not* text, it's a 32-bit pointer, conveniently invalid on most 32-bit architectures and very obvious when it shows up in a backtrace. Do you see an impedence mismatch in the C community because of that?
In fact, *all* bytes "look like text", because *you can't see them until they're converted to text by repr()*! This is the key to the putative "impedence mismatch" -- it's perceived as such when people don't distinguish the map from the territory.
I apologise, I was insufficiently clear. I mean that interaction with the bytes type in Python has a lot of textual aspects to it. This is a *deliberate* decision (or at least the documentation makes it seem deliberate), and I can understand the rationale, but it's hard to be surprised that it leads developers astray.
Also, while I'm being picky, 0xDEADBEEF is not a 32-bit pointer, it's a 32-bit something. Its type is undefined in that expression. It has a standard usage as a guard word, but still, let's not jump to conclusions here!
I accept your core point, however, which I consider to be this:
The issue that sometimes it's easier to read hex than ASCII mixed with other stuff (hex escapes or Latin-1) is true enough, though. But it's not about an impedence mismatch, it's a question of what does *this* developer consider to be the convenient repr for *that* task.
This is definitely true, which I believe I've already admitted in this thread. I do happen to believe that having it be hex would provide a better pedagogical position ("you know this isn't text because it looks like gibberish!"), but that ship sailed a long time ago. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

On Wed, Sep 10, 2014 at 12:27 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
It strikes me that we should have both asciify and hexlify (or whatever we call them) so people can be explicit when debugging; the question then becomes which one repr calls.
Well said, and I agree both methods should be added. "Explicit is better than implicit," here, to me, trumps, "There should be one and only one obvious way to do it." Using these methods should be preferred when one needs to actually store the results. repr() is, to me, meant as a convenience function for the programmer to inspect her data structure, and is not meant to be relied upon as a shortcut to string representation in production code. But perhaps others here disagree and think repr() can and should be used in production code.
The argument in favor of "hexlify" is that the hex representation is more purist.
The argument in favor of "asciify" is that it makes Python 3.6 do the same thing as 3.0-3.5, and in fact 1.0-2.7 as well; people have had a few decades to get used to being lazy with mostly-ASCII protocols, while people have had a few decades to get used to being explicit with pure-binary protocols.
Again, very well said!
But maybe there's another potential concern that can help decide. A lot of novices using bytes get confused when they see b'\x05Hello'
I guess I wasn't clear: this is precisely why I've raised this issue. I promise I'm not trying to make life harder for folks using Python 3 to work with HTTP/1.1! I'm trying to lower the barrier of comprehension for those who have not used Python 3, and especially those who have never programmed before in their life. I teach these people, in my local Python meetup group, in Software Carpentry courses, and one-on-one with junior developers in my company.
Put yourself in the shoes of a beginner. Suppose Python did this:
>>> bytes(range(15))
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e'
To understand this, you have to learn just two things:
1. Bytes is a sequence of integers in the range 0 to 255.
2. How to translate base-10 integers into hexadecimal.
Now look through the eyes of a beginner at what Python actually does:
>>> bytes(range(15))
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e'
Now you have four things to explain!
1. Bytes is a sequence of integers in the range 0 to 255.
2. How to translate base-10 integers into hexadecimal.
3. How ASCII provides a mapping between some integers and English characters.
4. The conditions under which you'll see an ASCII character in place of a hexadecimal value versus the hexadecimal value itself.
It's easier to teach a student how to decode bytes into ASCII characters when the student can see the bytes, then the resulting ASCII characters in the string, in a one-to-one fashion. It is deeply confusing when they inspect the bytes in the REPL and already see the ASCII characters. The natural question is, "But I already see the character, so why do I have to decode it?!"
The current behavior of repr() on bytes puts an unfair cognitive burden on novices (and those of us working with "pure binary" files) compared to the gains to advanced programmers who already can comprehend the mapping of bytes to ASCII characters and can manage the mixture of the two. Think of the children! :-)
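For readers following the teaching argument, here is a minimal sketch (not from the original message) of the one-to-one view described above, using only current Python 3 built-ins. The variable names are illustrative.

```python
data = bytes(range(15))

# What repr() shows today: \t, \n and \r appear among the \xNN escapes.
print(repr(data))

# The underlying elements are plain integers...
as_ints = list(data)
print(as_ints)

# ...which can be shown in hexadecimal one element at a time.
as_hex = [f"{value:02x}" for value in data]
print(as_hex)

# Decoding is a separate, explicit step from bytes to text.
greeting = b"\x05Hello"
first_letter = greeting[1]            # an int, not a one-character string
text = greeting[1:].decode("ascii")   # a str
print(first_letter, text)
```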

On 11 Sep 2014 06:30, "Chris Lasher" <chris.lasher@gmail.com> wrote:
Put yourself in the shoes of a beginner.
We often compromise the beginner experience for backwards compatibility reasons, or to provide a better developer experience in the long run (cf. changing print from a statement to a builtin function). In this case, I *agree* the current behaviour is confusing, since it recreates some of the old "is it binary or is it text?" confusion that was more endemic in Python 2.
In Python 3, "bytes" is still a hybrid type that can hold:
* arbitrary binary data
* binary data that contains ASCII segments
A pure teaching language wouldn't make that compromise. Python 3 isn't a pure teaching language though - it's a pragmatic professional programming language that is *also* useful for teaching.
The problem is that for a lot of data it is *genuinely ambiguous* as to which of those it actually is (and it may change at runtime depending on the specific nature of the data).
Both the default repr and the literal form assume the "binary data with ASCII compatible segments", which aligns with the behaviour of the Python 2 str type. That isn't going to change in Python, especially since we actually *did* try it for a while (prior to the 3.0 release) and really didn't like it.
However, as others have noted, making it easier to get a pure hex representation is likely worth doing. There are lots of ways of doing that currently, but none that really qualify as "obvious".
Cheers, Nick.

On Wed, Sep 10, 2014 at 3:09 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
In Python 3, "bytes" is still a hybrid type that can hold:
* arbitrary binary data * binary data that contains ASCII segments
Let me be clear. Here are things this proposal does NOT include:
* Removing string-like methods from bytes
* Removing ASCII from bytes literals
Those have proven incredibly useful to the Python community. I appreciate that. This proposal does not take these behaviors away from bytes.
Here's what my proposal DOES include:
1. Adjust the behavior of repr() on a bytes instance such that only hexadecimal codes appear. The returned value would be the text displaying the bytes literal of hexadecimal codes that would reproduce the bytes instance.
2. Provide a method (suggested: "bytes.asciify") that returns a printable representation of bytes that replaces bytes whose values map to printable ASCII glyphs with the glyphs. The returned value would be the text displaying the bytes literal of ASCII glyphs and hexadecimal codes that would reproduce the bytes instance. If you liked the behavior of repr() on bytes in Python 3.0 through 3.4 (or 3.5), it's still available via this method call!
3. Optionally, provide a method (suggested: "bytes.hexlify") which implements the code for creating the printable representation of the bytes with hexadecimal values only, and call this method in bytes.__repr__.
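The three items above can be sketched as plain functions. This is only an illustration: the names `hexlify_repr` and `asciify_repr` are hypothetical stand-ins for the suggested methods, which do not exist on bytes, and the sketch assumes "asciify" would return exactly what repr() returns today.

```python
def hexlify_repr(data: bytes) -> str:
    # Hypothetical: a bytes literal built from \xNN escapes only (items 1 and 3).
    return "b'" + "".join(f"\\x{value:02x}" for value in data) + "'"


def asciify_repr(data: bytes) -> str:
    # Hypothetical: the hybrid ASCII/hex display (item 2), assumed here to be
    # identical to the current repr() of a bytes object.
    return repr(bytes(data))


sample = b"Hi\n"
print(hexlify_repr(sample))   # all-hex literal
print(asciify_repr(sample))   # today's hybrid literal

# Both texts are valid bytes literals that reproduce the same object.
assert eval(hexlify_repr(sample)) == eval(asciify_repr(sample)) == sample
```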
Both the default repr and the literal form assume the "binary data with ASCII compatible segments", which aligns with the behaviour of the Python 2 str type. That isn't going to change in Python, especially since we actually *did* try it for a while (prior to the 3.0 release) and really didn't like it.
Yes, more specifically you said:
Early (pre-release) versions of Python 3.0 didn't have this behaviour, and getting the raw integer dumps instead turned out to be *really* annoying in practice, so we decided the easier debugging justified the increased risk of creating incorrect mental models for users (especially those migrating from Python 2).
What you haven't said so far, however, and what I still don't know, is whether or not the core team has already tried providing a method on bytes objects à la the proposed .asciify() for projecting bytes as ASCII characters, and rejected that on the basis of it being too inconvenient for the vast majority of Python use cases. Did the core team try this before deciding that repr() should automatically write printable ASCII characters in place of hex values for bytes? So far, I've heard a lot of requests to keep the behavior because it's convenient. But how inconvenient is it to call bytes.asciify()? Are those not in favor of changing the behavior of repr() really going to sit behind the argument that the effort expended in typing ten more characters ought to guarantee that thousands of other programmers are going to have to figure out why there are letters in their bytes - or rather, how there are actually NOT letters in their bytes? And once again, we are talking about changing behavior that is unspecified by the Python 3 language specification. The language is gaining a reputation for confusing the two, however, as written by Armin Ronacher [1]:
Python is definitely a language that is not perfect. However I think what frustrates me about the language are largely problems that have to do with tiny details in the interpreter and less the language itself. These interpreter details however are becoming part of the language and this is why they are important.
I feel passionately this implicit ASCII-translation behavior should not propagate into further releases of CPython 3, and I don't want to see it become a de facto specification due to calcification. We're talking about the next 10 to 15 years. Nobody guaranteed the behavior of repr() so far. With the bytes.asciify() method (or whatever it may be called), we have a fair compromise, plus a more explicit specification of behavior of bytes in Python 3. In closing on this message, I want to say that I appreciate you hearing me out, Nick. I have appreciated your answers, and certainly the historical background. And thanks to the others who have contributed here. I appreciate you taking the time to discuss this.
Chris L.
[1] http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see/

FWIW, I find the ascii-mixed-with-hex difficult to parse, even though I know full-well what it is, and I could easily live with having a 'bytes.asciify' and 'bytes.hexlify' and have the __repr__ be something more consistent -- maybe a list of ints, that way nobody gets to be lazy! ;) -- ~Ethan~

On 11 September 2014 09:23, Chris Lasher <chris.lasher@gmail.com> wrote:
On Wed, Sep 10, 2014 at 3:09 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
In Python 3, "bytes" is still a hybrid type that can hold:
* arbitrary binary data * binary data that contains ASCII segments
Let me be clear. Here are things this proposal does NOT include:
* Removing string-like methods from bytes * Removing ASCII from bytes literals
Those have proven incredibly useful to the Python community. I appreciate that. This proposal does not take these behaviors away from bytes.
Here's what my proposal DOES include:
1. Adjust the behavior of repr() on a bytes instance such that only hexadecimal codes appear. The returned value would be the text displaying the bytes literal of hexadecimal codes that would reproduce the bytes instance.
This is not an acceptable change, for two reasons: 1. It's a *major* compatibility break. It breaks single source Python 2/3 development, it breaks doctests, it breaks user expectations. 2. It breaks the symmetry between the bytes literal format and their representation. It's important to remember we changed *from* a pure binary representation back to the current hybrid representation. It's not an accident or oversight, it's a deliberate design choice, and the reasons driving that original decision haven't changed in the last 8+ years.
2. Provide a method (suggested: "bytes.asciify") that returns a printable representation of bytes that replaces bytes whose values map to printable ASCII glyphs with the glyphs. The returned value would be the text displaying the bytes literal of ASCII glyphs and hexadecimal codes that would reproduce the bytes instance. If you liked the behavior of repr() on bytes in Python 3.0 through 3.4 (or 3.5), it's still available via this method call!
Except that method call won't be available in Python 2 code, and thus not usable in single source Python 2/3 code bases. That's still an incredibly important environment for people to be able to program in, and we're generally aiming to make the common subset *bigger* in Python 3.5 (e.g. by adding bytes.__mod__), not smaller.
3. Optionally, provide a method (suggested: "bytes.hexlify") which implements the code for creating the printable representation of the bytes with hexadecimal values only, and call this method in bytes.__repr__.
As per the discussion on issue 9951, it is likely Python 3.5 will offer bytes.hex() and bytearray.hex() methods (and perhaps even memoryview.hex()). I have also filed issue 22385 to propose allowing the "x" and "X" string formatting characters (for str.format and the format builtin) to accept arbitrary bytes-like objects. *Additive* changes like that to make it easier to work with pure binary data are relatively non-controversial (although there may still be some argument over *which* of those changes are worth including).
What you haven't said so far, however, and what I still don't know, is whether or not the core team has already tried providing a method on bytes objects à la the proposed .asciify() for projecting bytes as ASCII characters, and rejected that on the basis of it being too inconvenient for the vast majority of Python use cases.
That option was never really on the table, as once we decided to switch back to a hybrid ASCII representation, the obvious design model to use was the Python 2 str type, which has inherently hybrid behaviour, and uses the literal form for the "obj == eval(repr(obj))" round trip.
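The round trip mentioned here is easy to check in current Python 3. The sketch below uses `ast.literal_eval` instead of `eval` purely as a safer way to parse a literal; the helper name `round_trips` is illustrative.

```python
import ast


def round_trips(obj: bytes) -> bool:
    # repr() yields bytes-literal syntax; literal_eval parses it back.
    return ast.literal_eval(repr(obj)) == obj


samples = [b"abc", b"\x05Hello", b"it's", bytes(range(256))]
for sample in samples:
    # Show the start of each repr alongside the round-trip result.
    print(repr(sample)[:40], round_trips(sample))
```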
Did the core team try this before deciding that repr() should automatically write printable ASCII characters in place of hex values for bytes?
So far, I've heard a lot of requests to keep the behavior because it's convenient. But how inconvenient is it to call bytes.asciify()? Are those not in favor of changing the behavior of repr() really going to sit behind the argument that the effort expended in typing ten more characters ought to guarantee that thousands of other programmers are going to have to figure out why there's letters in their bytes – or rather, how there's actually NOT letters in their bytes?
No, we're not keeping it because it's convenient, we're keeping it because changing it would be a major compatibility break for (at best) a small reduction in beginner confusion. This change simply wouldn't provide sufficient benefit to justify the massive scale of the disruption it would cause. By contrast, adding better *binary* representation tools is easy (they pose no backwards compatibility challenges), and hence the preferred choice.
When teaching beginners, explaining the difference between:
>>> b"abc"
b'abc'
>>> b"abc".hex()
'616263'
is likely to be pretty straightforward (and will teach them the relevant concept of ASCII based vs hexadecimal representations for binary data). Consider the proposed alternative, which is to instead have to explain:
>>> b"abc"
b'\x61\x62\x63'
>>> b"abc".hex()
'616263'
>>> b"abc".ascii()
'abc'
That's 3 different representations when there are only two underlying concepts to be learned.
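For reference, the first pair of displays can be reproduced on any Python from 3.5 on, where `bytes.hex()` was added. The `.ascii()` call in the second listing is part of the hypothetical alternative and does not exist; the closest existing operation is an explicit `decode("ascii")`, which returns text rather than a literal. A short sketch:

```python
import binascii

data = b"abc"

# ASCII-based display: the default repr.
ascii_view = repr(data)

# Hexadecimal display: bytes.hex() returns a str (Python 3.5+).
hex_view = data.hex()

# The older spelling, binascii.hexlify(), returns bytes instead of str.
hex_bytes = binascii.hexlify(data)

# Turning bytes into text is a separate, explicit decoding step.
decoded = data.decode("ascii")

print(ascii_view, hex_view, hex_bytes, decoded)
```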
And once again, we are talking about changing behavior that is unspecified by the Python 3 language specification.
Something being underspecified in the language specification doesn't mean we have free rein to change it on a whim - sometimes it just means there's an assumed detail that hasn't been explicitly stated, but implementors of alternative implementations hadn't previously commented on the omission because they just followed the behaviour of CPython as the reference interpreter, or the requirements of the regression test suite. It's really necessary to look at the regression test suite, along with the written specification, as things that aren't part of the language spec are marked as "CPython only". Cases where it's CPython that is out of line when other interpreter implementations discover a compatibility issue get filed as CPython bugs (like the one where we sometimes get the operand precedence wrong if both sequences in a binary concatenation operation are implemented in C and the sequences are of different types). In this case, the underspecification relates to the fact that for builtin types that have dedicated syntax, the expectation is that their repr will use that dedicated syntax. This is not currently stated explicitly in the language reference (and I agree it probably should be), but it's tested extensively by the regression test suite, so it becomes a backwards compatibility constraint and an alternative interpreter compatibility constraint.
The language is gaining a reputation for confusing the two
It isn't "gaining" that reputation, it has always had it. The reputation for it is actually *reducing* over time, as we spend more time working with other implementations like PyPy, Jython and IronPython to get the CPython implementation details marked appropriately. (C)Python itself hasn't changed in this regard - we're just starting to do a better job of getting the wildly divergent groups of users actually talking to each other (with occasional fireworks as people have to come to grips with some radically different viewpoints on the nature and purpose of software development). In particular, we're starting to see folks that had previously focused almost entirely on the application programming and network service development side of Python (which tends to heavily abstract away the C layer) start to learn more about the system orchestration, hardware automation and scientific programming side of Python that lets you dive as deeply into the machine internals as you like. Most language runtimes only let you handle one or the other of those categories well - CPython is a relatively rare breed in supporting both, which *does* have consequences that make many of our design decisions seem weird to folks that aren't looking at *all* the use cases for the language in general, and the CPython runtime in particular.
however, as written by Armin Ronacher [1]:
Python is definitely a language that is not perfect. However I think what frustrates me about the language are largely problems that have to do with tiny details in the interpreter and less the language itself. These interpreter details however are becoming part of the language and this is why they are important.
I feel passionately this implicit ASCII-translation behavior should not propagate into further releases of CPython 3, and I don't want to see it become a de facto specification due to calcification.
It's not a de facto specification; it's a deliberate design choice, made before Python 3.0 was even released, and captured by the regression test suite.
We're talking about the next 10 to 15 years. Nobody guaranteed the behavior of repr() so far. With the bytes.asciify() method (or whatever it may be called), we have a fair compromise, plus a more explicit specification of behavior of bytes in Python 3.
Lots of folks don't like the fact that CPython doesn't completely hide the underlying memory model of C from the user - it's a deliberately leaky abstraction. The approach certainly has its downsides, but that leaky abstraction is what allows people to be confident that they can use Python as a convenient orchestration language, knowing that we will have easy access to the kind of low level control offered by C (and other systems programming languages) if we need it. This is why the scientific Python stack currently works best on CPython, with the ports to PyPy, Jython and IronPython (which all abstract away the C layer far more heavily) at varying stages of maturity - it's simply harder to do array oriented programming in those environments, since the language runtimes weren't built with that use case in mind (neither was CPython, but the relatively close coupling to the C layer enabled the capability anyway). Computers are complicated layers of messy and leaky abstractions. Working too hard at hiding those layers from the user just means developers can't bypass the abstraction easily when they know what they need for their current use case better than the original author of the language runtime. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 09/10/2014 05:09 PM, Nick Coghlan wrote:
On 11 Sep 2014 06:30, "Chris Lasher" <chris.lasher@gmail.com> wrote:
Put yourself in the shoes of a beginner.
We often compromise the beginner experience for backwards compatibility reasons, or to provide a better developer experience in the long run (cf. changing print from a statement to a builtin function).
In this case, I *agree* the current behaviour is confusing, since it recreates some of the old "is it binary or is it text?" confusion that was more endemic in Python 2.
In Python 3, "bytes" is still a hybrid type that can hold: * arbitrary binary data * binary data that contains ASCII segments
A pure teaching language wouldn't make that compromise. Python 3 isn't a pure teaching language though - it's a pragmatic professional programming language that is *also* useful for teaching.
The problem is that for a lot of data it is *genuinely ambiguous* as to which of those it actually is (and it may change at runtime depending on the specific nature of the data).
Considering "genuinely ambiguous", if it was a new feature we might quote... "In the face of ambiguity, refuse the temptation to guess." It's interesting that there is nothing in the zen rules about change or backward compatibility. If there were, it might have said... "Changing too much, too fast, is often too disruptive".
Both the default repr and the literal form assume the "binary data with ASCII compatible segments", which aligns with the behaviour of the Python 2 str type. That isn't going to change in Python, especially since we actually *did* try it for a while (prior to the 3.0 release) and really didn't like it.
However, as others have noted, making it easier to get a pure hex representation is likely worth doing. There are lots of ways of doing that currently, but none that really qualify as "obvious".
When working with hex data, I prefer the way hex editors do it. With pairs of hex digits separated by a space. "50 79 74 68 6f 6e" b'Python' But I'm not sure there's a way to make that work cleanly. :-/ Cheers, Ron

On 11 September 2014 11:36, Ron Adam <ron3200@gmail.com> wrote:
When working with hex data, I prefer the way hex editors do it. With pairs of hex digits separated by a space.
"50 79 74 68 6f 6e" b'Python'
But I'm not sure there's a way to make that work cleanly. :-/
I realised (http://bugs.python.org/issue22385) we could potentially support that style through the string formatting syntax, using the precision field to specify the number of "bytes per chunk", along with a couple of the other existing formatting flags in the mini-language:
format(b"xyz", "x") -> '78797a'
format(b"xyz", "X") -> '78797A'
format(b"xyz", "#x") -> '0x78797a'
format(b"xyz", ".1x") -> '78 79 7a'
format(b"abcdwxyz", ".4x") -> '61626364 7778797a'
format(b"abcdwxyz", "#.4x") -> '0x61626364 0x7778797a'
format(b"xyz", ",.1x") -> '78,79,7a'
format(b"abcdwxyz", ",.4x") -> '61626364,7778797a'
format(b"abcdwxyz", "#,.4x") -> '0x61626364,0x7778797a'
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
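The format() calls above describe a proposal, not behaviour that can be relied on in a released Python. As a rough sketch, the same outputs can be produced with an ordinary helper; `chunked_hex` and its parameters are hypothetical names chosen for this illustration. Separately, Python 3.8 and later accept a separator and chunk size directly in `bytes.hex()`.

```python
def chunked_hex(data: bytes, chunk: int, sep: str = " ",
                prefix: str = "", upper: bool = False) -> str:
    # Hypothetical helper mirroring the proposed "bytes per chunk" idea.
    pieces = []
    for start in range(0, len(data), chunk):
        piece = data[start:start + chunk].hex()
        pieces.append(prefix + (piece.upper() if upper else piece))
    return sep.join(pieces)


print(chunked_hex(b"xyz", 1))                          # like ".1x"
print(chunked_hex(b"abcdwxyz", 4, prefix="0x"))        # like "#.4x"
print(chunked_hex(b"abcdwxyz", 4, sep=",", prefix="0x"))  # like "#,.4x"

# Python 3.8+: bytes.hex() takes a separator and a bytes-per-separator count.
spaced = b"abcdwxyz".hex(" ", 4)
print(spaced)
```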

On 09/10/2014 09:40 PM, Nick Coghlan wrote:
On 11 September 2014 11:36, Ron Adam <ron3200@gmail.com> wrote:
When working with hex data, I prefer the way hex editors do it. With pairs of hex digits separated by a space.
"50 79 74 68 6f 6e" b'Python'
But I'm not sure there's a way to make that work cleanly. :-/
I realised (http://bugs.python.org/issue22385) we could potentially support that style through the string formatting syntax, using the precision field to specify the number of "bytes per chunk", along with a couple of the other existing formatting flags in the mini-language:
format(b"xyz", "x") -> '78797a' format(b"xyz", "X") -> '78797A' format(b"xyz", "#x") -> '0x78797a'
format(b"xyz", ".1x") -> '78 79 7a' format(b"abcdwxyz", ".4x") -> '61626364 7778797a' format(b"abcdwxyz", "#.4x") -> '0x61626364 0x7778797a'
format(b"xyz", ",.1x") -> '78,79,7a' format(b"abcdwxyz", ",.4x") -> '61626364,7778797a' format(b"abcdwxyz", "#,.4x") -> '0x61626364,0x7778797a'
Is there a way to go in the other direction? That is these other hex formats to bytes? Ron

On Thu, Sep 11, 2014 at 9:42 AM, Ron Adam <ron3200@gmail.com> wrote:
On 09/10/2014 09:40 PM, Nick Coghlan wrote:
On 11 September 2014 11:36, Ron Adam<ron3200@gmail.com> wrote:
When working with hex data, I prefer the way hex editors do it. With pairs of hex digits separated by a space.
"50 79 74 68 6f 6e" b'Python'
But I'm not sure there's a way to make that work cleanly. :-/
I realised (http://bugs.python.org/issue22385) we could potentially support that style through the string formatting syntax, using the precision field to specify the number of "bytes per chunk", along with a couple of the other existing formatting flags in the mini-language:
format(b"xyz", "x") -> '78797a' format(b"xyz", "X") -> '78797A' format(b"xyz", "#x") -> '0x78797a'
format(b"xyz", ".1x") -> '78 79 7a' format(b"abcdwxyz", ".4x") -> '61626364 7778797a' format(b"abcdwxyz", "#.4x") -> '0x61626364 0x7778797a'
format(b"xyz", ",.1x") -> '78,79,7a' format(b"abcdwxyz", ",.4x") -> '61626364,7778797a' format(b"abcdwxyz", "#,.4x") -> '0x61626364,0x7778797a'
Is there a way to go in the other direction? That is these other hex formats to bytes?
Yes, for the forms not prefixed with '0x':
bytes.fromhex('78797A')
b'xyz'
bytes.fromhex('78797a')
b'xyz'
bytes.fromhex('78 79 7a')
b'xyz'
bytes.fromhex('0x78797a')
Traceback (most recent call last):
  File "<ipython-input-11-1cc0b196bfc3>", line 1, in <module>
    bytes.fromhex('0x78797a')
ValueError: non-hexadecimal number found in fromhex() arg at position 0
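For the prefixed and comma-separated forms, a small wrapper around `bytes.fromhex()` is enough. The sketch below is one possible approach; `parse_hex` is a hypothetical helper, not a standard-library function.

```python
def parse_hex(text: str) -> bytes:
    # Treat commas like spaces, then drop an optional 0x/0X prefix per chunk.
    tokens = text.replace(",", " ").split()
    cleaned = "".join(
        token[2:] if token.lower().startswith("0x") else token
        for token in tokens
    )
    # bytes.fromhex() handles upper and lower case hex digits.
    return bytes.fromhex(cleaned)


print(parse_hex("0x78797a"))
print(parse_hex("78,79,7a"))
print(parse_hex("0x61626364 0x7778797a"))
```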

Nick Coghlan writes:
In Python 3, "bytes" is still a hybrid type that can hold: * arbitrary binary data * binary data that contains ASCII segments
A pure teaching language wouldn't make that compromise.
Of course it would, because nobody in their right mind would restrict a bytes type to the values 128-255! Yes, I know what you mean: it wouldn't use the hybrid representation for repr or for literals. My point is that even you are making the mistake of framing the issue as whether a bytes object is "arbitrary binary data" or "binary data that contains [readable] ASCII segments" as something inherent in the type. It's not! It's all about convenience of representation for particular applications, end of story. A repr that obfuscates the content in the "ASCII segment" set of applications *might* be preferable for teaching applications, but I'm not even sure of that.

On Thu, Sep 11, 2014 at 11:17:19AM +0900, Stephen J. Turnbull wrote:
It's all about convenience of representation for particular applications, end of story. A repr that obfuscates the content in the "ASCII segment" set of applications *might* be preferable for teaching applications, but I'm not even sure of that.
I think it is telling that hex editors, as a general rule, display byte data (i.e. the content of files) as both hex and ASCII. Real-world data is messy, and there are many cases where we want to hunt through an otherwise binary file looking for sequences of ASCII characters. Or vice versa. That's inherently mixing the concepts of text and bytes, but it needs to be done sometimes. I am sad that the default representation of bytes displays ASCII, but I am also convinced that as regrettable as that choice is, the opposite choice would be even more regrettable. So I will be satisfied by an obvious way to display the hexified representation of a byte-string, even if that way is not repr().
-- Steven
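The side-by-side display that hex editors use can be sketched in a few lines. This is an illustration only: `hexdump` is a hypothetical helper and the column layout is an arbitrary choice, not the format of any particular tool.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        # Left column: two hex digits per byte, space separated.
        hex_part = " ".join(f"{value:02x}" for value in row)
        # Right column: printable ASCII as-is, everything else as a dot.
        text_part = "".join(
            chr(value) if 32 <= value < 127 else "." for value in row
        )
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3 - 1)}  {text_part}")
    return "\n".join(lines)


print(hexdump(b"Python\x00\x01\xff and more bytes than fit on one row"))
```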

On Thu, Sep 11, 2014 at 4:35 AM, Chris Lasher <chris.lasher@gmail.com> wrote:
Unless printable representation of bytes objects appears as part of the language specification for Python 3, it's an implementation detail, thus, it is a candidate for change, especially if the BDFL wills it so.
So this is all about the output of repr(), right? The question then is: How important is backward compatibility with repr? Will there be code breakage? I've generally considered repr() to be exclusively "take this object and turn it into something a human can use", with nothing promised about what the exact string returned is. Something like this description: """Any value, debug style. Do not rely on the exact formatting; how the result looks can vary depending on locale, phase of the moon or anything else the lfun::_sprintf() method implementor wanted for debugging.""" (Replace lfun::_sprintf() with __repr__() for that to make sense for Python.) If repr's meant to be treated that way, then there's no problem changing bytes.__repr__ to produce hex-only output in 3.5 or 3.6. If it's NOT meant to be treated as opaque (and I've seen some Stack Overflow posts where people are parsing repr()), then what is the guarantee?
ChrisA

On 11 September 2014 10:42, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Sep 11, 2014 at 4:35 AM, Chris Lasher <chris.lasher@gmail.com> wrote:
Unless printable representation of bytes objects appears as part of the language specification for Python 3, it's an implementation detail, thus, it is a candidate for change, especially if the BDFL wills it so.
So this is all about the output of repr(), right? The question then is: How important is backward compatibility with repr? Will there be code breakage?
I changed PyBytes_Repr to inject a 'Z' after the opening quote to see just how extensive the damage would be in CPython's own regression test suite (as I belatedly realised the magnitude of the impact may not be obvious to everyone, so I figured it was worth quantifying):
355 tests OK.
17 tests failed:
    test_base64 test_bytes test_configparser test_ctypes test_doctest test_file_eintr test_hash test_io test_pdb test_pickle test_pickletools test_re test_smtpd test_subprocess test_sys test_telnetlib test_tools
1 test altered the execution environment:
    test_warnings
17 tests skipped:
    test_curses test_devpoll test_kqueue test_msilib test_ossaudiodev test_smtpnet test_socketserver test_startfile test_timeout test_tk test_ttk_guionly test_urllib2net test_urllibnet test_winreg test_winsound test_xmlrpc_net test_zipfile64
I ran those tests without enabling *any* of the optional resources (and the Windows specific tests won't run on my machine).
Folks should keep in mind that when we talk about "hybrid ASCII binary data", we're not just talking about things like SMTP and HTTP 1.1 and debugging network protocol traffic, we're also talking about things like URLs, filesystem paths, email addresses, environment variables, command line arguments, process names, passing UTF-8 encoded data to GUI frameworks, etc that are often both ASCII compatible and human readable *by design*.
Note the error message produced here with my modified build:
$ ./python -c 'import os; print(os.listdir(b"foo"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: b'Zfoo'
And this directory listing:
$ ./python -c 'import os; print(os.listdir(b"Mac"))'
[b'ZIDLE', b'ZMakefile.in', b'ZTools', b'ZREADME.orig', b'ZPythonLauncher', b'ZIcons', b'ZREADME', b'ZExtras.install.py', b'ZBuildScript', b'ZResources']
Python 3 carved out a whole lot of text processing operations and said "these are clearly and unambiguously working with text data, we shouldn't confuse them with binary data manipulation". The remaining ambiguity in the behaviour of the Python 3 bytes type is largely inherent in the way computers currently work - there's no getting away from it.
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
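To see where such bytes values come from in everyday code, here is a small sketch using a stock interpreter rather than the modified build described above. It creates a temporary directory so that it does not depend on any real directory such as "Mac"; the helper name and file names are illustrative.

```python
import os
import tempfile


def list_names_as_bytes(filenames):
    """Create empty files in a temporary directory and list them via a bytes path."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in filenames:
            with open(os.path.join(tmp, name), "w"):
                pass
        # Passing a bytes path makes os.listdir() return bytes names.
        return sorted(os.listdir(os.fsencode(tmp)))


names = list_names_as_bytes(["README", "Makefile.in"])
print(names)

# os.fsencode() converts a text path to its bytes form.
encoded = os.fsencode("BuildScript")
print(encoded)
```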

On 11 September 2014 11:57, Nick Coghlan <ncoghlan@gmail.com> wrote:
Folks should keep in mind that when we talk about "hybrid ASCII binary data", we're not just talking about things like SMTP and HTTP 1.1 and debugging network protocol traffic, we're also talking about things like URLs, filesystem paths, email addresses, environment variables, command line arguments, process names, passing UTF-8 encoded data to GUI frameworks, etc that are often both ASCII compatible and human readable *by design*.
Note the error message produced here with my modified build:
$ ./python -c 'import os; print(os.listdir(b"foo"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: b'Zfoo'
And this directory listing:
$ ./python -c 'import os; print(os.listdir(b"Mac"))'
[b'ZIDLE', b'ZMakefile.in', b'ZTools', b'ZREADME.orig', b'ZPythonLauncher', b'ZIcons', b'ZREADME', b'ZExtras.install.py', b'ZBuildScript', b'ZResources']
After posting that version, I realised actually making the proposed change would be similarly straightforward, and better illustrate the core problem with the idea:
$ ./python -c 'import os; print(os.listdir(b"foo"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: b'\x66\x6f\x6f'
$ ./python -c 'import os; print(os.listdir(b"Mac"))'
[b'\x49\x44\x4c\x45', b'\x4d\x61\x6b\x65\x66\x69\x6c\x65\x2e\x69\x6e', b'\x54\x6f\x6f\x6c\x73', b'\x52\x45\x41\x44\x4d\x45\x2e\x6f\x72\x69\x67', b'\x50\x79\x74\x68\x6f\x6e\x4c\x61\x75\x6e\x63\x68\x65\x72', b'\x49\x63\x6f\x6e\x73', b'\x52\x45\x41\x44\x4d\x45', b'\x45\x78\x74\x72\x61\x73\x2e\x69\x6e\x73\x74\x61\x6c\x6c\x2e\x70\x79', b'\x42\x75\x69\x6c\x64\x53\x63\x72\x69\x70\x74', b'\x52\x65\x73\x6f\x75\x72\x63\x65\x73']
vs
$ python3 -c 'import os; print(os.listdir(b"foo"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'foo'
$ python3 -c 'import os; print(os.listdir(b"Mac"))'
[b'IDLE', b'Makefile.in', b'Tools', b'README.orig', b'PythonLauncher', b'Icons', b'README', b'Extras.install.py', b'BuildScript', b'Resources']
It's more than just a matter of backwards compatibility, it's a matter of asymmetry of impact when the two possible design choices are wrong:
* Using a hex based repr when an ASCII based repr is more appropriate is utterly unreadable
* Using an ASCII based repr when a hex based repr is more appropriate is somewhat confusing
This kind of thing is why the original "binary representation by default" design didn't survive the Python 3.0 development cycle - once people started trying it out, it quickly became evident that it was the wrong approach to take (if I remember the original implementation correctly, the repr was along the lines of "bytes([1, 2, 3, 4])" since there wasn't a bytes literal until after PEP 3137 was implemented).
Making hex representations of binary data easier to produce is still a good idea, though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
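A minimal sketch of the comparison above, run against a throwaway directory instead of the CPython source tree. The two file names are made up for illustration, and bytes.hex() only exists on Python 3.5 and later, so it was not available when this thread was written.

```python
import os
import tempfile

# Create a scratch directory holding two illustrative (made-up) file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Makefile.in", "README"):
        open(os.path.join(tmp, name), "w").close()

    # Passing a bytes path makes os.listdir() return bytes names.
    names = sorted(os.listdir(os.fsencode(tmp)))

# Default repr: ASCII-compatible bytes are shown as characters.
ascii_view = [repr(n) for n in names]

# Explicit hex view of the same data (bytes.hex() needs Python 3.5+).
hex_view = [n.hex() for n in names]

print(ascii_view)
print(hex_view)
```

The hex view has to be requested explicitly, which is the trade-off the two bullet points describe: the default stays readable for file names, and the hex form is one call away when the data really is binary.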

Let me start with this, from Nick:
This is not an acceptable change, for two reasons:
1. It's a *major* compatibility break. It breaks single source Python 2/3 development, it breaks doctests, it breaks user expectations.
Okay, breaking doctests, I can understand the negative impact. I'm willing to give up because of this. So, on account of the fragility of doctests, I suppose, yes, this proposal will never go through. And I feel that's a shame, because I was never a fan of doctests, either. Regarding user expectations, I've already stated, yes this continues with the expectations of experienced users, who won't stumble when they see ASCII in their bytes. For all other users, though, this behavior otherwise violates the principle of least astonishment. ("Why are there English characters in my bytes?")
2. It breaks the symmetry between the bytes literal format and their representation.
Symmetry is already broken for bytes literal format because the user is allowed to enter hex codes, even if they map onto printable ASCII characters:
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'Hello, World!'
On Wed, Sep 10, 2014 at 7:35 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
After posting that version, I realised actually making the proposed change would be similarly straightforward, and better illustrate the core problem with the idea:
$ ./python -c 'import os; print(os.listdir(b"foo"))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: b'\x66\x6f\x6f'
$ ./python -c 'import os; print(os.listdir(b"Mac"))'
[b'\x49\x44\x4c\x45', b'\x4d\x61\x6b\x65\x66\x69\x6c\x65\x2e\x69\x6e', b'\x54\x6f\x6f\x6c\x73', b'\x52\x45\x41\x44\x4d\x45\x2e\x6f\x72\x69\x67', b'\x50\x79\x74\x68\x6f\x6e\x4c\x61\x75\x6e\x63\x68\x65\x72', b'\x49\x63\x6f\x6e\x73', b'\x52\x45\x41\x44\x4d\x45', b'\x45\x78\x74\x72\x61\x73\x2e\x69\x6e\x73\x74\x61\x6c\x6c\x2e\x70\x79', b'\x42\x75\x69\x6c\x64\x53\x63\x72\x69\x70\x74', b'\x52\x65\x73\x6f\x75\x72\x63\x65\x73']
You passed bytes – not an ASCII string – as an argument to os.listdir; it gave you back bytes, not ASCII strings. You _consented_ to bytes when you put the b'Mac' in there; therefore, you are responsible for decoding those bytes. Yes, all text must be represented as bytes to a computer, but not all bytes represent text.
It's more than just a matter of backwards compatibility, it's a matter of asymmetry of impact when the two possible design choices are wrong:
* Using a hex based repr when an ASCII based repr is more appropriate is utterly unreadable * Using an ASCII based repr when a hex based repr is more appropriate is somewhat confusing
I prefer to unframe it from ASCII. The decision is (well, was) between:
* A representation that is always accurate but sometimes inconvenient
versus
* A representation that is convenient when it is accurate, but is not always accurate (and is inconvenient when it's inaccurate).
Earlier, Nick, you wrote
What you haven't said so far, however, and what I still don't know, is whether or not the core team has already tried providing a method on bytes objects à la the proposed .asciify() for projecting bytes as ASCII characters, and rejected that on the basis of it being too inconvenient for the vast majority of Python use cases.
That option was never really on the table, as once we decided to switch back to a hybrid ASCII representation, the obvious design model to use was the Python 2 str type, which has inherently hybrid behaviour, and uses the literal form for the "obj == eval(repr(obj))" round trip.
obj == eval(repr(obj)) round-trip behavior is not violated by the proposed change:
>>> r = repr(b'Hello, World!')
"b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'"
>>> b'Hello, World!' == eval(r)
True
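The round-trip claim can be checked on an unmodified interpreter with a stand-in for the proposed repr. This is only a sketch: hex_escaped_repr is a hypothetical helper written for this illustration, not anything in CPython, and ast.literal_eval is used in place of eval because it only accepts literals.

```python
import ast


def hex_escaped_repr(data):
    """Hypothetical stand-in for the proposed repr: every byte as \\xNN."""
    return "b'" + "".join("\\x{:02x}".format(byte) for byte in data) + "'"


original = b"Hello, World!"
proposed = hex_escaped_repr(original)

# Both spellings are valid bytes literals that evaluate to equal objects.
assert ast.literal_eval(proposed) == original
assert ast.literal_eval(repr(original)) == original

print(proposed)
```

Both forms survive the round trip, so the round trip alone does not decide between them; the disagreement in the thread is about which form is the better default to read.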

On 11 September 2014 16:47, Chris Lasher <chris.lasher@gmail.com> wrote:
You passed bytes – not an ASCII string – as an argument to os.listdir; it gave you back bytes, not ASCII strings. You _consented_ to bytes when you put the b'Mac' in there; therefore, you are responsible for decoding those bytes.
Yes, all text must be represented as bytes to a computer, but not all bytes represent text.
Yes, we know. We debated this 8 years ago. We *tried it* 8 years ago. We found it to provide a horrible developer experience, so we changed it back to be closer to the way Python 2 works. Changing the default representation of binary data to something that we already decided didn't work (or at least its very close cousin) is not up for discussion. Providing better tools for easily producing hexadecimal representations is an excellent idea. Making developers explicitly request non-horrible output when working with binary APIs on POSIX systems is not. You can keep saying "but it's potentially confusing when it really is arbitrary binary data", and I'm telling you *that doesn't matter*. The consequences of flipping the default are worse, because it means defaulting to unreadable output from supported operating system interfaces, which *will* leak through to API consumers and potentially even end users. That's not OK, which means the status quo is the lesser of two evils. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Wed, Sep 10, 2014 at 11:47:01PM -0700, Chris Lasher wrote:
Regarding user expectations, I've already stated, yes this continues with the expectations of experienced users, who won't stumble when they see ASCII in their bytes. For all other users, though, this behavior otherwise violates the principle of least astonishment. ("Why are there English characters in my bytes?")
That's easy to explain: "Because Python gets used for many programming tasks where ASCII text is mixed in with arbitrary bytes, as a convenience for those programmers, as well as backward compatibility with the Bad Old Days, Python defaults to show bytes as if they were ASCII text. But they're not, of course, under the hood they're just numbers between 0 and 255, or 0 and 0xFF in hexadecimal, and you can see that by calling the hexify() method." There are plenty of other areas of Python where decisions are made that are not necessarily ideal from a teaching standpoint. We cope. -- Steven

On 11 Sep 2014, at 02:42, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Sep 11, 2014 at 4:35 AM, Chris Lasher <chris.lasher@gmail.com> wrote:
Unless printable representation of bytes objects appears as part of the language specification for Python 3, it's an implementation detail, thus, it is a candidate for change, especially if the BDFL wills it so.
So this is all about the output of repr(), right? The question then is: How important is backward compatibility with repr? Will there be code breakage?
It’s likely to break doctests at least. Wichert.

Chris Lasher writes:
Okay, but a definite -1e6 from me on making my Python interpreter do this:
>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> my_packed_bytes
b'Why, Guido? Why?'
If you actually have a struct, why aren't you wrapping your_packed_bytes in a class that validates the struct and displays it nicely formatted? Or, alternatively, simply replaces __repr__?
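One way to read the suggestion above is a thin wrapper whose __repr__ always shows hex. The sketch below is an assumption about what such a class could look like: the name PackedFloats, the fixed '<4f' layout and the space-separated output are all invented for illustration.

```python
import struct


class PackedFloats:
    """Hypothetical wrapper around 16 bytes holding four 32-bit floats."""

    FORMAT = "<4f"  # assumed layout: four little-endian single-precision floats

    def __init__(self, raw):
        # Validate the size up front so a bad buffer fails early.
        if len(raw) != struct.calcsize(self.FORMAT):
            raise ValueError("expected %d bytes" % struct.calcsize(self.FORMAT))
        self.raw = bytes(raw)

    def values(self):
        # Decode the payload into a tuple of four Python floats.
        return struct.unpack(self.FORMAT, self.raw)

    def __repr__(self):
        # Always show hex, never ASCII, regardless of the byte values.
        hex_pairs = " ".join("{:02x}".format(byte) for byte in self.raw)
        return "PackedFloats({})".format(hex_pairs)


packed = PackedFloats(b"Why, Guido? Why?")
print(repr(packed))
```

With this in place the interactive prompt shows the hex pairs for this one type, while plain bytes objects elsewhere keep their usual repr.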
I do understand the utility of peering in to ASCII text, but like Cory Benfield stated earlier:
I'm saying that I don't get to do debugging with a simple print statement when using the bytes type to do actual binary work, while those who are doing sort-of binary work do.
Does the inconvenience of having to explicitly call the .asciify() method on a bytes object justify the current behavior for repr() on a bytes object?
Yes. A choice must be made, because a type has only one repr, and there's no syntax for choosing it. It's a question of whose use case is going to become more convenient and whose becomes less so, and either choice is *justified*. Which is *preferred* is a judgment call. Your judgment doesn't rule, and it definitely doesn't have a weight of 1e6. At this point even Guido's judgment is likely to be dominated by backward compatibility, no matter how much he regrets the necessity. (But I would bet he doesn't regret it at all.)
The privilege of being lazy is obstructing the right to see what we've actually got in the bytes object, and is jeopardizing the very argument that "bytes are not strings".
It does not jeopardize the *fact* that bytes are not strings. People who don't understand that have a fundamental confusion, and they're going to want bytes to DWIM when mixed with str in their applications. And they'll complain when their bytes don't DWIM, and they'll complain even more when the repr "obstructs the right to see what they've actually got in the bytes object", which (in their applications) is a stream containing tokens borrowed from English using the ASCII coded character set. I agree with you that they're wrong. My point is that they're wrong in such a way that they won't understand that bytes aren't text strings any better merely because they become harder to read. They *know* that there's a text string in there because they put it there! Cory Benfield wrote and Chris Lasher quoted:
Also, while I'm being picky, 0xDEADBEEF is not a 32-bit pointer, it's a 32-bit something. Its type is undefined in that It has a standard usage as a guard word, but still, let's not jump to conclusions here!
I was not jumping to conclusions. I was setting up a scenario. The actual use case is something like "int *pi = 0xDEADBEEF;". The point is that C programmers are deliberately choosing a guard word that is readable when printed as hexadecimal, and also satisfies certain restrictions when those bytes are used as a pointer. That doesn't mean that they are confusing text with pointers. The same is true for Python's repr for bytes.
I do happen to believe that having it be hex would provide a better pedagogical position ("you know this isn't text because it looks like gibberish!"), but that ship sailed a long time ago.
I don't think a gibberish repr will confuse people who think that bytes are text in their application. They'll just get more peeved at Python 3, because they know that there's readable text in there, and Python 3 "obstructs their right to see what's actually in the bytes object". Regards,

On Wed, Sep 10, 2014 at 6:50 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Chris Lasher writes:
Okay, but a definite -1e6 from me on making my Python interpreter do this:
>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> my_packed_bytes
b'Why, Guido? Why?'
If you actually have a struct, why aren't you wrapping your_packed_bytes in a class that validates the struct and displays it nicely formatted? Or, alternatively, simply replaces __repr__?
The point was to demonstrate that although text must be represented by bytes, not all bytes represent text. I have the bytes from four 32-bit floating point numbers, but repr() displays these bytes as ASCII characters. It looks like I wrote "Why, Guido? Why?" illustrating how implicit behavior that's "usually helpful" can be rather unhelpful. Explicitly showing the hexadecimal values is always accurate, because bytes are always bytes.
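The same demonstration can be run in the other direction, starting from the text-looking bytes. This sketch assumes a little-endian layout ('<4f'), whereas the original example used the platform's native byte order.

```python
import struct

raw = b"Why, Guido? Why?"

# Interpret the 16 bytes as four little-endian 32-bit floats.
floats = struct.unpack("<4f", raw)

# Packing the same four values reproduces the original bytes exactly.
repacked = struct.pack("<4f", *floats)

# An explicit hex view of the same 16 bytes.
hex_view = " ".join("{:02x}".format(byte) for byte in raw)

print(floats)
print(repacked)
print(hex_view)
```

Nothing in the bytes object records whether it came from floats or from text, which is why the repr has to commit to one display without knowing the intent.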
Your judgment doesn't rule, and it definitely doesn't have a weight of 1e6.
I meant the "-1e6" as a cheeky response, not as a reflection of the importance of my opinions or ideas.

On Wed, Sep 10, 2014 at 12:04:23AM -0700, Chris Lasher wrote: [...]
I would like to gauge the feasibility of a PEP to change the printable representation of bytes in CPython 3 to display all elements by their hexadecimal values, and only by their hexadecimal values.
I'm very sympathetic to this "purity" approach. I too consider it a shame that the repr of byte-strings in Python 3 pretends to be ASCII-ish[1], regardless of the source of the bytes. Alas, not only do we have backward compatibility to consider -- there are now five versions of Python 3 where bytes display as ASCII -- but practicality as well. There are many use-cases where human-readable ASCII bytes are embedded inside otherwise binary bytes. To my regret, I don't think purity arguments are strong enough to justify a change.
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
b'Abc' --> '\x41\x62\x63'
I can see use-cases for both. After less than two minutes of thought, it seems to me that perhaps the most obvious APIs for these two different representations are:
hex(b'Abc') --> '0x416263'
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
[1] They're not *strictly* ASCII, since ASCII doesn't support ordinal values above 127.
-- Steven
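Both target representations can already be produced with existing tools, as the sketch below shows. It does not use hex() on bytes or a 'hexescapes' codec, since those are proposals in this message rather than existing features; the variable names are illustrative.

```python
import binascii

data = b"Abc"

# Form 1: a single hex number with a 0x prefix; hexlify keeps leading zero bytes.
as_number = "0x" + binascii.hexlify(data).decode("ascii")

# Form 2: one \xNN escape per byte, as in a bytes literal.
as_escapes = "".join("\\x{:02x}".format(byte) for byte in data)

print(as_number)   # 0x416263
print(as_escapes)  # \x41\x62\x63
```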

On 09/10/2014 12:57 PM, Steven D'Aprano wrote:
On Wed, Sep 10, 2014 at 12:04:23AM -0700, Chris Lasher wrote:
[...]
I would like to gauge the feasibility of a PEP to change the printable representation of bytes in CPython 3 to display all elements by their hexadecimal values, and only by their hexadecimal values.
I'm very sympathetic to this "purity" approach. I too consider it a shame that the repr of byte-strings in Python 3 pretends to be ASCII-ish[1], regardless of the source of the bytes. Alas, not only do we have backward compatibility to consider -- there are now five versions of Python 3 where bytes display as ASCII -- but practicality as well. There are many use-cases where human-readable ASCII bytes are embedded inside otherwise binary bytes. To my regret, I don't think purity arguments are strong enough to justify a change.
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
b'Abc' --> '\x41\x62\x63'
I can see use-cases for both. After less than two minutes of thought, it seems to me that perhaps the most obvious APIs for these two different representations are:
hex(b'Abc') --> '0x416263'
This would require a change in the documented (https://docs.python.org/3/library/functions.html#hex) behavior of hex(), which I think is quite a big deal for a relatively special case.
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
This, OTOH, looks elegant (avoids a new method) and clear (no doubt about the returned type) to me. +1 Wolfgang

On Wed, Sep 10, 2014 at 6:54 AM, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
I can see use-cases for both. After less than two minutes of thought, it seems to me that perhaps the most obvious APIs for these two different representations are:
hex(b'Abc') --> '0x416263'
This would require a change in the documented (https://docs.python.org/3/library/functions.html#hex) behavior of hex(), which I think is quite a big deal for a relatively special case.
I agree that we should leave hex alone.
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
This, OTOH, looks elegant (avoids a new method) and clear (no doubt about the returned type) to me. +1
Another +0.5 for me. I think this is quite elegant and reasonable. I'm not sure it needs to be unicode though. Perhaps it's too early for me, but does turning that into a unicode string make sense?

On Thu, Sep 11, 2014 at 12:24 AM, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
This, OTOH, looks elegant (avoids a new method) and clear (no doubt about the returned type) to me. +1
Another +0.5 for me. I think this is quite elegant and reasonable. I'm not sure it needs to be unicode though. Perhaps it's too early for me, but does turning that into a unicode string make sense?
It's becoming text. What other type makes more sense than a text string? ChrisA

On 10 September 2014 15:24, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
This, OTOH, looks elegant (avoids a new method) and clear (no doubt about the returned type) to me. +1
Another +0.5 for me. I think this is quite elegant and reasonable. I'm not sure it needs to be unicode though. Perhaps it's too early for me, but does turning that into a unicode string make sense?
It's easy enough to do by hand:
print(''.join("\\x{:02x}".format(c) for c in b'Abc'))
\x41\x62\x63
And you get any other format you like, just by changing the format string in there, or the string you join on:
print(':'.join("{:02x}".format(c) for c in b'Abc'))
41:62:63
Not every one-liner needs to be a builtin... Paul

On Wed, Sep 10, 2014 at 03:37:03PM +0100, Paul Moore wrote:
On 10 September 2014 15:24, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
This, OTOH, looks elegant (avoids a new method) and clear (no doubt about the returned type) to me. +1
Another +0.5 for me. I think this is quite elegant and reasonable. I'm not sure it needs to be unicode though. Perhaps it's too early for me, but does turning that into a unicode string make sense?
repr() returns a unicode string. hex(), oct() and bin() return unicode strings. The intent is to return a human-readable representation of a binary object, that is, a string from a bytes object. So, yes, a unicode string makes sense.
It's easy enough to do by hand:
print(''.join("\\x{:02x}".format(c) for c in b'Abc'))
\x41\x62\x63
And you get any other format you like, just by changing the format string in there, or the string you join on:
print(':'.join("{:02x}".format(c) for c in b'Abc'))
41:62:63
Not every one-liner needs to be a builtin...
Until your post just now, there has probably never been anyone anywhere who wanted to display b'Abc' as "41:62:63", and there probably never will be again. For such a specialised use-case, it's perfectly justified to reject a request for such a colon-delimited hex function with "not every one-liner...".
But displaying bytes as either "0x416263" or "\x41\x62\x63" hex format is not so obscure, especially if you consider pedagogical uses. For that, your one-liner is hardly convenient: you have to manually walk the bytes objects, extracting one byte at a time, format it, debug the inevitable mistake in the formatting code *wink*, then join all the substrings. The complexity of the code (little as it is for an expert) is enough to distract from the pedagogical message, and not quite trivially simple to get right if you aren't a heavy user of string formatting codes.
Converting byte strings to a hex representation is quite a common thing to do, as witnessed by the (at least) five different ways to do it:
http://bugs.python.org/msg226731
none of which are really obvious or convenient. Hence the long-outstanding request for this. (At least four years now.)
-- Steven

On 11 September 2014 08:30, Steven D'Aprano <steve@pearwood.info> wrote:
print(':'.join("{:02x}".format(c) for c in b'Abc'))
41:62:63
Not every one-liner needs to be a builtin...
Until your post just now, there has probably never been anyone anywhere who wanted to display b'Abc' as "41:62:63", and there probably never will be again. For such a specialised use-case, it's perfectly justified to reject a request for such a colon-delimited hex function with "not every one-liner...".
So I picked a bad example. Sorry. Someone (sorry, I can't recall who) did ask for
print(' '.join("{:02x}".format(c) for c in b'Abc'))
41 62 63
My point is that a simple pattern is flexible, a specific method has to pick one "obvious" representation, and there have been a number of representations discussed here. Paul
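The flexibility of the pattern can be made concrete by wrapping it in a small function. This is a sketch only: hex_join and its sep and prefix parameters are hypothetical names chosen for this illustration, not an existing or proposed API.

```python
def hex_join(data, sep="", prefix=""):
    """Format each byte as two hex digits; sep and prefix are illustrative knobs."""
    return sep.join("{}{:02x}".format(prefix, byte) for byte in data)


print(hex_join(b"Abc", prefix="\\x"))  # \x41\x62\x63
print(hex_join(b"Abc", sep=":"))       # 41:62:63
print(hex_join(b"Abc", sep=" "))       # 41 62 63
```

Every representation mentioned so far in the thread falls out of the same two parameters, which is the difficulty a single built-in method would face: it has to pick one of them as the default.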

On 09/11/2014 12:30 AM, Steven D'Aprano wrote:
Until your post just now, there has probably never been anyone anywhere who wanted to display b'Abc' as "41:62:63", and there probably never will be again. For such a specialised use-case, it's perfectly justified to reject a request for such a colon-delimited hex function with "not every one-liner...".
Make that two. :) Space or colon delimited is far easier to read than no separator, or the noise of a \x separator. -- ~Ethan~

On Wed, Sep 10, 2014 at 01:54:17PM +0200, Wolfgang Maier wrote:
On 09/10/2014 12:57 PM, Steven D'Aprano wrote:
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
b'Abc' --> '\x41\x62\x63'
I can see use-cases for both. After less than two minutes of thought, it seems to me that perhaps the most obvious APIs for these two different representations are:
hex(b'Abc') --> '0x416263'
This would require a change in the documented (https://docs.python.org/3/library/functions.html#hex) behavior of hex(), which I think is quite a big deal for a relatively special case.
Any new functionality is going to require a change to the documentation. Changing hex() is no more of a big deal than adding a new method. I'd call it *less* of a big deal.
In Python 2, hex() calls the dunder method __hex__. That has been removed in Python 3. Does anyone know why?
As I see it, hex() returns a hexadecimal representation of its argument as a string. That's exactly what we want in this case: we're taking an object which represents a block of integer-values, and want a human-readable hexadecimal representation. So hex() is, or ought to be, the obvious solution.
As an alternative, if there was an easy, obvious way to convert the bytes b'Abc' (or b'\x41\x62\x63') to the int 4285027 (or 0x416263), then the obvious solution would be hex(int(b'Abc')) and it would require no changes to hex(). Of course the int() built-in isn't the right way to do this.
-- Steven
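For the bytes-to-int route, int.from_bytes (available since Python 3.2) is one existing spelling. The sketch below assumes big-endian order, which is what makes b'Abc' read as 0x416263, and also shows a limitation of going through an integer.

```python
data = b"Abc"

# Interpret the bytes as one big-endian unsigned integer.
value = int.from_bytes(data, "big")
print(value)       # 4285027
print(hex(value))  # 0x416263

# Going through an int drops leading zero bytes, so the length is lost.
padded = b"\x00Abc"
padded_hex = hex(int.from_bytes(padded, "big"))
print(padded_hex)  # 0x416263
```

Because two different byte strings can map to the same hex text this way, the integer route suits values that really are numbers more than arbitrary binary blobs.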

On 09/10/2014 11:55 PM, Steven D'Aprano wrote:
On Wed, Sep 10, 2014 at 01:54:17PM +0200, Wolfgang Maier wrote:
On 09/10/2014 12:57 PM, Steven D'Aprano wrote:
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
b'Abc' --> '\x41\x62\x63'
I can see use-cases for both. After less than two minutes of thought, it seems to me that perhaps the most obvious APIs for these two different representations are:
hex(b'Abc') --> '0x416263'
This would require a change in the documented (https://docs.python.org/3/library/functions.html#hex) behavior of hex(), which I think is quite a big deal for a relatively special case.
Any new functionality is going to require a change to the documentation.
Changing hex() is no more of a big deal than adding a new method. I'd call it *less* of a big deal.
In Python 2, hex() calls the dunder method __hex__. That has been removed in Python 3. Does anyone know why?
__hex__ and __oct__ were removed in favor of __index__. __index__ returns the number as an integer (if possible to do so without conversion from, say, float or complex or ...). __hex__ and __oct__ did the same, and were redundant. -- ~Ethan~

On Mon, Sep 15, 2014 at 07:24:17AM -0700, Ethan Furman wrote:
In Python 2, hex() calls the dunder method __hex__. That has been removed in Python 3. Does anyone know why?
__hex__ and __oct__ were removed in favor of __index__. __index__ returns the number as an integer (if possible to do so without conversion from, say, float or complex or ...). __hex__ and __oct__ did the same, and were redundant.
No, __hex__ returned a string. It could be used to implement (say) a floating point hex representation, or hex() of bytes.
py> (42).__hex__()
'0x2a'
In Python 2, hex() only had to return a string, and accepted anything with a __hex__ method. In Python 3, it can only be used on objects which are int-like, which completely rules out conversions of non-ints to hexadecimal notation.
py> class MyList(list):
...     def __hex__(self):
...         return '[' + ', '.join(hex(a) for a in self) + ']'
...
py> l = MyList([21, 16, 256, 73])
py> hex(l)
'[0x15, 0x10, 0x100, 0x49]'
Pity. I don't suppose anyone would support bringing back __hex__?
-- Steven
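The Python 3 side of this can be sketched as follows. Register is a hypothetical int-like class added for contrast; MyList is the class from the message, whose __hex__ method hex() no longer looks up.

```python
class Register:
    """Hypothetical int-like type: hex() reaches it through __index__."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


class MyList(list):
    def __hex__(self):
        # Ignored by the built-in hex() in Python 3; only callable directly.
        return "[" + ", ".join(hex(a) for a in self) + "]"


print(hex(Register(42)))  # 0x2a

items = MyList([21, 16, 256, 73])
try:
    hex(items)
    hex_error = None
except TypeError as exc:
    # A list is not int-like, so hex() refuses it.
    hex_error = exc
print(type(hex_error).__name__)

# Calling the method explicitly still works, it is just an ordinary method now.
print(items.__hex__())
```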

On 09/15/2014 07:44 AM, Steven D'Aprano wrote:
On Mon, Sep 15, 2014 at 07:24:17AM -0700, Ethan Furman wrote:
In Python 2, hex() calls the dunder method __hex__. That has been removed in Python 3. Does anyone know why?
__hex__ and __oct__ were removed in favor of __index__. __index__ returns the number as an integer (if possible to do so without conversion from, say, float or complex or ...). __hex__ and __oct__ did the same, and were redundant.
No, __hex__ returned a string. It could be used to implement (say) a floating point hex representation, or hex() of bytes.
Right, sorry. I had the wrong return type in mind. Now you have to use the hex format codes.
I don't suppose anyone would support bringing back __hex__?
I don't think we need another formatting operator. We already have % and .format() -- do we still have string templates? -- ~Ethan~

On 16 Sep 2014 01:17, "Ethan Furman" <ethan@stoneleaf.us> wrote:
I don't think we need another formatting operator. We already have % and .format() -- do we still have string templates?
Yes, but those were designed for a specific use case where the templates are written by language translators rather than software developers. The current suggestion on the issue tracker is to add __format__ to bytes/bytearray/memoryview with a suitable symbolic mini-language to control the formatting details. Thrashing out a mini-language design will likely require a PEP, though. Cheers, Nick.
-- ~Ethan~
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

On 09/15/2014 04:47 PM, Nick Coghlan wrote:
The current suggestion on the issue tracker is to add __format__ to bytes/bytearray/memoryview with a suitable symbolic mini-language to control the formatting details.
PEP 461 specifically did not add back __format__ to bytes/bytearrays. I think a PEP is appropriate to reverse that decision. -- ~Ethan~

On 09/15/2014 08:43 PM, Ethan Furman wrote:
On 09/15/2014 04:47 PM, Nick Coghlan wrote:
The current suggestion on the issue tracker is to add __format__ to bytes/bytearray/memoryview with a suitable symbolic mini-language to control the formatting details.
PEP 461 specifically did not add back __format__ to bytes/bytearrays. I think a PEP is appropriate to reverse that decision.
That's different. PEP 461 excluded them because it was talking about bytes.format(). bytes.__format__() would be much easier to deal with, because its result must be unicode (str in 3.x). I don't think just adding bytes/bytearray.__format__() per se requires a PEP. It's not a very radical addition, similar to datetime.__format__(). But I wouldn't be opposed to a PEP to decide on the specifics of the mini-language that bytes.__format__() supports. Eric.

On 09/16/2014 06:02 AM, Eric V. Smith wrote:
On 09/15/2014 08:43 PM, Ethan Furman wrote:
On 09/15/2014 04:47 PM, Nick Coghlan wrote:
The current suggestion on the issue tracker is to add __format__ to bytes/bytearray/memoryview with a suitable symbolic mini-language to control the formatting details.
PEP 461 specifically did not add back __format__ to bytes/bytearrays. I think a PEP is appropriate to reverse that decision.
That's different. PEP 461 excluded them because it was talking about bytes.format(). bytes.__format__() would be much easier to deal with, because its result must be unicode (str in 3.x).
I don't think just adding bytes/bytearray.__format__() per se requires a PEP. It's not a very radical addition, similar to datetime.__format__(). But I wouldn't be opposed to a PEP to decide on the specifics of the mini-language that bytes.__format__() supports.
So the difference is:
b'Hello, %s' % some_bytes_var --> b'Hello, <whatever>'
whilst
b'Hello, {}'.format(some_uni_var) --> u'Hello, <whatever>'
(Yes, I remember unicode == str, I was just being explicit ;)
That would certainly go along with the idea that `format` is for strings.
-- ~Ethan~

TBH I've lost track what this thread is about, but if any actionable proposals come out, please send them my way in the form of a PEP. -- --Guido van Rossum (python.org/~guido)

On Sep 16, 2014, at 8:04, Ethan Furman <ethan@stoneleaf.us> wrote:
On 09/16/2014 06:02 AM, Eric V. Smith wrote:
On 09/15/2014 08:43 PM, Ethan Furman wrote:
On 09/15/2014 04:47 PM, Nick Coghlan wrote:
The current suggestion on the issue tracker is to add __format__ to bytes/bytearray/memoryview with a suitable symbolic mini-language to control the formatting details.
PEP 461 specifically did not add back __format__ to bytes/bytearrays. I think a PEP is appropriate to reverse that decision.
That's different. PEP 461 excluded them because it was talking about bytes.format(). bytes.__format__() would be much easier to deal with, because its result must be unicode (str in 3.x).
I don't think just adding bytes/bytearray.__format__() per se requires a PEP. It's not a very radical addition, similar to datetime.__format__(). But I wouldn't be opposed to a PEP to decide on the specifics of the mini-language that bytes.__format__() supports.
So the difference is:
b'Hello, %s' % some_bytes_var --> b'Hello, <whatever>'
whilst
b'Hello, {}'.format(some_uni_var) --> u'Hello, <whatever>'
No, you're mixing up `format`, an explicit method on str that no one is suggesting adding to bytes, and `__format__`, a dunder method on every type that's used by `str.format` and `format`; the proposal is to extend `bytes.__format__` in some way that I don't think is entirely decided yet, but it would look something like this:
u'Hello, {:a}'.format(some_bytes_var) --> u'Hello, <whatever>'
Or:
u'Hello, {:#x}'.format(some_bytes_var) --> u'Hello, \\x2d\\x78\\x68\\x61...'
(Yes, I remember unicode == str, I was just being explicit ;)
That would certainly go along with the idea that `format` is for strings.
-- ~Ethan~ _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

On 09/16/2014 01:09 PM, Andrew Barnert wrote:
No, you're mixing up `format`, an explicit method on str that no one is suggesting adding to bytes, and `__format__`, a dunder method on every type that's used by `str.format` and `format`; the proposal is to extend `bytes.__format__` in some way that I don't think is entirely decided yet, but it would look something like this:
u'Hello, {:a}'.format(some_bytes_var) --> u'Hello, <whatever>'
Or:
u'Hello, {:#x}'.format(some_bytes_var) --> u'Hello, \\x2d\\x78\\x68\\x61...'
Ah, that makes more sense, thanks for the clarification! -- ~Ethan~
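The distinction drawn above can be sketched with a toy type. This is only an illustration: the `Celsius` class and its "F" spec are hypothetical and are not part of any proposal in this thread. It shows that `str.format` and the `format` builtin both delegate to the argument's `__format__`, passing along whatever followed the colon, and that the result is always a str.

```python
class Celsius:
    """Hypothetical example type, used only to show the dispatch."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Both format(obj, spec) and '{:spec}'.format(obj) end up here.
        if spec == "":
            return str(self.degrees)
        if spec == "F":
            return "{:.1f}F".format(self.degrees * 9 / 5 + 32)
        raise ValueError("unknown format spec: {!r}".format(spec))


temp = Celsius(100)
via_builtin = format(temp, "F")             # calls Celsius.__format__(temp, "F")
via_method = "Boiling at {:F}".format(temp)  # same dunder, via str.format

# bytes has no .format() method of its own; only str does.
bytes_has_format_method = hasattr(bytes, "format")
```

Extending `bytes.__format__` would therefore only change what a bytes object looks like when it is interpolated into a str; it would not add a `format` method to bytes.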

On 9/16/2014 4:09 PM, Andrew Barnert wrote:
the proposal is to extend `bytes.__format__`
Currently bytes just inherits object.__format__
>>> bytes.__format__.__qualname__
'object.__format__'
object.__format__ does not allow a non-empty format string.
>>> 'a{}b'.format(b'c\xdd')
"ab'c\\xdd'b"
>>> 'a{:a}b'.format(b'c\xdd')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    'a{:a}b'.format(b'c\xdd')
TypeError: non-empty format string passed to object.__format__
in some way that I don't think is entirely decided yet, but it would look something like this:
u'Hello, {:a}'.format(some_bytes_var) --> u'Hello, <whatever>'
Or:
u'Hello, {:#x}'.format(some_bytes_var) --> u'Hello, \\x2d\\x78\\x68\\x61...'
-- Terry Jan Reedy
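The behaviour Terry describes can be checked without reading a traceback. A minimal sketch, assuming only that bytes still inherits `__format__` from object; the exact wording of the error message may differ between Python versions, so the example inspects the exception type rather than its text.

```python
data = b'c\xdd'

# bytes does not define __format__ in its own namespace; it comes from object.
defined_on_bytes = '__format__' in vars(bytes)

# An empty format spec falls back to str(), which for bytes is the repr.
plain = format(data, '')

# Any non-empty spec is rejected.
try:
    format(data, 'a')
except TypeError as exc:
    error = exc
else:
    error = None
```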

On 17 September 2014 06:09, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
No, you're mixing up `format`, an explicit method on str that no one is suggesting adding to bytes, and `__format__`, a dunder method on every type that's used by `str.format` and `format`; the proposal is to extend `bytes.__format__` in some way that I don't think is entirely decided yet, but it would look something like this:
u'Hello, {:a}'.format(some_bytes_var) --> u'Hello, <whatever>'
Or:
u'Hello, {:#x}'.format(some_bytes_var) --> u'Hello, \\x2d\\x78\\x68\\x61...'
Ignoring the specifics of the minilanguage, here are the examples I posted to http://bugs.python.org/issue22385:
format(b"xyz", "x") -> '78797a'
format(b"xyz", "X") -> '78797A'
format(b"xyz", "#x") -> '0x78797a'
format(b"xyz", ".1x") -> '78 79 7a'
format(b"abcdwxyz", ".4x") -> '61626364 7778797a'
format(b"abcdwxyz", "#.4x") -> '0x61626364 0x7778797a'
format(b"xyz", ",.1x") -> '78,79,7a'
format(b"abcdwxyz", ",.4x") -> '61626364,7778797a'
format(b"abcdwxyz", "#,.4x") -> '0x61626364,0x7778797a'
The point on the issue tracker was that while this is a good way to obtain the flexibility, adhering too closely to the "standard format syntax" as I did likely isn't a good idea. Instead, we'd be better going for the strftime model where a type specific format (e.g. as an argument to the new *.hex() methods being discussed in http://bugs.python.org/issue) is *also* supported via __format__.
For example, inspired directly by the way hex editors work, you could envision an approach where you had a base format character (chosen to be orthogonal to the default format characters):
"h": lowercase hex
"H": uppercase hex
"A": ASCII (using "." for unprintable & extended ASCII)
format(b"xyz", "A") -> 'xyz'
format(b"xyz", "h") -> '78797a'
format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "h 1") -> '78 79 7a'
format(b"abcdwxyz", "h 4") -> '61626364 7778797a'
format(b"xyz", "h,1") -> '78,79,7a'
format(b"abcdwxyz", "h,4") -> '61626364,7778797a'
format(b"xyz", "h:1") -> '78:79:7a'
format(b"abcdwxyz", "h:4") -> '61626364:7778797a'
In the "h" and "H" cases, you could request a preceding "0x" on the chunks:
format(b"xyz", "h#") -> '0x78797a'
format(b"xyz", "h# 1") -> '0x78 0x79 0x7a'
format(b"abcdwxyz", "h# 4") -> '0x61626364 0x7778797a'
The section before the format character would use the standard string formatting rules: alignment, fill character, width, precision
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
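The hex-editor-style mini-language in this message can be tried out with a small wrapper class. This is a sketch under stated assumptions, not an implementation of anything that was accepted: the class name `HexView`, the regular expression, the restriction of the separator to space, comma and colon (the three characters used in the examples), and the fallback for an empty spec are all choices made here for illustration. The "standard" fill, alignment and width prefix is left out.

```python
import re

# Hypothetical grammar: type char, optional '#', optional separator + chunk size.
_SPEC = re.compile(r"([hHA])(#?)(?:([ ,:])([0-9]+))?")


class HexView:
    """Hypothetical wrapper around any buffer exporter (bytes, bytearray, ...)."""

    def __init__(self, data):
        # memoryview() rejects objects that do not export a buffer.
        self._data = bytes(memoryview(data))

    def __format__(self, spec):
        if not spec:
            # Assumption: an empty spec falls back to the ordinary bytes repr.
            return repr(self._data)
        match = _SPEC.fullmatch(spec)
        if match is None:
            raise ValueError("unsupported format spec: %r" % spec)
        kind, prefix, sep, size = match.groups()
        if kind == "A" and prefix:
            raise ValueError("'#' only applies to 'h' and 'H'")
        if size is None:
            chunks = [self._data]
        else:
            step = int(size)
            if step < 1:
                raise ValueError("chunk size must be positive")
            chunks = [self._data[i:i + step]
                      for i in range(0, len(self._data), step)]
        return (sep or "").join(self._render(c, kind, prefix) for c in chunks)

    @staticmethod
    def _render(chunk, kind, prefix):
        if kind == "A":
            # Printable ASCII is shown as-is, everything else as '.'.
            return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        digits = chunk.hex()
        if kind == "H":
            digits = digits.upper()
        return "0x" + digits if prefix else digits
```

Because the wrapper only implements `__format__`, it works with both `format(HexView(b"xyz"), "h 1")` and `"{:h,4}".format(HexView(b"abcdwxyz"))`.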

On 09/17/2014 06:57 AM, Nick Coghlan wrote:
The point on the issue tracker was that while this is a good way to obtain the flexibility, adhering too closely to the "standard format syntax" as I did likely isn't a good idea. Instead, we'd be better going for the strftime model where a type specific format (e.g. as an argument to the new *.hex() methods being discussed in http://bugs.python.org/issue) is *also* supported via __format__.
One thing I'd like to not support here that strftime does: arbitrary pass-through text in the format string. It's useful for date/time, but not here. And your examples below don't allow it, I just wanted to be clear.
For example, inspired directly by the way hex editors work, you could envision an approach where you had a base format character (chosen to be orthogonal to the default format characters):
"h": lowercase hex "H": uppercase hex "A": ASCII (using "." for unprintable & extended ASCII)
format(b"xyz", "A") -> 'xyz' format(b"xyz", "h") -> '78797a' format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "h 1") -> '78 79 7a' format(b"abcdwxyz", "h 4") -> '61626364 7778797a'
format(b"xyz", "h,1") -> '78,79,7a' format(b"abcdwxyz", "h,4") -> '61626364,7778797a'
format(b"xyz", "h:1") -> '78:79:7a' format(b"abcdwxyz", "h:4") -> '61626364:7778797a'
In the "h" and "H" cases, you could request a preceding "0x" on the chunks:
format(b"xyz", "h#") -> '0x78797a' format(b"xyz", "h# 1") -> '0x78 0x79 0x7a' format(b"abcdwxyz", "h# 4") -> '0x61626364 0x7778797a'
The section before the format character would use the standard string formatting rules: alignment, fill character, width, precision
I think that's too confusing. For example, '#' is also allowed before the format character:
[[fill]align][sign][#][0][width][,][.precision][type]
And precision doesn't make sense for bytes (and is currently not allowed for int). So you'd instead have the complete format specifier be:
[[fill]align][sign][#][0][width][type][#][internal-fill][chunksize]
I think "sign" might have to go: it doesn't make sense. Not sure about "0". Let's say they both go, and we're left with:
[[fill]align][width][type][#][internal-fill][chunksize]
I'm not completely opposed to this, but I think we can do better. I see basically 3 options for byte format specifiers:
1. Support exactly what the standard types (int, str, float, etc.) support, but give slightly different semantics to it. This is what Nick originally proposed on the issue tracker.
2. Support a slightly different format specifier. This is what Nick proposes above, and I discuss more below. The downside of this is that it might be confusing to some users, who see the printf-like formatting as some universal standard. It's also hard to document.
3. Do something radically different. I gave an example on the issue tracker, but I'm not totally serious about this.
Here's my proposal for #2: The format specifier becomes:
[[fill]align][#][width][separator][/chunksize][type]
The only difference here (from the standard format specifier) is that I've substituted "/chunksize" for ".precision", and generalized the separator. I think "/" reads well as "divide into chunks this size". We might want to restrict "separator" to a few characters, maybe one of space, colon, dot, comma. I think Nick's usage of 'A', 'H', and 'h' for the "type" character is good, although I'd really prefer 'a'. And it's possible 'x' and 'X' would be less confusing (because they're more familiar), but maybe they would increase confusion. Let's keep 'h' and 'H' for now, just for discussion purposes.
So, Nick's examples become:
format(b"xyz", "a") -> 'xyz'
format(b"xyz", "h") -> '78797a'
format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "/1h") -> '78 79 7a'
format(b"abcdwxyz", "/4h") -> '61626364 7778797a'
format(b"xyz", ",/1h") -> '78,79,7a'
format(b"abcdwxyz", ",/4h") -> '61626364,7778797a'
format(b"xyz", ":/1h") -> '78:79:7a'
format(b"abcdwxyz", ":/4h") -> '61626364:7778797a'
format(b"xyz", "#h") -> '0x78797a'
format(b"xyz", "#/1h") -> '0x78 0x79 0x7a'
format(b"abcdwxyz", "#/4h") -> '0x61626364 0x7778797a'
I really haven't thought through parsing this format specifier. Obviously "separator" will have some restrictions, like it can't be a slash. I'll have to give it some more thought. As with the standard format specifiers, there are some restrictions. 'a' couldn't have '#', for example. But I don't see why it couldn't have 'chunksize'.
Eric.
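The open question about parsing can be explored with a small standalone function. The following is a sketch of one possible reading only: the name `format_bytes`, the regular expression, and the decision to reject a separator that has no "/chunksize" are assumptions made here, and the `[[fill]align]` and `[width]` fields are deliberately left out because the message does not settle how they would be disambiguated from the separator. The separator is limited to the four characters suggested above, and defaults to a space when only a chunk size is given, as in the "/1h" examples.

```python
import re

# Subset of the proposed spec: [#][separator][/chunksize][type]
_SPEC = re.compile(r"(#?)([ :.,]?)(?:/([0-9]+))?([ahH])")


def format_bytes(data, spec):
    """Hypothetical helper rendering `data` according to the subset above."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError("unsupported format spec: %r" % spec)
    prefix, sep, size, kind = match.groups()
    if sep and size is None:
        # Assumption: a separator is meaningless without a chunk size.
        raise ValueError("separator requires /chunksize")
    if kind == "a" and prefix:
        raise ValueError("'#' is not allowed with 'a'")

    raw = bytes(memoryview(data))
    if size is None:
        chunks = [raw]
    else:
        step = int(size)
        if step < 1:
            raise ValueError("chunk size must be positive")
        chunks = [raw[i:i + step] for i in range(0, len(raw), step)]

    def render(chunk):
        if kind == "a":
            return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        digits = chunk.hex()
        if kind == "H":
            digits = digits.upper()
        return "0x" + digits if prefix else digits

    return (sep or " ").join(render(chunk) for chunk in chunks)
```

With the type character last, the parser can read the spec left to right without lookahead, which is one practical difference from the "type first" layout.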

On Sep 17, 2014, at 4:48, "Eric V. Smith" <eric@trueblade.com> wrote:
2. Support a slightly different format specifier. This is what Nick proposes above, and I discuss more below. The downside of this is that it might be confusing to some users, who see the printf-like formatting as some universal standard. It's also hard to document.
The possibility of confusion might be increased if some of the options to bytes look like they should work for str. People will ask, "I can chunk bytes into groups of 4 with /4, why can't I do that with characters when the rest of the format specifier is the same?"
Also, are there other languages that use printf-style specifiers and have %x, with the same options as for int, working for their bytes type? IIRC Go lets you print strings as numbers, as if their UTF-8 representation were a giant big-endian integer; if it's just a consequence of a little-used feature in a language that's nobody's first language, that probably won't add to confusion, but if it's more common and widespread, it might be worth either matching what the others do, or deliberately being as different as possible to prevent confusion.
Nick's use of 'h' instead of 'x' and his rearrangement of the fields definitely avoids giving the appearance of being printf-like and any confusion that might cause, while still being able to share fields that make sense. But of course avoiding printf-like-ness means it's a new thing people have to learn. (Of course eventually they want to do something where the format isn't identical to printf, and many of them seem to go to StackOverflow or IRC and complain that there's a "bug in str.format" instead of just glancing at the docs, so maybe making them learn early isn't such a bad thing...)

Andrew Barnert writes:
The possibility of confusion might be increased if some of the options to bytes look like they should work for str. People will ask, "I can chunk bytes into groups of 4 with /4, why can't I do that with characters when the rest of the format specifier is the same?"
Isn't the answer to that kind of question "because you haven't written the PEP yet?" Or "Repeat after me, 'bytes are not str' ... Very good, now do a set of 100 before each meal for a week." After all, there are things you can do with integer or float formats that you can't do with str and vice versa. bytes are indeed very similar to str as streams of code units (octets vs. characters), but the specific usages for human-oriented text (including such unnatural languages as C and Perl) require some differences in semantics. The sooner people get comfortable with that, the better, of course, but I don't think the language should be prevented from evolving because many people are going to take a while to get the difference and its importance.
(Of course eventually they want to do something where the format isn't identical to printf, and many of them seem to go to StackOverflow or IRC and complain that there's a "bug in str.format" instead of just glancing at the docs, so maybe making them learn early isn't such a bad thing...)
Obviously, given the snotty remark above, I sympathize. But I doubt it's really going to help that. It's just going to give them one more thing to complain about.<wink/>

On Sep 17, 2014, at 21:21, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Andrew Barnert writes:
The possibility of confusion might be increased if some of the options to bytes look like they should work for str. People will ask, "I can chunk bytes into groups of 4 with /4, why can't I do that with characters when the rest of the format specifier is the same?"
Isn't the answer to that kind of question "because you haven't written the PEP yet?"
Or "Repeat after me, 'bytes are not str' ... Very good, now do a set of 100 before each meal for a week."
As long as you don't ask for a set of 100 bytearrays, because they're not hashable.
After all, there are things you can do with integer or float formats that you can't do with str and vice versa.
bytes are indeed very similar to str as streams of code units (octets vs. characters), but the specific usages for human-oriented text (including such unnatural languages as C and Perl) require some differences in semantics. The sooner people get comfortable with that, the better, of course, but I don't think the language should be prevented from evolving because many people are going to take a while to get the difference and its importance.
I think we agree on all of that. (By the way, is there a word for that Unicode ignorance and confusion? Something like "illiteracy" and "innumeracy", but probably spelled with a non-BMP character, maybe U+1F4A9?) My point is that, given a choice between two APIs, one which reinforces the illusion that bytes are text and one which doesn't, the latter gets points. (And similarly for format vs. printf.) Of course on the other hand, when str and bytes really _are_ perfect parallels in some way, making them gratuitously inconsistent just adds more things to learn and memorize. At this point, I'm not sure that adds up to an argument for Nick's less-str-like version of his original proposal, or against it, but I'm pretty sure it's a good argument for one or other...
(Of course eventually they want to do something where the format isn't identical to printf, and many of them seem to go to StackOverflow or IRC and complain that there's a "bug in str.format" instead of just glancing at the docs, so maybe making them learn early isn't such a bad thing...)
Obviously, given the snotty remark above, I sympathize. But I doubt it's really going to help that. It's just going to give them one more thing to complain about.<wink/>
Yes, people can be amazingly good at avoiding learning.

Andrew Barnert writes:
(By the way, is there a word for that Unicode ignorance and confusion? Something like "illiteracy" and "innumeracy", but probably spelled with a non-BMP character, maybe U+1F4A9?)
"Non-superhuman." "Noncharacter" is a case in point. And yes, it's properly spelled with U+1F4A9, but my spellchecker has "parental controls" and I can't enter it. [various perceptive comments elided]
At this point, I'm not sure that adds up to an argument for Nick's less-str-like version of his original proposal, or against it, but I'm pretty sure it's a good argument for one or other...
That's exactly the way I feel. So I would say "damn the torpedoes" and "Just Do It" and if it's wrong we'll fix it in the mythical-never-to-be-implemented-and-so-unmentionable-that-Big-Brother-will-undoubtedly-come-take-me-away Python 4000. Of course we should wait to see if Guido or other reliable oracle has a particular opinion, but I really don't think we're going to get proof without trying.

On 18 Sep 2014 21:22, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Of course we should wait to see if Guido or other reliable oracle has a particular opinion, but I really don't think we're going to get proof without trying.
3.5 is still a year or so away, so we have time to ponder the details. I do think it's a good direction to be considering, though. Note also that this is something that could (and probably should) be experimented with on PyPI via a wrapper class that iterated over a wrapped buffer exporter in __format__. Cheers, Nick.

What is this "it" that you propose to just do? I'm sure I have an opinion on it once you describe it to me. On Thursday, September 18, 2014, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Andrew Barnert writes:
(By the way, is there a word for that Unicode ignorance and confusion? Something like "illiteracy" and "innumeracy", but probably spelled with a non-BMP character, maybe U+1F4A9?)
"Non-superhuman." "Noncharacter" is a case in point. And yes, it's properly spelled with U+1F4A9, but my spellchecker has "parental controls" and I can't enter it.
[various perceptive comments elided]
At this point, I'm not sure that adds up to an argument for Nick's less-str-like version of his original proposal, or against it, but I'm pretty sure it's a good argument for one or other...
That's exactly the way I feel. So I would say "damn the torpedoes" and "Just Do It" and if it's wrong we'll fix it in the mythical-never-to-be-implemented-and-so-unmentionable-that-Big-Brother-will-undoubtedly-come-take-me-away Python 4000.
Of course we should wait to see if Guido or other reliable oracle has a particular opinion, but I really don't think we're going to get proof without trying.
-- --Guido van Rossum (on iPad)

Guido van Rossum writes:
What is this "it" that you propose to just do? I'm sure I have an opinion on it once you describe it to me.
I'm sorry, I probably shouldn't have taken your name in vain at this stage. There are no solid proposals yet; the details of format characters, how to use "precision", the symbol to indicate "chunking", etc. are all under discussion.
Brief summary and links, if you care to read further: At present there are at least three kinds of proposals on the table for a __format__ for bytes objects, with the proposals and discussion being collected in http://bugs.python.org/issue22385. Eric V. Smith gave the following summary (edited by me for brevity) in https://mail.python.org/pipermail/python-ideas/2014-September/029353.html
1. Support exactly what the standard types (int, str, float, etc.) support, but give slightly different semantics to it.
2. Support a slightly different format specifier. The downside of this is that it might be confusing to some users, who see the printf-like formatting as some universal standard. It's also hard to document.
3. Do something radically different. I gave an example on the issue tracker [cited above], but I'm not totally serious about this.
My "Just Do It" was mostly ignoring the possibility of Eric's #3 (Eric was even more deprecatory in the issue, saying "although it's insane, you could ..."). I was specifically referring to Eric's and Andrew Barnert's discussion of potential confusion, Eric saying "if it's different, users used to printf may get confused" and Andrew saying (among other ideas) "if it's too close to the notation for str, it could exacerbate the existing confusion between bytes and str". I don't see the too close/too different issue as something we can decide without implementing it. Perhaps experience with a PyPI module would give guidance, but I'm not optimistic; the kind of user who would use a PyPI module for this feature is atypical, I think.
*****
In somewhat more detail, Nick's original proposal (in that issue) follows existing format strings very closely:
"x": display a-f as lowercase digits
"X": display A-F as uppercase digits
"#": includes 0x prefix
".prec": chunks output, placing a space after every <prec> bytes
",": uses a comma as the separator, rather than a space
Further discussion and examples in https://mail.python.org/pipermail/python-ideas/2014-September/029352.html. There he made a second proposal, rather different:
"h": lowercase hex
"H": uppercase hex
"A": ASCII (using "." for unprintable & extended ASCII)
format(b"xyz", "A") -> 'xyz'
format(b"xyz", "h") -> '78797a'
format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "h 1") -> '78 79 7a'
format(b"abcdwxyz", "h 4") -> '61626364 7778797a'
format(b"xyz", "h,1") -> '78,79,7a'
format(b"abcdwxyz", "h,4") -> '61626364,7778797a'
format(b"xyz", "h:1") -> '78:79:7a'
format(b"abcdwxyz", "h:4") -> '61626364:7778797a'
In the "h" and "H" cases, you could request a preceding "0x" on the chunks:
format(b"xyz", "h#") -> '0x78797a'
format(b"xyz", "h# 1") -> '0x78 0x79 0x7a'
format(b"abcdwxyz", "h# 4") -> '0x61626364 0x7778797a'
Nick was clear that all of the notation in the above is tentative in his mind.
The third proposal is from Eric Smith, in https://mail.python.org/pipermail/python-ideas/2014-September/029353.html (already cited above):
Here's my proposal for #2: The format specifier becomes:
[[fill]align][#][width][separator][/chunksize][type]

As someone who uses ASCII or more commonly UTF-8 byte sequences, I find the current ascii-ish default display handy. That said... On 10Sep2014 20:57, Steven D'Aprano <steve@pearwood.info> wrote:
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
To my eye that is a single number expressed in base 16 and would imply an endianness. I imagine you really mean a transcription of the bytes in hex, with a leading 0x to indicate the transcription. But it is not what my eye sees. Of course, the natural transcription above implies big endianness, as is only right and proper:-) Why not give bytes objects a .hex method, emitting bare hex with no leading 0x? That would be my first approach.
b'Abc'.decode('hexescapes') --> '\x41\x62\x63'
OTOH, this is really neat. And .decode('hex') for the former? Cheers, Cameron Simpson <cs@zip.com.au> It never will rain roses; when we want to have more roses we must plant more trees. - George Eliot, The Spanish Gypsy
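The two transcriptions discussed in this message, and the endianness concern about a leading 0x, can be illustrated with plain Python. The helper names `bare_hex` and `hex_escapes` are hypothetical; in particular, the 'hexescapes' codec mentioned above is an idea from the thread, not an existing codec, so the sketch uses an ordinary function instead.

```python
def bare_hex(data):
    """Transcribe each byte as two lowercase hex digits, with no 0x prefix."""
    return "".join("%02x" % byte for byte in data)


def hex_escapes(data):
    """Transcribe each byte as a literal backslash-x escape, as text."""
    return "".join("\\x%02x" % byte for byte in data)


data = b'Abc'
bare = bare_hex(data)
escaped = hex_escapes(data)

# Reading '0x416263' as one number silently picks a byte order:
as_big_endian = int.from_bytes(data, 'big')
as_little_endian = int.from_bytes(data, 'little')
```

Here `hex(as_big_endian)` reproduces the digits of the bare transcription, while the little-endian reading of the same three bytes is a different integer, which is why the 0x prefix can mislead.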

On 9/10/2014 7:48 PM, Cameron Simpson wrote:
As someone who uses ASCII or more commonly UTF-8 byte sequences, I find the current ascii-ish default display handy. That said...
On 10Sep2014 20:57, Steven D'Aprano <steve@pearwood.info> wrote:
However, I do support Terry's suggestion that bytes (and, I presume, bytearray) grow some sort of easy way of displaying the bytes in hex. The trouble is, what do we actually want?
b'Abc' --> '0x416263'
To my eye that is a single number expressed in base 16 and would
To mine also.
imply an endianness. I imagine you really mean a transcription of the bytes in hex, with a leading 0x to indicate the transcription. But it is not what my eye sees.
Of course, the natural transcription above implies big endianness, as is only right and proper:-)
Why not give bytes objects a .hex method, emitting bare hex with no leading 0x? That would be my first approach.
That is the initial proposal of http://bugs.python.org/issue9951, which I favor. -- Terry Jan Reedy
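For readers on a current interpreter: to the best of my knowledge, bytes objects gained a `.hex()` method in Python 3.5, and optional separator and group-size arguments were added in Python 3.8. The sketch below assumes Python 3.8 or later; `binascii.hexlify` is the older standard-library route that was already available when this thread was written.

```python
import binascii

data = b'Abc'

# Bare hex with no leading 0x, as suggested in the thread.
bare = data.hex()

# Round trip back to the original bytes.
restored = bytes.fromhex(bare)

# Older route: hexlify returns bytes, not str.
legacy = binascii.hexlify(data)

# Python 3.8+: an optional separator, and an optional number of bytes per group.
colon_separated = b'xyz'.hex(':')
grouped = b'abcdwxyz'.hex(' ', 4)
```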
participants (20)
- Andrew Barnert
- Barry Warsaw
- Cameron Simpson
- Chris Angelico
- Chris Lasher
- Cory Benfield
- Eric V. Smith
- Ethan Furman
- Greg Ewing
- Guido van Rossum
- Ian Cordasco
- M.-A. Lemburg
- Nick Coghlan
- Paul Moore
- Ron Adam
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Wichert Akkerman
- Wolfgang Maier