[Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Wed Sep 10 21:27:10 CEST 2014

On Sep 10, 2014, at 11:35, Chris Lasher <chris.lasher at gmail.com> wrote:

> I originally wrote this late last night but realized today that I only
> sent this reply to Terry Reedy, not to python-ideas. (Apologies, Terry
> – I didn't mean to single you out with my rant!)
> 
> I'm reposting it in full, below. Some of these ideas have already been
> raised by others and counter-arguments already posed. I still feel I
> have not seen some of these points directly addressed, namely, the
> unreasonableness of seeing bytes from floating point numbers as ASCII
> characters, and the sanity of the API I counter-propose.
> 
> Message now appears below:
> 
> On Wed, Sep 10, 2014 at 1:11 AM, Terry Reedy <tjreedy at udel.edu> wrote:
>> 
>> I agree with Chris Lasher's basic point, that the representation of bytes confusingly contradicts the idea that bytes are bytes.  But it is not going to change.
> 
> 
> Unless printable representation of bytes objects appears as part of
> the language specification for Python 3, it's an implementation
> detail, thus, it is a candidate for change, especially if the BDFL
> wills it so. Consider me optimistic that we can change it, or I would
> have just posted yet another "Python 3 gets it all wrong" blog post to
> the web instead of writing this pre-proposal. :-)
> 
>> 
>> 
>> 
>> On 9/10/2014 3:56 AM, Cory Benfield wrote:
>>> 
>>> On 10 September 2014 08:45, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>>> 
>>>> memoryview.cast can be a potentially useful tool for that :)
>>> 
>>> 
>>> Sure, and so can binascii.hexlify (which is what I normally use).
>> 
>> 
>> See http://bugs.python.org/issue9951 to add bytes.hex or .tohex as more or less the inverse of bytes.fromhex or even have hex(bytes) work.  This change *is* possible and I think we should pick one of the suggestions for 3.5.
> 
> 
> 
> Here's the API Issue 9951 is proposing:
> 
> >>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>    b'Hello, World!'
> >>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.tohex()
>    b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
> >>> b'Hello, World!'
>    b'Hello, World!'
> >>> b'Hello, World!'.tohex()
>    b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
> 
> 
> I'll tell you what: here's the API of my counter-proposal:
> 
> >>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>    b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
> >>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.asciify()
>    b'Hello, World!'
> >>> b'Hello, World!'
>    b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
> >>> b'Hello, World!'.asciify()
>    b'Hello, World!'

It strikes me that we should have both asciify and hexlify (or whatever we call them) so people can be explicit when debugging; the question then becomes which one repr calls. At that point it really is just a question of which group of developers (those working on HTTP/2.0 or those working on HTTP/1.1, for example) gets to be "lazy" instead of explicit in their debugging.
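
For what it's worth, both spellings can already be approximated in today's Python, so here is a rough sketch of what the two explicit calls might look like. The names hexlify_repr and asciify_repr are made up for illustration; they are not the methods proposed in this thread or in issue 9951, and the "asciify" side is simply what repr() already does.

```python
import ast


def hexlify_repr(data):
    """Show every byte as a hex escape inside a bytes literal."""
    return "b'" + "".join("\\x{:02x}".format(byte) for byte in data) + "'"


def asciify_repr(data):
    """Show printable ASCII as characters and everything else as escapes.

    This is simply what repr() of a bytes object does today.
    """
    return repr(bytes(data))


payload = b"Hello, World!"
print(hexlify_repr(payload))
print(asciify_repr(payload))

# Both forms are valid bytes literals, so either one round-trips.
assert ast.literal_eval(hexlify_repr(payload)) == payload
assert ast.literal_eval(asciify_repr(payload)) == payload
```

Since either string evaluates back to the same object, the choice really is only about which one you get without asking for it.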

The argument in favor of repr calling "hexlify" is that the hex representation is more purist.

The argument in favor of repr calling "asciify" is that it makes Python 3.6 do the same thing as 3.0-3.5, and in fact 1.0-2.7 as well; people have had a few decades to get used to being lazy with mostly-ASCII protocols, and just as long to get used to being explicit with pure-binary protocols.

But maybe there's another potential concern that can help decide. A lot of novices using bytes get confused when they see b'\x05Hello' and ask questions about how to deal with that 9-character string. (You can see them all over StackOverflow, for example.) Of course the same people also ask how to get the b out of their string, etc.; obviously they need to be taught the difference between a bytes and its repr no matter what. Would switching to hexlify as a default help those people by forcing them to confront their confusion early, or slow them down by not letting them write a lot of simple code and learn other important stuff before getting to that confusion? I think that the answer to that might be as compelling as the answer to which group of experienced developers (where the groups often overlap) deserves to be allowed to be lazy. But I don't have the answer...
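
To make the novice confusion concrete, here is a small sketch of the difference between the object and its repr, using only the example above and built-ins:

```python
data = b'\x05Hello'

# The object holds six bytes: one for \x05 and five for "Hello".
print(len(data))        # 6

# The repr is text describing those bytes: the b prefix, two quotes,
# a four-character escape, and five letters.
print(repr(data))       # b'\x05Hello'
print(len(repr(data)))  # 12

# Indexing yields integers, not one-character strings.
print(data[0])          # 5
print(list(data))       # [5, 72, 101, 108, 108, 111]
```

The "b" and the backslash escape exist only in the repr, which is the part people need to be taught whichever default we pick.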

> Here's the prose description of my counter-proposal: add a method to
> the bytes object called `.asciify`, that returns a printable
> representation of the bytes, where bytes mapping to printable ASCII
> characters are displayed as ASCII characters, and the remainder are
> given as hex codes. That is, .asciify() should round-trip a bytes
> literal. This frees up repr() to do what universally makes sense on a
> series of bytes: state the bytes!
> 
> 
> Marc-Andre Lemburg said:
>> 
>> A definite -1 from me on making repr(b"Hello World") harder to read than necessary.
> 
> 
> Okay, but a definite -1e6 from me on making my Python interpreter do this:
> 
> >>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12,
> ...     1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
> >>> my_packed_bytes
>    b'Why, Guido? Why?'
> 
> I do understand the utility of peering into ASCII text, but like Cory
> Benfield stated earlier:
> 
>> I'm saying that I don't get to do debugging with a simple
>> print statement when using the bytes type to do actual binary work,
>> while those who are doing sort-of binary work do.
> 
> 
> Does the inconvenience of having to explicitly call the .asciify()
> method on a bytes object justify the current behavior for repr() on a
> bytes object? The privilege of being lazy is obstructing the right to
> see what we've actually got in the bytes object, and is jeopardizing
> the very argument that "bytes are not strings".
> 
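
Being explicit with data like this is already possible with the tools mentioned earlier in the thread (binascii.hexlify and memoryview.cast). A minimal sketch, starting from the bytes in the example above; I have pinned little-endian with '<', which is an assumption on my part, since the quoted call used native order:

```python
import binascii
import struct

raw = b'Why, Guido? Why?'

# '<' pins little-endian; 'ffff' alone would use the machine's native
# order, which gives the same bytes on a little-endian machine.
floats = struct.unpack('<ffff', raw)
hex_digits = binascii.hexlify(raw)

print(raw)         # the default repr shows ASCII characters
print(hex_digits)  # the hex digits of the same 16 bytes
print(floats)      # the same 16 bytes read as four 32-bit floats

# memoryview.cast reinterprets the buffer without copying, but it
# always uses the machine's native byte order.
native_floats = memoryview(raw).cast('f').tolist()
print(native_floats)
```

None of this settles which view repr should pick; it only shows that the explicit route exists today for whichever group loses.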
> On Wed, Sep 10, 2014 at 10:51 AM, Cory Benfield <cory at lukasa.co.uk> wrote:
>> On 10 September 2014 17:59, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>>> So does 0xDEADBEEF, but actually that's *not* text, it's a 32-bit
>>> pointer, conveniently invalid on most 32-bit architectures and very
>>> obvious when it shows up in a backtrace.  Do you see an impedance
>>> mismatch in the C community because of that?
>>> 
>>> In fact, *all* bytes "look like text", because *you can't see them
>>> until they're converted to text by repr()*!  This is the key to the
>>> putative "impedance mismatch" -- it's perceived as such when people
>>> don't distinguish the map from the territory.
>> 
>> I apologise, I was insufficiently clear. I mean that interaction with
>> the bytes type in Python has a lot of textual aspects to it. This is a
>> *deliberate* decision (or at least the documentation makes it seem
>> deliberate), and I can understand the rationale, but it's hard to be
>> surprised that it leads developers astray.
>> 
>> Also, while I'm being picky, 0xDEADBEEF is not a 32-bit pointer, it's
>> a 32-bit something. Its type is undefined in that expression. It has a
>> standard usage as a guard word, but still, let's not jump to
>> conclusions here!
>> 
>> I accept your core point, however, which I consider to be this:
>> 
>>> The issue that sometimes it's easier to read hex than ASCII mixed with
>>> other stuff (hex escapes or Latin-1) is true enough, though.  But it's
>>> not about an impedance mismatch, it's a question of what does *this*
>>> developer consider to be the convenient repr for *that* task.
>> 
>> This is definitely true, which I believe I've already admitted in this
>> thread. I do happen to believe that having it be hex would provide a
>> better pedagogical position ("you know this isn't text because it
>> looks like gibberish!"), but that ship sailed a long time ago.
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/