Re: [Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

10 Sep 2014

      On Sep 10, 2014, at 11:35, Chris Lasher  wrote:
...
I originally wrote this late last night but realized today that I only
sent this reply to Terry Reedy, not to python-ideas. (Apologies, Terry
– I didn't mean to single you out with my rant!)
I'm reposting it in full, below. Some of these ideas have already been
raised by others and counter-arguments already posed. I still feel I
have not seen some of these points directly addressed, namely, the
unreasonableness of seeing bytes from floating point numbers as ASCII
characters, and the sanity of the API I counter-propose.
Message now appears below:
On Wed, Sep 10, 2014 at 1:11 AM, Terry Reedy  wrote:
...
I agree with Chris Lasher's basic point, that the representation of bytes confusingly contradicts the idea that bytes are bytes.  But it is not going to change.
Unless printable representation of bytes objects appears as part of
the language specification for Python 3, it's an implementation
detail, thus, it is a candidate for change, especially if the BDFL
wills it so. Consider me optimistic that we can change it, or I would
have just posted yet another "Python 3 gets it all wrong" blog post to
the web instead of writing this pre-proposal. :-)
...
On 9/10/2014 3:56 AM, Cory Benfield wrote:
...
On 10 September 2014 08:45, Nick Coghlan  wrote:
...
memoryview.cast can be a potentially useful tool for that :)
Sure, and so can binascii.hexlify (which is what I normally use).
See http://bugs.python.org/issue9951, which proposes adding bytes.hex or .tohex as more or less the inverse of bytes.fromhex, or even making hex(bytes) work.  This change *is* possible and I think we should pick one of the suggestions for 3.5.
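A minimal sketch of the tools named above as they exist in the standard library: binascii.hexlify, bytes.fromhex, and memoryview.cast. The variable names are illustrative only, and the result of the cast depends on the platform's native byte order.

```python
import binascii

data = b'Hello, World!'

# hexlify returns the hex digits as a bytes object, two digits per byte.
hex_digits = binascii.hexlify(data)

# bytes.fromhex goes the other way, from a str of hex digits back to bytes.
round_tripped = bytes.fromhex(hex_digits.decode('ascii'))

# memoryview.cast reinterprets the same buffer with a different item format;
# 'H' is an unsigned short in native size and byte order.
as_shorts = memoryview(b'\x01\x00\x02\x00').cast('H').tolist()

print(hex_digits)
print(round_tripped == data)
print(as_shorts)
```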
Here's the API Issue 9951 is proposing:
...
...
...
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'Hello, World!'
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.tohex()
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'Hello, World!'
b'Hello, World!'
>>> b'Hello, World!'.tohex()
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
I'll tell you what: here's the API of my counter-proposal:
...
...
...
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'.asciify()
b'Hello, World!'
>>> b'Hello, World!'
b'\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21'
>>> b'Hello, World!'.asciify()
b'Hello, World!'
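The all-hex display in this counter-proposal can be approximated today with a small helper. This is only a sketch: hex_escaped is a hypothetical name, not an existing or proposed API, and it returns a str rather than changing what repr() does.

```python
import ast


def hex_escaped(data: bytes) -> str:
    """Return a bytes literal in which every byte is written as \\xNN."""
    return "b'" + "".join("\\x{:02x}".format(byte) for byte in data) + "'"


shown = hex_escaped(b'Hello, World!')
print(shown)

# The result is still a valid bytes literal, so it evaluates back to the
# original object.
restored = ast.literal_eval(shown)
print(restored == b'Hello, World!')
```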
It strikes me that we should have both asciify and hexlify (or whatever we call them) so people can be explicit when debugging; the question then becomes which one repr calls. At which point it really is just a question of which group of developers (those working on HTTP/2.0 or those working on HTTP/1.1, for example) get to be "lazy" instead of explicit in their debugging.

The argument in favor of adding "asciify" and switching repr to hex is that the hex representation is more purist.

The argument in favor of adding "hexlify" and leaving repr alone is that it makes Python 3.6 do the same thing as 3.0-3.5, and in fact 1.0-2.7 as well; people have had a few decades to get used to being lazy with mostly-ASCII protocols, while other people have had a few decades to get used to being explicit with pure-binary protocols.

But maybe there's another potential concern that can help decide. A lot of novices using bytes get confused when they see b'\x05Hello' and ask questions about how to deal with that 9-character string. (You can see them all over Stack Overflow, for example.) Of course the same people also ask how to get the b out of their string, etc.; obviously they need to be taught the difference between a bytes object and its repr no matter what. Would switching repr to the hex form by default help those people by forcing them to confront their confusion early, or slow them down by not letting them write a lot of simple code and learn other important stuff before getting to that confusion? I think that the answer to that might be as compelling as the answer to which group of experienced developers (where the groups often overlap) deserves to be allowed to be lazy. But I don't have the answer...
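The difference between a bytes object and its repr can be shown directly. A minimal sketch, with illustrative variable names:

```python
data = b'\x05Hello'

# The object holds six bytes: one control byte followed by five letters.
print(len(data))
print(data[0])      # indexing yields an integer, here 5

# repr() produces the text shown at the prompt, which is a different
# and longer thing: it includes the b prefix, the quotes and the escape.
shown = repr(data)
print(shown)
print(len(shown))
```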
...
Here's the prose description of my counter-proposal: add a method to
the bytes object called `.asciify`, that returns a printable
representation of the bytes, where bytes mapping to printable ASCII
characters are displayed as ASCII characters, and the remainder are
given as hex codes. That is, .asciify() should round-trip a bytes
literal. This frees up repr() to do what universally makes sense on a
series of bytes: state the bytes!
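A rough sketch of what such a method might return, written as a free function because bytes has no .asciify method. The name asciify and its exact escaping rules are assumptions made for illustration. Backslashes and single quotes are escaped here so that the result stays a valid bytes literal.

```python
import ast


def asciify(data: bytes) -> str:
    """Show printable ASCII bytes as characters and the rest as \\xNN."""
    parts = []
    for byte in data:
        ch = chr(byte)
        if ch in ("\\", "'"):
            # Escape these so the literal can be evaluated back.
            parts.append("\\" + ch)
        elif 0x20 <= byte < 0x7F:
            parts.append(ch)
        else:
            parts.append("\\x{:02x}".format(byte))
    return "b'" + "".join(parts) + "'"


print(asciify(b'Hello, World!'))
print(asciify(b'\x05Hello'))

# Round trip: evaluating the displayed literal gives the same bytes back.
sample = b"it's \\ \xff"
print(ast.literal_eval(asciify(sample)) == sample)
```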
Marc-Andre Lemburg said:
...
A definite -1 from me on making repr(b"Hello World") harder to read than necessary.
Okay, but a definite -1e6 from me on making my Python interpreter do this:
...
...
...
>>> import struct
>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12,
...     1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> my_packed_bytes
b'Why, Guido? Why?'
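The same effect can be reproduced in the other direction. This sketch assumes little-endian packing and uses the explicit '<ffff' format so it does not depend on the machine's native byte order; the variable names are illustrative.

```python
import binascii
import struct

text_like = b'Why, Guido? Why?'

# '<ffff' means four 32-bit floats, explicitly little-endian.
floats = struct.unpack('<ffff', text_like)
repacked = struct.pack('<ffff', *floats)

print(floats)                      # four ordinary-looking float values
print(repacked)                    # the same 16 bytes, displayed as ASCII
print(binascii.hexlify(repacked))  # the same 16 bytes as hex digits
```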
I do understand the utility of peering into ASCII text, but as Cory
Benfield stated earlier:
...
I'm saying that I don't get to do debugging with a simple
print statement when using the bytes type to do actual binary work,
while those who are doing sort-of binary work do.
Does the inconvenience of having to explicitly call the .asciify()
method on a bytes object justify the current behavior for repr() on a
bytes object? The privilege of being lazy is obstructing the right to
see what we've actually got in the bytes object, and is jeopardizing
the very argument that "bytes are not strings".
On Wed, Sep 10, 2014 at 10:51 AM, Cory Benfield  wrote:
...
On 10 September 2014 17:59, Stephen J. Turnbull  wrote:
...
So does 0xDEADBEEF, but actually that's *not* text, it's a 32-bit
pointer, conveniently invalid on most 32-bit architectures and very
obvious when it shows up in a backtrace.  Do you see an impedance
mismatch in the C community because of that?
In fact, *all* bytes "look like text", because *you can't see them
until they're converted to text by repr()*!  This is the key to the
putative "impedance mismatch" -- it's perceived as such when people
don't distinguish the map from the territory.
I apologise, I was insufficiently clear. I mean that interaction with
the bytes type in Python has a lot of textual aspects to it. This is a
*deliberate* decision (or at least the documentation makes it seem
deliberate), and I can understand the rationale, but it's hard to be
surprised that it leads developers astray.
Also, while I'm being picky, 0xDEADBEEF is not a 32-bit pointer, it's
a 32-bit something. Its type is undefined in that expression. It has a
standard usage as a guard word, but still, let's not jump to
conclusions here!
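The point that the same 32 bits carry no type of their own can be illustrated with struct. This is only a sketch; big-endian order is chosen arbitrarily and the variable names are illustrative.

```python
import struct

raw = struct.pack('>I', 0xDEADBEEF)   # four bytes, big-endian

# The same four bytes read back under three different interpretations.
as_unsigned = struct.unpack('>I', raw)[0]
as_signed = struct.unpack('>i', raw)[0]
as_float = struct.unpack('>f', raw)[0]

print(raw)
print(as_unsigned, as_signed, as_float)
```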
I accept your core point, however, which I consider to be this:
...
The issue that sometimes it's easier to read hex than ASCII mixed with
other stuff (hex escapes or Latin-1) is true enough, though.  But it's
not about an impedance mismatch, it's a question of what does *this*
developer consider to be the convenient repr for *that* task.
This is definitely true, which I believe I've already admitted in this
thread. I do happen to believe that having it be hex would provide a
better pedagogical position ("you know this isn't text because it
looks like gibberish!"), but that ship sailed a long time ago.
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Andrew Barnert