
Why did the CPython core developers decide to force the display of ASCII characters in the printable representation of bytes objects in CPython 3? For example:
>>> import struct
>>> # In go bytes for four floats:
>>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12, 1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
>>> # And out comes a speciously human-readable representation of those bytes
>>> my_packed_bytes
b'Why, Guido? Why?'
>>>
>>> # But it's just an illusion; it's truly bytes underneath!
>>> a_reasonable_representation = bytes((0x57, 0x68, 0x79, 0x2c, 0x20, 0x47, 0x75, 0x69, 0x64, 0x6f, 0x3f, 0x20, 0x57, 0x68, 0x79, 0x3f))
>>> my_packed_bytes == a_reasonable_representation
True
>>>
>>> this_also_seems_reasonable = b'\x57\x68\x79\x2c\x20\x47\x75\x69\x64\x6f\x3f\x20\x57\x68\x79\x3f'
>>> my_packed_bytes == this_also_seems_reasonable
True
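The reverse direction makes the same point: bytes that read like text can be unpacked as four floats. This is a minimal sketch, and it assumes an explicit little-endian format (`'<4f'`) so the result does not depend on the machine, whereas the session above uses the native-order `'ffff'`.

```python
import struct

# Assumption: '<4f' pins the byte order to little-endian; the example in the
# post uses the native-order format 'ffff'.
text_like = b'Why, Guido? Why?'

# Interpret the same 16 bytes as four 32-bit floats.
as_floats = struct.unpack('<4f', text_like)

# Packing those floats again reproduces the original bytes exactly.
round_trip = struct.pack('<4f', *as_floats)

print(as_floats)
print(round_trip == text_like)
```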
I understand bytes literals were brought in to Python 3 to aid the transition from Python 2 to Python 3 [1], but this did not imply that `repr()` on a bytes object ought to display bytes mapping to ASCII characters as ASCII characters. I have not yet found a PEP describing why this decision was made. I am now seeking to put forth a PEP to change the printable representation of bytes to be simple, consistent, and easy to understand.
The current behavior, printing elements of bytes that map to printable ASCII characters as those characters, seems to violate multiple tenets of the Zen of Python [2]:
* "Explicit is better than implicit." This display happens without the user's explicit request.
* "Simple is better than complex." The printable representation of bytes is complex, surprising, and unintuitive: elements of bytes shall be displayed as their hexadecimal value, unless such a value maps to a printable ASCII character, in which case the character shall be displayed instead of the hexadecimal value. The underlying values of each element, however, are always integers, and the printable representation of a single element of bytes will always be an integer representation. The simple thing is to show the hex value for every byte, unconditionally.
* "Special cases aren't special enough to break the rules." Implicit decoding of bytes to ASCII characters comes in handy only some of the time.
* "In the face of ambiguity, refuse the temptation to guess." Python is guessing that I want to see some bytes as ASCII characters. In the example above, though, what I gave Python was bytes from four floating point numbers.
* "There should be one-- and preferably only one --obvious way to do it." `bytes.decode('ascii', errors='backslashreplace')` already provides users the means to display ASCII characters among bytes, as a real string.
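To make the last point concrete, here is a small sketch contrasting the implicit mixed display with the explicit request for an ASCII view. It assumes Python 3.5 or later, where the `'backslashreplace'` error handler is accepted when decoding; the sample bytes are made up for illustration.

```python
# Assumption: Python 3.5+, where 'backslashreplace' works for decoding.
# Three printable ASCII bytes followed by two bytes outside the ASCII range.
data = bytes([0x57, 0x68, 0x79, 0xff, 0xfe])

# repr() mixes printable ASCII characters with \x escapes, unasked.
mixed = repr(data)

# Explicitly requesting the ASCII view yields a real str instead.
as_text = data.decode('ascii', errors='backslashreplace')

print(mixed)
print(as_text)
```

Both displays show the same characters, but the second is the result of an explicit request and is a `str`, not a view of a bytes object.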
To be fair, there are two tenets of the Zen of Python that support the implicit display of ASCII characters in bytes:
* "Readability counts."
* "Although practicality beats purity."
In counterargument, though, I would say that the extra readability and practicality are boosted only in special cases (which are not special enough).
Much ado was (and continues to be) raised over Python 3 enforcing the distinction between (Unicode) strings and bytes. A lot of this resentment comes from Python programmers who do not yet appreciate the difference between bytes and text†, or from those who remain apathetic and prefer Python 2's it-works-'til-it-doesn't strings. This implicit displaying of ASCII characters in bytes ends up conflating the two data types even more deeply in novice programmers' minds.
In the example above, `my_packed_bytes` looks like a string. It reads like a string. But it is not a string. The ASCII characters are a lie, as evidenced when trying to access elements of a bytes instance:
>>> b'Why, Guido? Why?'[0]
87
>>> # Oh, perhaps you were expecting b'W'?
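For completeness, a short sketch of how element access behaves: indexing yields an integer, while a one-element slice yields a bytes object. The variable names are illustrative only.

```python
data = b'Why, Guido? Why?'

# Indexing a bytes object returns an int, not a length-1 bytes object.
first_as_int = data[0]

# A one-element slice is what returns bytes.
first_as_bytes = data[0:1]

# Iterating over bytes also yields ints.
as_ints = list(data)

print(first_as_int)
print(first_as_bytes)
print(as_ints[:3])
```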
I find this behavior harmful to Python 3 advocacy, and novices and those accustomed to Python 2 find this yet another deterrent in the way of Python 3 adoption.
I would like to gauge the feasibility of a PEP to change the printable representation of bytes in CPython 3 to display all elements by their hexadecimal values, and only by their hexadecimal values.
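As a rough illustration of what such a representation might look like, the sketch below uses a hypothetical helper, `hex_repr`, which is not part of CPython and is only one possible format. It also checks that the result is still a valid bytes literal.

```python
import ast


def hex_repr(data: bytes) -> str:
    """Hypothetical helper: show every byte as a \\xNN escape, unconditionally."""
    return "b'" + ''.join('\\x{:02x}'.format(byte) for byte in data) + "'"


sample = b'Why?'
shown = hex_repr(sample)

# The all-hex form is still a valid bytes literal and evaluates to the same value.
parsed = ast.literal_eval(shown)

print(shown)
print(parsed == sample)
```

Python 3.5 and later also offer `bytes.hex()`, which returns the bare hex digits as a `str` without the literal syntax.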
Thanks, Chris L.
† I write this as someone who, himself, neither appreciated nor understood the difference between bytes, strings, and Unicode. I have Ned Batchelder [3] and his illuminating "Pragmatic Unicode" presentation [4] to thank for getting me on the right path.
[1]: http://legacy.python.org/dev/peps/pep-3112/#rationale
[2]: http://legacy.python.org/dev/peps/pep-0020/
[3]: http://nedbatchelder.com/
[4]: http://nedbatchelder.com/text/unipain.html