[Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Wed Sep 10 09:04:23 CEST 2014

Why did the CPython core developers decide to force the display of
ASCII characters in the printable representation of bytes objects in
CPython 3? For example

    >>> import struct
    >>> # In go bytes for four floats:
    >>> my_packed_bytes = struct.pack('ffff', 3.544294848931151e-12,
    ...     1.853266900760489e+25, 1.6215185358725202e-19, 0.9742483496665955)
    >>> # And out comes a speciously human-readable representation of
    >>> # those bytes
    >>> my_packed_bytes
    b'Why, Guido? Why?'
    >>>
    >>> # But it's just an illusion; it's truly bytes underneath!
    >>> a_reasonable_representation = bytes((0x57, 0x68, 0x79, 0x2c,
    ...     0x20, 0x47, 0x75, 0x69, 0x64, 0x6f, 0x3f, 0x20, 0x57, 0x68, 0x79,
    ...     0x3f))
    >>> my_packed_bytes == a_reasonable_representation
    True
    >>>
    >>> this_also_seems_reasonable = (
    ...     b'\x57\x68\x79\x2c\x20\x47\x75\x69'
    ...     b'\x64\x6f\x3f\x20\x57\x68\x79\x3f')
    >>> my_packed_bytes == this_also_seems_reasonable
    True

I understand bytes literals were brought in to Python 3 to aid the
transition from Python 2 to Python 3 [1], but this did not imply that
`repr()` on a bytes object ought to display bytes mapping to ASCII
characters as ASCII characters. I have not yet found a PEP describing
why this decision was made. I am now seeking to put forth a PEP to
change the printable representation of bytes to be simple, consistent,
and easy to understand.

The current behavior of printing elements of bytes that map to
printable ASCII characters as those characters seems to violate
multiple tenets of the Zen of Python [2]:

* "Explicit is better than implicit." This display happens without the
user's explicit request.
* "Simple is better than complex." The printable representation of
bytes is complex, surprising, and unintuitive: Elements of bytes shall
be displayed as their hexadecimal value, unless such a value maps to a
printable ASCII character, in which case, the character shall be
displayed instead of the hexadecimal value. The underlying value of
each element, however, is always an integer, and the printable
representation of a single element of a bytes object is always an
integer representation. The simple thing is to show the hex value for
every byte, unconditionally.
* "Special cases aren't special enough to break the rules." Implicit
decoding of bytes to ASCII characters comes in handy only some of the
time.
* "In the face of ambiguity, refuse the temptation to guess." Python
is guessing that I want to see some bytes as ASCII characters. In the
example above, though, what I gave Python was bytes from four floating
point numbers.
* "There should be one-- and preferably only one --obvious way to do
it." `bytes.decode('ascii', errors='backslashreplace')` already
provides users the means to display ASCII characters among bytes, as a
real string.
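
As a rough illustration of the points above, the following sketch
contrasts the implicit mixed display from `repr()` with an explicit
request for ASCII text. The `sample` value is made up for this example,
and it assumes Python 3.5 or later, where the `backslashreplace` error
handler also works for decoding:

```python
# A made-up sample: four printable ASCII bytes followed by two
# bytes that have no ASCII meaning.
sample = b'Why?\xfe\xff'

# Implicit: repr() shows printable ASCII as characters and
# everything else as \x escapes, without being asked.
current_repr = repr(sample)

# Explicit: the caller asks for ASCII text and gets a real str,
# with undecodable bytes replaced by backslash escapes.
explicit_text = sample.decode('ascii', errors='backslashreplace')

print(current_repr)         # b'Why?\xfe\xff'
print(explicit_text)        # Why?\xfe\xff
print(type(explicit_text))  # <class 'str'>
```

The two displays look alike, but only the second one is a string that
the user explicitly asked for.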

To be fair, there are two tenets of the Zen of Python that support
the implicit display of ASCII characters in bytes:

* "Readability counts."
* "Although practicality beats purity."

In counterargument, though, I would say that the extra readability and
practicality are only boosted in special cases (which are not special
enough).

Much ado was (and continues to be) raised over Python 3 enforcing the
distinction between (Unicode) strings and bytes. A lot of this
resentment comes from Python programmers who do not yet appreciate the
difference between bytes and text†, or from those who remain apathetic
and prefer Python 2's it-works-'til-it-doesn't strings. This implicit
displaying of ASCII characters in bytes ends up conflating the two
data types even deeper in novice programmers' minds.

In the example above, `my_packed_bytes` looks like a string. It reads
like a string. But it is not a string. The ASCII characters are a lie,
as evidenced when trying to access elements of a bytes instance:

    >>> b'Why, Guido? Why?'[0]
    87
    >>> # Oh, perhaps you were expecting b'W'?
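
A short sketch of the same point outside the REPL: indexing a bytes
object yields an integer, while slicing yields another bytes object.
The variable names here are chosen only for illustration:

```python
data = b'Why, Guido? Why?'

first_element = data[0]    # indexing gives an int
first_slice = data[0:1]    # slicing gives a bytes object of length 1
as_integers = list(data)   # iterating also gives ints

print(first_element)       # 87
print(first_slice)         # b'W'
print(as_integers[:4])     # [87, 104, 121, 44]
```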

I find this behavior harmful to Python 3 advocacy, and novices and
those accustomed to Python 2 find this yet another deterrent in the
way of Python 3 adoption.

I would like to gauge the feasibility of a PEP to change the printable
representation of bytes in CPython 3 to display all elements by their
hexadecimal values, and only by their hexadecimal values.
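
To make the idea concrete, here is a minimal sketch of what such a
representation could look like. The helper name `hex_only_repr` is
hypothetical and not part of CPython; it only approximates the proposed
output in pure Python:

```python
import ast


def hex_only_repr(data):
    """Hypothetical helper: render every byte as a \\xNN escape."""
    return "b'" + ''.join('\\x%02x' % byte for byte in data) + "'"


data = b'Why, Guido? Why?'
shown = hex_only_repr(data)
print(shown)

# The result is still a valid bytes literal, so it round-trips.
round_tripped = ast.literal_eval(shown)
print(round_tripped == data)  # True
```

On Python 3.5 and later, `bytes.hex()` offers a related hex-only view,
although it returns a plain string of hex digits rather than a bytes
literal.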

Thanks,
Chris L.

† I write this as someone who, himself, neither appreciated nor
understood the difference between bytes, strings, and Unicode. I have
Ned Batchelder [3] and his illuminating "Pragmatic Unicode"
presentation [4] to thank for getting me on the right path.

  [1]: http://legacy.python.org/dev/peps/pep-3112/#rationale
  [2]: http://legacy.python.org/dev/peps/pep-0020/
  [3]: http://nedbatchelder.com/
  [4]: http://nedbatchelder.com/text/unipain.html