On 11 September 2014 09:23, Chris Lasher chris.lasher@gmail.com wrote:
On Wed, Sep 10, 2014 at 3:09 PM, Nick Coghlan ncoghlan@gmail.com wrote:
In Python 3, "bytes" is still a hybrid type that can hold:
- arbitrary binary data
- binary data that contains ASCII segments
Let me be clear. Here are things this proposal does NOT include:
- Removing string-like methods from bytes
- Removing ASCII from bytes literals
Those have proven incredibly useful to the Python community. I appreciate that. This proposal does not take these behaviors away from bytes.
Here's what my proposal DOES include:
- Adjust the behavior of repr() on a bytes instance such that only
hexadecimal codes appear. The returned value would be the text displaying the bytes literal of hexadecimal codes that would reproduce the bytes instance.
This is not an acceptable change, for two reasons:
1. It's a *major* compatibility break. It breaks single source Python 2/3 development, it breaks doctests, it breaks user expectations.
2. It breaks the symmetry between the bytes literal format and their representation.
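To make the doctest point concrete, here is a minimal sketch of a doctest whose expected output is the current hybrid repr. The function `header_magic` is made up purely for illustration. Under a hex-only repr the expected line would no longer match, and the example would fail.

```python
import doctest


def header_magic(data):
    r"""Return the first four bytes of a file header.

    >>> header_magic(b"\x89PNG\r\n")
    b'\x89PNG'
    """
    return data[:4]


# Run only this docstring's examples; any failure is reported on stdout.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    header_magic.__doc__, {"header_magic": header_magic}, "header_magic", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)
```

Any project with doctests like this one, or with logs and error messages compared against the repr, would be affected in the same way.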
It's important to remember we changed *from* a pure binary representation back to the current hybrid representation. It's not an accident or oversight, it's a deliberate design choice, and the reasons driving that original decision haven't changed in the last 8+ years.
- Provide a method (suggested: "bytes.asciify") that returns a printable
representation of bytes that replaces bytes whose values map to printable ASCII glyphs with the glyphs. The returned value would be the text displaying the bytes literal of ASCII glyphs and hexadecimal codes that would reproduce the bytes instance. If you liked the behavior of repr() on bytes in Python 3.0 through 3.4 (or 3.5), it's still available via this method call!
Except that method call won't be available in Python 2 code, and thus not usable in single source Python 2/3 code bases. That's still an incredibly important environment for people to be able to program in, and we're generally aiming to make the common subset *bigger* in Python 3.5 (e.g. by adding bytes.__mod__), not smaller.
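As an illustration of the kind of common subset meant here, the sketch below uses percent-formatting on bytes. It assumes either Python 2 (where `b"..."` is `str`) or a Python 3 interpreter that provides `bytes.__mod__`; the header values are arbitrary sample data.

```python
# Percent-formatting on bytes, written once for Python 2 and Python 3
# (assumes an interpreter where bytes supports the % operator).
status_line = b"%s %d %s\r\n" % (b"HTTP/1.1", 200, b"OK")
print(status_line)

length_header = b"Content-Length: %d\r\n" % 42
print(length_header)
```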
- Optionally, provide a method (suggested: "bytes.hexlify") which
implements the code for creating the printable representation of the bytes with hexadecimal values only, and call this method in bytes.__repr__.
As per the discussion on issue 9951, it is likely Python 3.5 will offer bytes.hex() and bytearray.hex() methods (and perhaps even memoryview.hex()).
I have also filed issue 22385 to propose allowing the "x" and "X" string formatting characters (for str.format and the format builtin) to accept arbitrary bytes-like objects.
*Additive* changes like that to make it easier to work with pure binary data are relatively non-controversial (although there may still be some argument over *which* of those changes are worth including).
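For reference, a pure hexadecimal view is already reachable with existing tools. The sketch below uses only the standard library; `data` is just sample input, and the `bytes.hex()` call is guarded because it only exists on interpreters that provide that method.

```python
import binascii

data = b"abc\x00\xff"

# binascii.hexlify works on both Python 2 and Python 3 and returns bytes.
hexlified = binascii.hexlify(data)
print(hexlified)

# Per-byte formatting with the "x" format code.
# (Python 3 only: iterating over bytes yields ints.)
per_byte = " ".join(format(byte, "02x") for byte in data)
print(per_byte)

# bytes.hex(), where the interpreter offers it.
if hasattr(data, "hex"):
    print(data.hex())
```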
What you haven't said so far, however, and what I still don't know, is whether or not the core team has already tried providing a method on bytes objects à la the proposed .asciify() for projecting bytes as ASCII characters, and rejected that on the basis of it being too inconvenient for the vast majority of Python use cases.
That option was never really on the table, as once we decided to switch back to a hybrid ASCII representation, the obvious design model to use was the Python 2 str type, which has inherently hybrid behaviour, and uses the literal form for the "obj == eval(repr(obj))" round trip.
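The round trip can be checked directly. This is a small sketch with an arbitrary sample payload that mixes ASCII segments with bytes that have no printable glyph:

```python
# Mixed payload: ASCII text plus bytes with no printable ASCII glyph.
payload = b"GET /\x00\xff\r\n"

text = repr(payload)
print(text)

# The repr is itself a valid bytes literal, so evaluating it gives back
# an object equal to the original.
assert eval(text) == payload
```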
Did the core team try this before deciding that repr() should automatically write printable ASCII characters in place of hex values for bytes?
So far, I've heard a lot of requests to keep the behavior because it's convenient. But how inconvenient is it to call bytes.asciify()? Are those not in favor of changing the behavior of repr() really going to sit behind the argument that the effort expended in typing ten more characters ought to guarantee that thousands of other programmers are going to have to figure out why there's letters in their bytes – or rather, how there's actually NOT letters in their bytes?
No, we're not keeping it because it's convenient, we're keeping it because changing it would be a major compatibility break for (at best) a small reduction in beginner confusion. This change simply wouldn't provide sufficient benefit to justify the massive scale of the disruption it would cause.
By contrast, adding better *binary* representation tools is easy (they pose no backwards compatibility challenges), and hence the preferred choice. When teaching beginners, explaining the difference between:
    >>> b"abc"
    b'abc'
    >>> b"abc".hex()
    '616263'

is likely to be pretty straightforward (and will teach them the relevant concept of ASCII based vs hexadecimal representations for binary data).
Consider the proposed alternative, which is to instead have to explain:
    >>> b"abc"
    b'\x61\x62\x63'
    >>> b"abc".hex()
    '616263'
    >>> b"abc".ascii()
    'abc'
That's 3 different representations when there are only two underlying concepts to be learned.
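The comparison can be tried on a current interpreter by emulating the proposed behaviour. In this sketch `hex_only_repr` is a hypothetical helper written for illustration, not an existing or planned API, and the code assumes Python 3 with `bytes.hex()` available.

```python
def hex_only_repr(data):
    """Hypothetical: emulate the proposed hex-only repr() for bytes."""
    # Python 3 only: iterating over bytes yields ints.
    return "b'" + "".join("\\x%02x" % byte for byte in data) + "'"


data = b"abc"
print(repr(data))           # current hybrid representation
print(hex_only_repr(data))  # proposed hex-only representation (emulated)
print(data.hex())           # plain hexadecimal digits

# Both literal forms evaluate back to the same object.
assert eval(repr(data)) == eval(hex_only_repr(data)) == data
```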
And once again, we are talking about changing behavior that is unspecified by the Python 3 language specification.
Something being underspecified in the language specification doesn't mean we have free rein to change it on a whim - sometimes it just means there's an assumed detail that hasn't been explicitly stated, but implementors of alternative implementations hadn't previously commented on the omission because they just followed the behaviour of CPython as the reference interpreter, or the requirements of the regression test suite.
It's really necessary to look at the regression test suite, along with the written specification, as things that aren't part of the language spec are marked as "CPython only". Cases where it's CPython that is out of line when other interpreter implementations discover a compatibility issue get filed as CPython bugs (like the one where we sometimes get the operand precedence wrong if both sequences in a binary concatenation operation are implemented in C and the sequences are of different types).
In this case, the underspecification relates to the fact that for builtin types that have dedicated syntax, the expectation is that their repr will use that dedicated syntax. This is not currently stated explicitly in the language reference (and I agree it probably should be), but it's tested extensively by the regression test suite, so it becomes a backwards compatibility constraint and an alternative interpreter compatibility constraint.
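A rough sketch of that expectation follows. It is only an illustration, not an excerpt from the regression test suite, and the sample values are arbitrary.

```python
import ast

# Builtin types that have dedicated literal syntax.
samples = [
    b"abc\x00", "text", 42, 1.5,
    [1, 2], (1, "a"), {"k": b"v"}, {1, 2},
    None, True,
]

for obj in samples:
    # ast.literal_eval only accepts literal syntax, so this passing means
    # repr() produced the type's dedicated literal form.
    assert ast.literal_eval(repr(obj)) == obj
```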
The language is gaining a reputation for confusing the two
It isn't "gaining" that reputation, it has always had it. That reputation is actually *diminishing* over time, as we spend more time working with other implementations like PyPy, Jython and IronPython to get the CPython implementation details marked appropriately.
(C)Python itself hasn't changed in this regard - we're just starting to do a better job of getting the wildly divergent groups of users actually talking to each other (with occasional fireworks as people have to come to grips with some radically different viewpoints on the nature and purpose of software development).
In particular, we're starting to see folks that had previously focused almost entirely on the application programming and network service development side of Python (which tends to heavily abstract away the C layer) start to learn more about the system orchestration, hardware automation and scientific programming side of Python that lets you dive as deeply into the machine internals as you like.
Most language runtimes only let you handle one or the other of those categories well - CPython is a relatively rare breed in supporting both, which *does* have consequences that make many of our design decisions seem weird to folks that aren't looking at *all* the use cases for the language in general, and the CPython runtime in particular.
however, as written by Armin Ronacher [1]:
Python is definitely a language that is not perfect. However I think what frustrates me about the language are largely problems that have to do with tiny details in the interpreter and less the language itself. These interpreter details however are becoming part of the language and this is why they are important.
I feel passionately this implicit ASCII-translation behavior should not propagate into further releases CPython 3, and I don't want to see it become a de facto specification due to calcification.
It's not a de facto specification; it's a deliberate design choice, made before Python 3.0 was even released, and captured by the regression test suite.
We're talking about the next 10 to 15 years. Nobody guaranteed the behavior of repr() so far. With the bytes.asciify() method (or whatever it may be called), we have a fair compromise, plus a more explicit specification of behavior of bytes in Python 3.
Lots of folks don't like the fact that CPython doesn't completely hide the underlying memory model of C from the user - it's a deliberately leaky abstraction. The approach certainly has its downsides, but that leaky abstraction is what allows people to be confident that they can use Python as a convenient orchestration language, knowing that we will have easy access to the kind of low level control offered by C (and other systems programming languages) if we need it. This is why the scientific Python stack currently works best on CPython, with the ports to PyPy, Jython and IronPython (which all abstract away the C layer far more heavily) at varying stages of maturity - it's simply harder to do array oriented programming in those environments, since the language runtimes weren't built with that use case in mind (neither was CPython, but the relatively close coupling to the C layer enabled the capability anyway).
Computers are complicated layers of messy and leaky abstractions. Working too hard at hiding those layers from the user just means developers can't bypass the abstraction easily when they know what they need for their current use case better than the original author of the language runtime.
Regards, Nick.