P.J. Eby writes:
In Kagoshima, you'd pass in an ebytes with your encoding to a stdlib API, and *get back an ebytes with the right encoding*, rather than an (incorrect and useless) unicode object which has lost data you need.
How does the stdlib do that, unless it guesses which encoding for Japanese is being used? And even if this ebytes uses Shift JIS, what makes that the "right" encoding for anything? On the other hand, I know when *I* need some encoding, and when I figure it out I will store it in an appropriate place in my program.

The problem is that for some programs it is not unlikely that I will see Japanese encoded in all of Shift JIS, EUC-JP, ISO-2022-JP, UTF-8, and UTF-16, and on a very bad day in RFC 2047, GB 2312, and Big5, too. It's not totally unlikely for a browser to send a URL to a server that expects UTF-8 in order to retrieve a message/rfc822 object containing ISO-2022-JP in the mail headers and EUC-JP in the body. So I need to know which encoding was used by the server that sent the reply, but an ebytes can't tell me that when what I fish out of the message body is a URL in EUC-JP. I need to convert that URL to UTF-8, or most servers will 404.
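For concreteness, this is roughly what I end up doing by hand today; the two kanji and the path are just stand-ins for whatever I fish out of the body, and the encoding names are mine because *I* had to figure them out from the MIME headers:

    from urllib.parse import quote

    # A stand-in for a URL path found in an EUC-JP message body.
    raw = '/wiki/\u5f8c\u85e4'.encode('euc-jp')

    text = raw.decode('euc-jp')     # I supply 'euc-jp'; nothing in the bytes says so
    path = quote(text, safe='/')    # percent-encoded UTF-8, which the server expects
    # path == '/wiki/%E5%BE%8C%E8%97%A4'

An ebytes could carry the label 'euc-jp' around for me, but it can't discover that label, and discovering it is the whole problem.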
But this is not the case at all, for use cases where "no, really, you *have to* work with bytes-encoded text streams". The mere release of Python 3.x will not cause all the world's applications, libraries, and protocols to suddenly work with unicode, where they did not before.
Sure. That's what .encode() and .decode() are for. The problem is what to do when you don't know what to put in the parentheses, and I can't think of a use case offhand where ebytes(stuff, 'garbage') does better than a PEP 383-enabled str.
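To be clear about what I mean by "PEP 383-enabled str", here is a minimal sketch (the bytes happen to be Shift JIS, but the point is that the code never has to know that):

    # 日本 in Shift JIS, but pretend we have no idea what these bytes are.
    mystery = b'\x93\xfa\x96\x7b'

    s = mystery.decode('utf-8', errors='surrogateescape')          # never raises
    assert s.encode('utf-8', errors='surrogateescape') == mystery  # lossless round trip

That gives me the same "don't lose the user's bytes" guarantee that ebytes(stuff, 'garbage') is supposed to give, without pretending that 'garbage' is an encoding.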
Being explicit about the encoding of the bytes you're flinging around is actually an *increase* in specificity, explicitness, robustness, and error-checking ability over the status quo for either 2.x *or* 3.x... *and* it improves these qualities for essentially *all* string-handling code, without requiring that code to be rewritten to do so.
A well-spoken piece. But, you see, most of those encodings are *only* interesting so that you can transcode characters to the encoding of interest. What's the e.o.i.? If you're lucky, it is easily found in the context or has an obvious default; otherwise it's a hard problem that ebytes does nothing to help solve, as far as I can see. Cf. Robert Collins' post <AANLkTinQ_d_vaHBw5IKUYY9qgjqOfFy4XCzC0DYztr9n@mail.gmail.com>, where he makes it quite explicit that a bytes interface is all about punting in the face of missing encoding information.
and (2) you really want this under the control of higher-level objects that have access to some knowledge of the environment, rather than at the lowest level.
This proposal actually has such a higher-level object: an ebytes.
I don't see how that can be true. An ebytes is a very low-level object that has no idea whether its encoding is interesting (e.g., the one that an RFC or a server specifies) or merely a technical detail that is of use only until the ebytes is decoded and can then be thrown away. I just don't see, in the case where there is a real encoding in the ebytes, what harm is done by decoding the ebytes to str. If context indicates that the encoding is an interesting one (e.g., it should be the default encoding for output), then you want to save it in an appropriate place that preserves not just the encoding itself, but also the context that gives it its importance.
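Just to make "an appropriate place" concrete, here is a hypothetical sketch (the class and attribute names are mine, not part of any proposal): the text lives in a plain str, and the encoding is stored next to the context that makes it interesting, instead of being welded onto the bytes.

    # Hypothetical sketch: decoded text plus the encoding *and* the reason
    # that encoding matters, kept by a higher-level object.
    class LabeledText:
        def __init__(self, text, encoding, why):
            self.text = text          # plain str, already decoded
            self.encoding = encoding  # e.g. 'iso-2022-jp'
            self.why = why            # e.g. 'charset of the RFC 2047 encoded-word'

        def bytes_for_output(self):
            # Only at the output boundary do we go back to bytes.
            return self.text.encode(self.encoding)

    subject = LabeledText('会議の議事録', 'iso-2022-jp',
                          'charset declared in the original Subject header')

Such an object knows *why* its encoding is the right default for output; an ebytes, sitting at the bottom of the stack, cannot.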