
Terry Reedy writes:
 > As far as I noticed, Ethan did not explain why he was extracting single bytes and comparing to a constant, so it is hard to know if he was even using them properly.
It doesn't really matter whether Ethan is using them properly. It's clear there are such uses, though I don't know how important they are, so we may as well assume Ethan's is one such.
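For concreteness, the pattern at issue is presumably something like the sketch below (Python 3; the frame constant and function name are my own invention, not Ethan's code):

    # In Python 3, indexing bytes yields an int, so the constant to
    # compare against is an int (in Python 2 it was a length-1 str).
    SOH = 0x01  # hypothetical start-of-frame byte for some wire protocol

    def is_framed(packet):
        # packet[0] is an int; compare it to an int constant ...
        return len(packet) > 0 and packet[0] == SOH
        # ... or slice instead, which yields bytes: packet[:1] == b'\x01'

    print(is_framed(b'\x01payload'))  # True
    print(is_framed(b'payload'))      # False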
 > > Japanese mail is transmitted via SMTP, and the control function "hello" is still spelled "EHLO" in Japanese mail.
 > I am not familiar with that control function, but if it is part of the SMTP protocol, it has nothing to do with the language of the payload.
Precisely my point. Therefore a payload represented as bytes should be treated as *uninterpreted* bytes, except where interpretations are defined for those bytes. This works for SMTP, because RFC 822 *deliberately* specifies headers to be encoded in ASCII (not "ASCII-compatible") in order that the payload (header) manipulations specified by RFC 821 and friends be guaranteed correct.

Nevertheless, people frequently request mail processing features that require manipulations of MIME part bodies and even plain RFC 822 message bodies. These cannot be guaranteed correct unless done by decoding and reencoding, but bytes-oriented manipulations generally "work" in monolingual contexts (or seem to, and any problems can always be blamed on MS Outlook). There are several such features that come up over and over again on Mailman lists and sometimes in the Python Email SIG, and I'm sure the same is true for web protocols.
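To make the risk concrete, here is a sketch of my own (not actual Mailman code) of one such feature, wrapping a body at a fixed width, done on the bytes versus on the decoded text; the UTF-8 sample is arbitrary:

    # Wrapping at a fixed count is harmless for pure ASCII but corrupts
    # multibyte encodings, because a split can land inside a character.
    body = 'héllo wörld, this is a test'.encode('utf-8')

    def wrap_bytes(payload, width=9):
        # bytes-oriented version: "works" in monolingual ASCII contexts
        return b'\n'.join(payload[i:i + width]
                          for i in range(0, len(payload), width))

    def wrap_text(payload, width=9, encoding='utf-8'):
        # decode, manipulate, reencode: a split cannot land inside a character
        text = payload.decode(encoding)
        return '\n'.join(text[i:i + width]
                         for i in range(0, len(text), width)).encode(encoding)

    wrap_text(body).decode('utf-8')   # round-trips cleanly
    wrap_bytes(body).decode('utf-8')  # raises UnicodeDecodeError (the
                                      # inserted newline lands inside the 'ö')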
 > > Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course.
 > When writing the html *text* body, sure. But I presume browsers decode encoded bytes to unicode *before* parsing the text. If so, it does not really matter that '<br>' gets encoded to b'<br>'.
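Only because the encodings in common use on the web are ASCII-compatible, which is itself an "interpretation defined for those bytes". A quick Python 3 check (the Farsi sample is my own):

    # The tag survives encoding only under ASCII-compatible codecs;
    # a bytes-level search quietly fails for anything else.
    page = '<p>سلام<br>دنیا</p>'  # Farsi "hello<br>world"

    print(b'<br>' in page.encode('utf-8'))   # True: UTF-8 is ASCII-compatible
    print(b'<br>' in page.encode('utf-16'))  # False: each ASCII char gains a zero byte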
HTML is not exclusively processed by browsers. It is often processed by servers and middleware that don't know they're speaking HTML, and according to several experts' testimony, they're in a freakin' hurry to push bytes out the door; there's no time for Unicode (decoding and encoding, OMG how inefficient!). Such developers want to write their libraries using bytes *and* literals that can be used both for binary protocols and for text protocols (urlparse seems to be the canonical example).

The convenience of using bytes in a string-like way (e.g., the b'' literal) in manipulating many binary protocols is clear. That convenience is just as great for people who are at substantial risk of mojibake if bytes are used to do text manipulations on the encoded form, as well as for people who face little risk (e.g., those who use only American English).

The question is how far to go with polymorphism, etc. I think that Nick's urlparse work gets the balance about right, and I see only danger in more string-like bytes (e.g., returning b'b' for b'bytes'[0]). OTOH, there are some changes that might be useful but seem very low-risk, such as a c'b' literal that means 98, not b'b'.
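For reference, the behavior being weighed here, as it stands in Python 3 (the c'' literal is only a proposal and does not exist):

    from urllib.parse import urlparse

    # The polymorphic urlparse: str in, str out; bytes in, bytes out.
    print(urlparse('http://example.com/path').path)   # '/path'
    print(urlparse(b'http://example.com/path').path)  # b'/path'

    # Indexing bytes yields an int, not a length-1 bytes object:
    print(b'bytes'[0])              # 98
    print(b'bytes'[0] == ord('b'))  # True
    # A c'b' literal, if added, would just be sugar for ord('b'), i.e. 98.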