[Python-ideas] Python 3.x and bytes

Sun May 22 17:46:20 CEST 2011

Terry Reedy writes:

 > As far as I noticed, Ethan did not explain why he was extracting single
 > bytes and comparing to a constant, so it is hard to know if he was even
 > using them properly.

It doesn't really matter whether Ethan is using them properly.  It's
clear there are such uses, though I don't know how important they are,
so we may as well assume Ethan's is one such.

 > > Japanese mail is transmitted via SMTP, and the control function
 > > "hello" is still spelled "EHLO" in Japanese mail.
 > 
 > I am not familiar with that control function, but if it is part of
 > the SMTP protocol, it has nothing to do with the language of the
 > payload.

Precisely my point.  Therefore a payload represented as bytes should
be treated as *uninterpreted* bytes, except where interpretations are
defined for those bytes.  This works for SMTP, because RFC 822
*deliberately* specifies headers to be encoded in ASCII (not
"ASCII-compatible") in order that the payload (header) manipulations
specified by RFC 821 and friends be guaranteed correct.
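
A minimal sketch of that division of labor, assuming CRLF line endings
and a single blank line between the header block and the body.  The
name split_message is made up for illustration; it is not part of any
library.  The header block is decoded strictly as ASCII, and the body
is left as uninterpreted bytes:

```python
def split_message(raw):
    """Split raw message bytes into (ASCII header text, body bytes)."""
    head, _sep, body = raw.partition(b"\r\n\r\n")
    # Strict decode: any non-ASCII byte in the header block raises
    # UnicodeDecodeError instead of being silently guessed at.
    return head.decode("ascii"), body


# Hypothetical message: ASCII header, body in some non-ASCII encoding.
message = b"Subject: test\r\n\r\n" + "こんにちは".encode("utf-8")
headers, body = split_message(message)
# headers is a str; body is still bytes, with no encoding assumed.
```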

Nevertheless, people frequently request mail processing features that
require manipulations of MIME part bodies and even plain RFC 822
message bodies.  These cannot be guaranteed correct unless done by
decoding and reencoding, but bytes-oriented manipulations generally
"work" in monolingual contexts (or seem to, and any problems can
always be blamed on MS Outlook).  There are several such features that
come up over and over again on Mailman lists and sometimes in the
Python Email SIG, and I'm sure the same is true for web protocols.
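
To illustrate how a bytes-oriented manipulation can "work" on ASCII
input and still corrupt other text, here is a small sketch.  The two
helper names are hypothetical, and Shift JIS is chosen only because
some of its two-byte characters end in 0x5C, the ASCII backslash:

```python
def slashes_via_bytes(raw):
    """Replace backslashes on the encoded form, ignoring the encoding."""
    return raw.replace(b"\\", b"/")


def slashes_via_text(raw, encoding):
    """Decode, replace backslashes on the text, then re-encode."""
    return raw.decode(encoding).replace("\\", "/").encode(encoding)


ascii_only = b"C:\\temp"
# This character encodes to two bytes; the second one is 0x5C.
japanese = "表".encode("shift_jis")

# ASCII-only input: both approaches agree.
same_for_ascii = (
    slashes_via_bytes(ascii_only) == slashes_via_text(ascii_only, "shift_jis")
)
# Shift JIS input: the bytes version rewrites half of a character,
# while the decode/re-encode version leaves it alone.
damaged = slashes_via_bytes(japanese) != japanese
intact = slashes_via_text(japanese, "shift_jis") == japanese
```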

 > > Farsi web pages are formatted by HTML, and the control
 > > function "new line" is spelled "<BR>" in Farsi, of course.
 > 
 > When writing the html *text* body, sure. But I presume browsers decode
 > encoded bytes to unicode *before* parsing the text. If so, it does not
 > really matter that '<br>' gets encoded to b'<br>'.

HTML is not exclusively processed by browsers.  It is often processed
by servers and middleware that don't know they're speaking HTML, and
according to several experts' testimony, they're in a freakin' hurry
to push bytes out the door; there's no time for Unicode (decoding and
encoding, OMG how inefficient!).

Such developers want to write their libraries using bytes *and*
literals that can be used both for binary protocols and for text
protocols (urlparse seems to be the canonical example).
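
As a rough illustration of that kind of polymorphism, assuming the
Python 3 spelling urllib.parse (3.2 or later): the same functions
accept either str or bytes and return the matching type, but refuse
to mix the two.

```python
from urllib.parse import urljoin, urlsplit

# Same function, same URL, two input types.
text_parts = urlsplit("http://example.com/a/b?x=1")
bytes_parts = urlsplit(b"http://example.com/a/b?x=1")
# text_parts.netloc is a str; bytes_parts.netloc is bytes.

# Mixing bytes and str in one call is rejected, not silently coerced.
try:
    urljoin(b"http://example.com/a/", "b")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```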

The convenience of using bytes in a string-like way (e.g., the b''
literal) in manipulating many binary protocols is clear.  That
convenience is just as great for people who are at substantial risk of
mojibake if bytes are used to do text manipulations on the encoded
form as it is for people who face little risk (e.g., those who use
only American English).

The question is how far to go with polymorphism, etc.  I think that
Nick's urlparse work gets the balance about right, and see only danger
in more stringlike bytes (e.g., by returning b'b' rather than 98 for
b'bytes'[0]).  OTOH, there are some changes that might be useful and
seem very low-risk, such as a c'b' literal that means 98, not b'b'.
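
For reference, a short sketch of the current Python 3 behavior behind
that suggestion.  The c'b' literal is only a proposal and does not
exist, so the constants below use existing spellings of the same
value; their names are invented for the example.

```python
data = b"bytes"

first = data[0]          # indexing bytes yields an int
first_slice = data[0:1]  # slicing bytes yields bytes

# Two existing ways to write "the byte value of b" as an int constant.
B_FROM_ORD = ord("b")
B_FROM_INDEX = b"b"[0]

# Comparing a single extracted byte against a constant:
starts_with_b = data[0] == B_FROM_ORD  # int compared with int
naive = data[0] == b"b"                # int compared with bytes: never equal
```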