
On 5/20/2011 1:44 AM, Stephen J. Turnbull wrote:
>> For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.
> For us, the convenience remains.
I understood the thrust of this thread to be that doing text manipulation with bytes sometimes bites -- because bytes are not text. Someone writing email or html bodies in Japanese or Farsi will not even try that, but will use str (unicode) and encode to bytes only when done, most likely transparently. As far as I noticed, Ethan did not explain why he was extracting single bytes and comparing them to a constant, so it is hard to know whether he was even using them properly.
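For example, a minimal, untested sketch of the kind of surprise I have in mind (the names here are made up -- I have not seen Ethan's actual code):

    >>> data = b'spam'
    >>> data[0] == b's'      # indexing bytes gives an int, not a 1-byte bytes
    False
    >>> data[0] == ord('s')
    True
    >>> data[0:1] == b's'    # slicing keeps the result as bytes
    True
    >>> body = 'こんにちは\n'        # compose the payload as str...
    >>> body.encode('utf-8')        # ...and encode only when done
    b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\n'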
> Japanese mail is transmitted via SMTP, and the control function "hello" is still spelled "EHLO" in Japanese mail.
I am not familiar with that control function, but if it is part of the SMTP protocol, it has nothing to do with the language of the payload. When programming a wire protocol that encodes abstract functions as ascii chars, the ascii-char representation of bytes is convenient. That is why it was chosen as the default.
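To make that concrete, a rough sketch (the host name is invented, and this is not how smtplib does it internally):

    >>> command = b'EHLO client.example.org\r\n'   # protocol token: ASCII bytes
    >>> command                                    # the repr stays readable
    b'EHLO client.example.org\r\n'
    >>> command.startswith(b'EHLO')                # protocol parsing stays in bytes
    True
    >>> subject = 'ターンブルさんへ'                # payload text stays str
    >>> subject.encode('utf-8')[:6]                # encoded only at the wire boundary
    b'\xe3\x82\xbf\xe3\x83\xbc'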
> Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course.
When writing the html *text* body, sure. But I presume browsers decode encoded bytes to unicode *before* parsing the text. If so, it does not really matter that '<br>' gets encoded to b'<br>'.
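A small sketch of what I mean (untested; the parser subclass and the Farsi snippet are just for illustration):

    from html.parser import HTMLParser

    page = '<p>سلام دنیا</p><br>'   # author works in str
    wire = page.encode('utf-8')      # '<br>' becomes b'<br>' on the wire

    class TagDumper(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print('tag:', tag)
        def handle_data(self, data):
            print('text:', data)

    # the browser-ish step: decode first, then parse the text
    TagDumper().feed(wire.decode('utf-8'))
    # prints: tag: p / text: سلام دنیا / tag: br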
> It's the pain that comes from the inevitable mixing of binary protocol that looks like text with real text, turning the whole into an unintelligible garble, that hurts so much harder for people who can't properly write their names in ASCII.
> ターンブル・スティーヴェンです-ly y'rs,
-- Terry Jan Reedy