To unicode or not to unicode

Sat Feb 21 19:39:42 EST 2009

Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)
> I understand what Unicode and MIME are for and why they exist. Neither
> their merits nor your insults change the fact that the only current
> standard governing the content of Usenet posts doesn't require their
> use.

Thorsten Kampe  <thorsten at thorstenkampe.de> wrote:
>That's right. As long as you use pure ASCII you can skip this nasty step 
>of informing other people which charset you are using. If you do use non 
>ASCII then you have to do that. That's the way virtually all newsreaders 
>work. It has nothing to do with some 21+ year old RFC. Even your Google 
>Groups "newsreader" does that ('content="text/html; charset=UTF-8"').

No, the original post demonstrates you don't have to include MIME headers
for ISO 8859-1 text to be properly displayed by many newsreaders.  The
fact that your obscure newsreader didn't display it properly doesn't mean
that the original poster's newsreader is broken.

>Being explicit about your encoding is 99% of the whole Unicode magic in 
>Python and in any communication across the Internet (may it be NNTP, 
>SMTP or HTTP).

HTTP requires the assumption of ISO 8859-1 in the absence of any
specified encoding.

>Your Google Groups simply uses heuristics to guess the 
>encoding the OP probably used. Windows newsreaders simply use the locale 
>of the local host. That's guessing. You can call it assuming but it's 
>still guessing. There is no way you can be sure without any declaration.

Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
It's just a different assumption, and neither assumption, ASCII or
ISO 8859-1, gives you any certainty.

>And it's unpythonic. Python "assumes" ASCII and if the decodes/encoded 
>text doesn't fit that encoding it refuses to guess.

Which is reasonable given that Python is a programming language, where
it's better to have a more conservative assumption about encodings so
errors can be more quickly diagnosed.  A newsreader however is a different
beast, where it's better to make a less conservative assumption that's
more likely to display messages correctly to the user.  Assuming ISO
8859-1 in the absence of any specified encoding allows the message to be
correctly displayed if the character set is either ISO 8859-1 or ASCII.
Doing things the "pythonic" way and assuming ASCII only allows such
messages to be displayed if ASCII is used.
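
A rough sketch of that difference in Python 3, for illustration only.
The helper name decode_with_default and the sample bytes are made up
here, and no real newsreader is claimed to work this way:

```python
def decode_with_default(raw, default):
    """Decode undeclared bytes under an assumed default charset.

    Hypothetical helper: returns the decoded text, or None when the
    bytes do not fit the assumed charset.
    """
    try:
        return raw.decode(default)
    except UnicodeDecodeError:
        return None


# Two message bodies sent without any charset declaration.
ascii_body = b"cafe"        # pure ASCII
latin1_body = b"caf\xe9"    # "cafe" with e-acute, encoded as ISO 8859-1

# Assuming ISO 8859-1 handles both bodies, since ASCII is a subset of it.
print(decode_with_default(ascii_body, "iso-8859-1"))
print(decode_with_default(latin1_body, "iso-8859-1"))

# Assuming ASCII handles only the first body; the second one is refused.
print(decode_with_default(ascii_body, "ascii"))
print(decode_with_default(latin1_body, "ascii"))
```

Note that decoding as ISO 8859-1 never raises an error, since every
byte value maps to some character.  If the text was really in another
charset, the result is silently wrong rather than rejected, which is
the trade-off between the two assumptions.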

					Ross Ridge

-- 
 l/  //	  Ross Ridge -- The Great HTMU
[oo][oo]  rridge at csclub.uwaterloo.ca
-()-/()/  http://www.csclub.uwaterloo.ca/~rridge/ 
 db  //