To unicode or not to unicode

Denis Kasak denis.kasak at gmail.com
Sun Feb 22 15:49:25 CET 2009


On Sun, Feb 22, 2009 at 1:39 AM, Ross Ridge <rridge at csclub.uwaterloo.ca> wrote:
> Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)
>> I understand what Unicode and MIME are for and why they exist. Neither
>> their merits nor your insults change the fact that the only current
>> standard governing the content of Usenet posts doesn't require their
>> use.
>
> Thorsten Kampe  <thorsten at thorstenkampe.de> wrote:
>>That's right. As long as you use pure ASCII you can skip this nasty step
>>of informing other people which charset you are using. If you do use non
>>ASCII then you have to do that. That's the way virtually all newsreaders
>>work. It has nothing to do with some 21+ year old RFC. Even your Google
>>Groups "newsreader" does that ('content="text/html; charset=UTF-8"').
>
> No, the original post demonstrates you don't have to include MIME headers
> for ISO 8859-1 text to be properly displayed by many newsreaders.  The fact
> that your obscure newsreader didn't display it properly doesn't mean
> that the original poster's newsreader is broken.

And how is this kind of assuming better than clearly stating the
encoding used? Does the fact that the last official Usenet RFC doesn't
mandate content-type headers mean that all bets are off and that we
should rely on guesswork to determine the correct encoding of a
message? No, it means the RFC is outdated and no longer suitable for
current needs.

>>Being explicit about your encoding is 99% of the whole Unicode magic in
>>Python and in any communication across the Internet (may it be NNTP,
>>SMTP or HTTP).
>
> HTTP requires the assumption of ISO 8859-1 in the absence of any
> specified encoding.

Which is, of course, completely irrelevant to this discussion. Or are
you saying that this fact somehow removes the need for specifying
encodings?

>>Your Google Groups simply uses heuristics to guess the
>>encoding the OP probably used. Windows newsreaders simply use the locale
>>of the local host. That's guessing. You can call it assuming but it's
>>still guessing. There is no way you can be sure without any declaration.
>
> Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
> It's just a different assumption, nor does making an assumption, ASCII
> or ISO 8859-1, give you any certainty.

Assuming is another way of saying "I don't know, so I'm using this
arbitrary default", which is not that different from a completely wild
guess. :-)
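
To illustrate, here is a minimal Python 3 sketch. The byte string is a
made-up payload chosen only for the example, not taken from the
original post. The same undeclared bytes decode without complaint
under several single-byte encodings, each producing different text, so
a default merely picks one reading without knowing it is the right one:

```python
# Hypothetical payload: "café" as an ISO 8859-1 sender would encode it.
raw = "café".encode("iso-8859-1")

# Single-byte codecs all accept these bytes, but disagree on the text.
as_latin1 = raw.decode("iso-8859-1")
as_cp1251 = raw.decode("cp1251")      # Cyrillic reading of the same bytes
as_greek = raw.decode("iso-8859-7")   # Greek reading of the same bytes

# Stricter codecs refuse instead of silently producing some text.
def decodes_cleanly(data, encoding):
    """Return True if `data` is valid in `encoding`, False otherwise."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

ascii_ok = decodes_cleanly(raw, "ascii")
utf8_ok = decodes_cleanly(raw, "utf-8")
```

Only the first reading is what the sender meant, and nothing in the
bytes themselves says so.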

>>And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
>>text doesn't fit that encoding it refuses to guess.
>
> Which is reasonable given that Python is a programming language where it's
> better to have a more conservative assumption about encodings so errors
> can be more quickly diagnosed.  A newsreader however is a different
> beast, where it's better to make a less conservative assumption that's
> more likely to display messages correctly to the user.  Assuming ISO
> 8859-1 in the absence of any specified encoding allows the message to be
> correctly displayed if the character set is either ISO 8859-1 or ASCII.
> Doing things the "pythonic" way and assuming ASCII only allows such
> messages to be displayed if ASCII is used.

Reading this paragraph, I've begun to think that we've misunderstood
each other. I agree that assuming ISO 8859-1 in the absence of a
declared encoding is a better guess than most (since it's more likely
to display the message correctly). However, not specifying the
encoding of a message is just asking for trouble, and assuming
anything is just an attempt to clean up someone else's mess.
Unfortunately, it is impossible to reliably detect the encoding by
heuristics alone, and with hundreds of encodings in existence today,
the only real solution to the problem is clearly stating your
content-type. Since MIME is the most accepted way of doing this, it
should be the preferred way, RFC'ed or not.
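
As a rough sketch of what "stating your content-type" looks like in
practice, the snippet below uses the `email` package from Python 3's
standard library. The message text is invented for the example, and
ISO 8859-1 is just one possible choice of charset. The point is that
the receiver reads the declared charset from the header instead of
guessing:

```python
from email.mime.text import MIMEText

# Hypothetical message body containing non-ASCII characters.
body = "Grüße"

# Sender side: declare the charset explicitly in the MIME headers.
msg = MIMEText(body, "plain", "iso-8859-1")
content_type = msg["Content-Type"]

# Receiver side: use the declared charset, no heuristics needed.
declared = msg.get_content_charset()
decoded = msg.get_payload(decode=True).decode(declared)
```

Here `declared` comes straight from the Content-Type header, so the
receiver never has to fall back on a locale default or a heuristic.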

-- 
Denis Kasak


