[I18n-sig] Autoguessing charset for Unicode strings?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 20 Jun 2001 01:27:28 +0200

> So far so good.  Now let's say I want to go in the other direction,
> i.e. given a Unicode string, I want to create the RFC 2047 encoded
> string to add to the header, so I need to be able to go "the other way
> 'round".  Is this possible without requiring the user to explicitly
> provide the charset that the Unicode string is encoded with?

Yes, doing so is trivial - the tricky part is making it work elegantly.
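As a hedged sketch of the "trivial" part, today's stdlib email.header module (which postdates this thread) can produce the RFC 2047 encoded-word once a charset has been chosen - note that the caller still has to supply that charset explicitly; the guessing discussed below is not done for you:

```python
from email.header import Header, decode_header

# Encode a Unicode string as an RFC 2047 encoded-word, with an
# explicitly chosen charset (here utf-8).
encoded = Header("Grüße", charset="utf-8").encode()

# decode_header recovers the (raw_bytes, charset) pairs, letting us
# verify the round trip.
(raw, cs), = decode_header(encoded)
```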

> My understanding is that the unicode string doesn't have a notion of
> the charset that it was encoded with, but is it possible to guess the
> charset of a Unicode string reliably?  Even if you can only guess 80%
> of the time, that'd be fine if I can throw an exception for the other
> 20%.  Is there an existing Python solution for this?  Does my question
> even make sense? ;)

Your question makes perfect sense; it is one of the rather troubling
problems in the world of character set conversions. Another form of
the same problem is "how does Tk pick the right font to display some
Unicode string?"

Back to your question: The easiest path is to always use UTF-8 as the
outgoing character set. UTF-8 is a well-recognized MIME encoding
(although I forgot the RFC number), and it is capable of encoding all
Unicode strings losslessly.
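That lossless round trip is easy to demonstrate; a minimal sketch (the sample string is mine):

```python
# UTF-8 can represent any Unicode string, so encode() never raises here,
# and decoding the bytes gives back exactly the original string.
s = "Gr\u00fc\u00dfe \u65e5\u672c\u8a9e"
data = s.encode("utf-8")
roundtripped = data.decode("utf-8")
```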

However, that might produce RFC 2047 quoting even when the string
contains no funny characters, so a better procedure might be:

1. try to encode as ASCII. If that succeeds, no quotation is needed
2. if that fails, use UTF-8

Now, many email readers will still choke these days when they see
UTF-8 (the Microsoft ones being positive exceptions here), but will
recognize Latin-1. So, another procedure might be:

1. try to encode as ASCII
2. if that fails, try iso-8859-1
3. if that fails, use UTF-8
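The three-step cascade above can be sketched as a small helper (the function name and preference-list parameter are mine, not from this thread); "us-ascii" and "iso-8859-1" are standard Python codec aliases:

```python
def pick_charset(s, charsets=("us-ascii", "iso-8859-1", "utf-8")):
    """Return (charset, encoded_bytes) for the first charset in the
    preference order that can represent s.

    utf-8, last in the list, always succeeds, so the loop cannot
    fall through for any Unicode string.
    """
    for cs in charsets:
        try:
            return cs, s.encode(cs)
        except UnicodeEncodeError:
            continue
```

With this preference order, pure-ASCII strings need no quoting at all, Western European text stays in Latin-1 for the benefit of readers that choke on UTF-8, and everything else falls back to UTF-8.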

You'll see that this becomes more and more expensive. People may now
propose that this really should be application-controlled, but I think
they'd be misguided: the application is normally in no better position
than the library to select a "good" encoding.

The latter algorithm may also be considered Euro-centric: the
intermediate iso-8859-1 step helps only readers of Western European
languages, while everything else goes straight to UTF-8.
BTW, the same procedure probably needs to be used for MIME messages of
type text/plain when a charset= is specified. I.e. use of a fixed
mimify.CHARSET is really not appropriate anymore.