[I18n-sig] Autoguessing charset for Unicode strings?

Barry A. Warsaw barry@digicool.com
Tue, 19 Jun 2001 18:59:31 -0400


I just don't know enough about Unicode in general (I've been one of
those eye-glazers Skip refers to ;), so I figured I'd ask this
question here.  First, some background.

I'm trying to add support for RFC 2047 in mimelib.  Essentially, this
RFC specifies how to include non-ASCII characters in mail headers, by
describing an encoding format.  The format lets you wrap "funny"
characters in something like: =?iso-8859-1?Q?B=E2rry W=E2rs=E2w?=

So, I think I've got the first part working, which is this: when I see
such an encoded header, I pull out the encoded string, quopri decode
it[*], then coerce to Unicode, giving the charset part as the second
argument to unicode().  Specifically, the algorithm is something like:

    import base64
    import quopri

    def decode_atom(value):   # function name is illustrative only
        # value is one encoded word, e.g. '=?iso-8859-1?Q?...?='
        parts = value.split('?')
        if (len(parts) == 5 and parts[0].endswith('=')
                and parts[4].startswith('=')):
            charset = parts[1]
            encoding = parts[2].lower()
            atom = parts[3]
            if encoding == 'q':
                decoded_atom = quopri.decodestring(atom)
            elif encoding == 'b':
                decoded_atom = base64.decodestring(atom)
            else:
                raise ValueError, 'bad encoding: %s' % encoding
            return unicode(decoded_atom, charset)
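
For readers on a current interpreter, the same decoding step can be
sketched in Python 3 roughly as follows.  This is only an illustration
under stated assumptions: the function name decode_encoded_word is
hypothetical, str.decode() stands in for unicode(), and the sample
header writes the space as `_', since RFC 2047 does not allow a
literal space inside an encoded word.

```python
import base64
import quopri


def decode_encoded_word(value):
    """Decode one RFC 2047 encoded word (hypothetical helper name)."""
    parts = value.split('?')
    if len(parts) != 5 or parts[0] != '=' or parts[4] != '=':
        raise ValueError('not an encoded word: %r' % value)
    charset, encoding, atom = parts[1], parts[2].lower(), parts[3]
    if encoding == 'q':
        # header=True also turns '_' back into a space, which is how
        # the Q encoding represents spaces in headers.
        raw = quopri.decodestring(atom.encode('ascii'), header=True)
    elif encoding == 'b':
        raw = base64.b64decode(atom)
    else:
        raise ValueError('bad encoding: %s' % encoding)
    # raw is a byte string; the charset says how to read it as text.
    return raw.decode(charset)


name = decode_encoded_word('=?iso-8859-1?Q?B=E2rry_W=E2rs=E2w?=')
# name == 'Bârry Wârsâw'
```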

So far so good.  Now let's say I want to go in the other direction:
given a Unicode string, I want to create the RFC 2047 encoded string
to add to the header.  Is this possible without requiring the user to
explicitly provide the charset that the Unicode string is encoded
with?
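
To make the reverse direction concrete, here is a minimal Python 3
sketch for the case where the caller does supply the charset.  The
name encode_word and its parameters are hypothetical, the Q branch
conservatively escapes everything except ASCII letters and digits,
and the RFC's 75-character limit on an encoded word is ignored.

```python
import base64
import string

# Bytes that are emitted literally in the Q encoding (a conservative choice).
SAFE = set((string.ascii_letters + string.digits).encode('ascii'))


def encode_word(text, charset, encoding='q'):
    """Build one RFC 2047 encoded word from text (hypothetical helper)."""
    # Raises UnicodeEncodeError if charset cannot represent text.
    raw = text.encode(charset)
    if encoding == 'b':
        payload = base64.b64encode(raw).decode('ascii')
    elif encoding == 'q':
        pieces = []
        for byte in raw:
            if byte in SAFE:
                pieces.append(chr(byte))
            elif byte == 0x20:
                pieces.append('_')            # space is written as '_'
            else:
                pieces.append('=%02X' % byte)  # everything else as =XX
        payload = ''.join(pieces)
    else:
        raise ValueError('bad encoding: %s' % encoding)
    return '=?%s?%s?%s?=' % (charset, encoding.upper(), payload)


header = encode_word('B\xe2rry W\xe2rs\xe2w', 'iso-8859-1')
# header == '=?iso-8859-1?Q?B=E2rry_W=E2rs=E2w?='
```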

My understanding is that the Unicode string doesn't have a notion of
the charset that it was encoded with, but is it possible to guess the
charset of a Unicode string reliably?  Even if you can only guess 80%
of the time, that'd be fine if I can throw an exception for the other
20%.  Is there an existing Python solution for this?  Does my question
even make sense? ;)
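
One way to approximate such a guess, sketched here in Python 3, is
not to recover the original charset at all but to try a short list of
candidates and take the first one that can represent every character,
raising an exception otherwise.  The function name guess_charset and
the candidate list are assumptions made for illustration, not an
existing library API.

```python
def guess_charset(text, candidates=('us-ascii', 'iso-8859-1', 'utf-8')):
    """Return (charset, encoded bytes) for the first candidate that works.

    Hypothetical helper: the candidate order is an arbitrary choice that
    prefers the narrowest charset first.
    """
    for charset in candidates:
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset lacks some character; try the next
    raise ValueError('no candidate charset can encode %r' % text)


charset, raw = guess_charset('B\xe2rry')
# charset == 'iso-8859-1', raw == b'B\xe2rry'
```

Note that this picks a charset that is sufficient for the string,
which need not be the one the text originally arrived in.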

Thanks,
-Barry

[*] The `Q' (or `q') in between the ?'s means the string is encoded
using quoted-printable.  Thus the recent rash of fixes to the quopri
module.  The RFC says that alternatively, a `B' (or `b') is valid,
meaning Base64 was used.