[Mailman-Developers] MIME-encoding headers/body for Mailman-generated mails

Ben Gertzfield che@debian.org
Fri, 26 Oct 2001 17:12:34 +0900


To finish up the Japanese support for Mailman, I'm going to dive in
and start by adding support for MIME-encoding and decoding (either
quoted-printable or base64, whichever is appropriate) header lines.

Right now, no matter what language is enabled, the localized emails
sent out through the virgin queue are sent verbatim.  We need to:

1) Encode the message (headers and body) with the encoding that the
   locale uses for email.  This is needed to convert EUC-JP to
   iso-2022-jp.  It could also convert Big5 and CN to iso-2022-cn
   (see http://www.imc.org/rfc1922), but I don't know if any Chinese
   mail readers actually support iso-2022-cn.

2) MIME-encode the Subject field, specifying the character set
   appropriate for the list's locale.  Use quoted-printable for
   ASCII-like charsets, base64 for non-ASCII-like charsets.

3) Set the charset in the Content-Type: of the mail, again appropriate
   for the list's locale.

4) (sometimes) MIME-encode the body with base64, for 8-bit character
   sets.  In a perfect world, all email would be 7-bit, but going by the
   Taiwanese spam I receive, people don't seem to send Chinese mail in
   iso-2022-cn.  Instead, the common thing to do seems to be sending
   8-bit Big5 mail that's base64 encoded (or not!  I get 8-bit email
   directly in Big5 from time to time in my spam folder.)
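
As a rough illustration of steps 2 and 3, here is a minimal sketch.
It assumes the Header and MIMEText classes of the email package in a
current Python 3, which postdates this message; build_notice is a
hypothetical helper name, not anything in Mailman.

```python
from email.header import Header, decode_header, make_header
from email.mime.text import MIMEText


def build_notice(subject, body, charset):
    """Hypothetical helper: build a text message in the given charset."""
    # MIMEText records the charset in Content-Type and picks a
    # Content-Transfer-Encoding for the body.
    msg = MIMEText(body, "plain", charset)
    # Header emits RFC 2047 encoded-words for a non-ASCII subject.
    msg["Subject"] = Header(subject, charset)
    return msg


msg = build_notice("テスト", "こんにちは\n", "iso-2022-jp")

# The wire form of the subject, and a round trip back to text.
encoded_subject = msg["Subject"].encode()
decoded_subject = str(make_header(decode_header(encoded_subject)))
```

The choice between quoted-printable and base64 is left to the
library's own per-charset defaults here; a list-specific table could
override it.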

I looked at the email and mimify modules, and neither of them exposes a
proper interface for MIME-encoding headers. Well, mimify *tries*,
really it does, but it forces quoted-printable, which makes no sense
for Asian languages, and does line-wrapping incorrectly, a HUGELY
important issue with double-byte encodings that will become corrupt if
they're line-wrapped in between two bytes of a double-byte character.
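
To make the wrapping problem concrete, here is a small sketch of
splitting text only between characters, never between the bytes of
one character.  It is written for a current Python 3 where text is
Unicode; split_on_char_boundaries is a hypothetical name, and a real
header encoder would also have to budget for the encoded-word
overhead, which this ignores.

```python
def split_on_char_boundaries(text, charset, max_bytes):
    """Split text into chunks that each encode to at most max_bytes.

    Breaks happen only between characters, so a multi-byte character
    is never cut in half.  A single character wider than max_bytes
    still gets a chunk of its own.
    """
    chunks = []
    current = ""
    for ch in text:
        candidate = current + ch
        if current and len(candidate.encode(charset)) > max_bytes:
            chunks.append(current)
            current = ch
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "日本語のメール"
chunks = split_on_char_boundaries(text, "euc-jp", 5)

# Naive byte slicing, by contrast, can cut a character in half.
naive = text.encode("euc-jp")[:5]
try:
    naive.decode("euc-jp")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
```

Each chunk can then be encoded and wrapped on its own without
corrupting the text.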

I think the proper place for this is in the email module, but I don't
want to re-invent the wheel. (Though I do understand the issue very
well, and have written up code to do this by hand before, including
supporting double-byte charsets properly).  Can we get the code from
somewhere else, or should I write up encode_header and decode_header
methods for the email.Message class?

Next, we need to come up with a table mapping languages to the
encodings they use for email.  Right now, these are the encodings used
for our supported languages (from Defaults.py):

def add_language(code, description, charset):
    LC_DESCRIPTIONS[code] = (description, charset)

add_language('big5', _('Traditional Chinese'), 'big5')
add_language('de',   _('German'),              'iso-8859-1')
add_language('en',   _('English (USA)'),       'us-ascii')
add_language('es',   _('Spanish (Spain)'),     'iso-8859-1')
add_language('fr',   _('French'),              'iso-8859-1')
add_language('gb',   _('Simplified Chinese'),  'gb2312')
add_language('hu',   _('Hungarian'),           'iso-8859-1')
add_language('it',   _('Italian'),             'iso-8859-1')
add_language('ja',   _('Japanese'),            'euc-jp')
add_language('no',   _('Norwegian'),           'iso-8859-1')
add_language('ru',   _('Russian'),             'koi8-r')

We need another mapping, from 'code' to 'email charset conversion',
'header mime method', and 'body mime method'. (The last one may not be
necessary, if we are converting to a 7-bit encoding.)

This is my understanding of what the email clients people around the
world use actually support, but I could be very wrong.

email_charsets = {
 # code    mail conv  header enc  body enc
  'big5': [None,      'base64',   'base64'],
  'de':   [None,      'qp',       'qp'], 
  'en':   [None,      None,       None],
  'es':   [None,      'qp',       'qp'],
  'fr':   [None,      'qp',       'qp'],
  'gb':   [None,      'base64',   'base64'], # just a guess! use iso-2022-cn?
  'hu':   [None,      'qp',       'qp'], # I thought Hungarian was iso-8859-2?
  'it':   [None,      'qp',       'qp'], 
  'ja':   ['iso-2022-jp', 'base64', None],
  'no':   [None,      'qp',       'qp'],
  'ru':   [None,      'base64',   None], # I assume koi8-r is 7-bit..
}
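
A lookup over such a table could be as simple as the sketch below.
The two dictionaries are cut-down copies of the tables above, and
mail_settings is a hypothetical helper name; a None in the
'mail conv' column is taken to mean "send in the list's own charset".

```python
# Cut-down copies of the tables above, for illustration only.
LC_CHARSETS = {
    'de': 'iso-8859-1',
    'en': 'us-ascii',
    'ja': 'euc-jp',
}

EMAIL_CHARSETS = {
    # code  mail conv       header enc  body enc
    'de':  [None,           'qp',       'qp'],
    'en':  [None,           None,       None],
    'ja':  ['iso-2022-jp',  'base64',   None],
}


def mail_settings(code):
    """Return (wire charset, header encoding, body encoding) for a language."""
    mail_conv, header_enc, body_enc = EMAIL_CHARSETS[code]
    # Fall back to the list's own charset when no conversion is listed.
    wire_charset = mail_conv or LC_CHARSETS[code]
    return wire_charset, header_enc, body_enc


ja_settings = mail_settings('ja')
de_settings = mail_settings('de')
en_settings = mail_settings('en')
```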

I was surprised to see that we specify iso-8859-1 for Hungarian; I'm
pretty sure it uses accented vowels that are only in iso-8859-2.

Also, I don't know if people actually use iso-2022-cn in the Real
World.  The RFCs suggest using it, but I get the feeling it's not
actually supported by Chinese email clients.  Anyone know?  If we
can use it, then we should for both big5 and gb.

In any case, there's a wonderful Python 2.0 codec module, which I'm
testing now, that makes it possible to convert to/from Japanese.  I am
VERY unhappy with the historically used kconv.py module, which has
thrown tracebacks at me whenever it sees encodings it doesn't
understand.  The new codec module should be good for our purposes; we
could just ship it in 'misc' for folks who need Japanese.

http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/
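
For reference, the EUC-JP to iso-2022-jp conversion itself is a
decode followed by an encode.  The sketch below assumes the Japanese
codecs bundled with a current Python 3 rather than the add-on module
linked above, and euc_jp_to_iso_2022_jp is a hypothetical name.

```python
def euc_jp_to_iso_2022_jp(data, errors="replace"):
    """Convert EUC-JP bytes to iso-2022-jp bytes.

    With errors="replace", input that cannot be converted is
    substituted instead of raising an exception.
    """
    text = data.decode("euc-jp", errors=errors)
    return text.encode("iso-2022-jp", errors=errors)


euc = "日本語".encode("euc-jp")       # 8-bit bytes
wire = euc_jp_to_iso_2022_jp(euc)     # 7-bit bytes with escape sequences
```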

Ben

-- 
Brought to you by the letters R and Y and the number 9.
"Hoosh is a kind of soup."
Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/