[Mailman-Developers] [PATCH] Proof-of-concept: proper MIME i18n mails from Mailman

Mon, 19 Nov 2001 20:40:11 +0900

I've done it!  I finally finished the work I had been meaning to do
for so long: adding proper MIMEification of i18n mails produced
within Mailman.  This works for both the subject and the body of any
mail produced from the UserNotification module.

This code is split up into a patch to the email module, including
a new Charset class to maintain details on each character set's
characteristics in the email world, and a patch to Mailman to use
the new functionality.  This email includes the first patch, to
the email module.  I will send a separate email with the patch
to Mailman to use this new code.

This is not the final version of the patch, but I'm so excited that
I finally got even multibyte character sets like Japanese working
properly that I wanted to post my progress.

Right now, any mail generated by Mailman will be internationalized
properly, including encoding the header and body with Quoted-Printable
or Base64 as appropriate.  
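
To illustrate what "as appropriate" means per character set, here is a
minimal sketch. It assumes a current Python 3 and uses the email.charset
module from its standard library, not the patch below. That module
exposes similarly named attributes, so it is a convenient way to see the
intended mapping.

```python
from email.charset import Charset, QP, BASE64

def describe(name):
    """Return (header_encoding, body_encoding, output_charset) for a charset name."""
    cs = Charset(name)
    return cs.header_encoding, cs.body_encoding, cs.output_charset

# ASCII-like charsets get quoted-printable; Japanese input is Base64-encoded
# in headers and converted to iso-2022-jp on output.
latin1 = describe("iso-8859-1")
eucjp = describe("euc-jp")
plain = describe("us-ascii")
```

The constants QP and BASE64 are plain integers, and None means "no
encoding needed", as in the charset_map table in the patch.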

I tested it by setting DEFAULT_SERVER_LANGUAGE = 'ja' in 
mm_cfg.py and running 'newlist', and it sent out the following:

Received: from localhost (HELO nausicaa.interq.or.jp) (127.0.0.1)
  by localhost with SMTP; 19 Nov 2001 10:03:27 -0000
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit
Subject: =?iso-2022-jp?b?GyRCJCIkSiQ/JE4/NyQ3JCQlYSE8JWobKEI=?=
 =?iso-2022-jp?b?GyRCJXMlMCVqJTklSBsoQjogbGFuZ3Rlc3Q0?=
From: mailman-admin@nausicaa.interq.or.jp
To: ben@gmo.jp
Message-ID: <mailman.0.1006164205.26198.langtest4@nausicaa.interq.or.jp>
Sender: langtest4-admin@nausicaa.interq.or.jp
Errors-To: langtest4-admin@nausicaa.interq.or.jp
X-BeenThere: langtest4@nausicaa.interq.or.jp
X-Mailman-Version: 2.1a3+
Precedence: bulk

(snip rest of headers and properly encoded Japanese body)

You'll notice that the Subject line has been converted from the EUC-JP
encoding used in Mailman's Japanese localization to ISO-2022-JP, used
in emails.  
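
For anyone who wants to check the Subject line above without a
Japanese-aware mail reader, the following is a small sketch. It assumes
a current Python 3, whose standard library includes both
email.header.decode_header and an iso-2022-jp codec. The helper name
decode_subject is made up for this example.

```python
from email.header import decode_header

# The two encoded-words from the Subject header shown above,
# joined by a folded line break.
subject = (
    "=?iso-2022-jp?b?GyRCJCIkSiQ/JE4/NyQ3JCQlYSE8JWobKEI=?=\n"
    " =?iso-2022-jp?b?GyRCJXMlMCVqJTklSBsoQjogbGFuZ3Rlc3Q0?="
)

def decode_subject(raw):
    """Decode RFC 2047 encoded-words into a single Unicode string."""
    pieces = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            pieces.append(data.decode(charset or "ascii"))
        else:
            pieces.append(data)
    return "".join(pieces)

text = decode_subject(subject)
```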

It's also been split *SAFELY* on the magic 76 character boundary -- we
have to make sure we split on a *CHARACTER* boundary, not a byte
boundary, when splitting Japanese strings, as multibyte characters
will get corrupted if they're split in the middle of their bytes. Also,
the MIME-Version, Content-Type, and Content-Transfer-Encoding (CTE)
headers are added correctly.
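
The character-boundary idea can be sketched as follows. This is
modern Python 3, not the patch code; the function name, the sample
string, and the 45-byte budget are assumptions chosen for illustration.
The point is that slicing happens on the Unicode string, and each piece
is encoded only afterwards.

```python
def split_on_characters(text, codec="iso2022_jp", max_bytes=45):
    """Split text so each encoded piece fits in max_bytes.

    Slicing is done on the Unicode string, so a multibyte character
    is never cut in half.
    """
    encoded = text.encode(codec)
    if len(encoded) <= max_bytes or len(text) <= 1:
        return [encoded]
    middle = len(text) // 2
    return (split_on_characters(text[:middle], codec, max_bytes) +
            split_on_characters(text[middle:], codec, max_bytes))

# Hypothetical sample subject, not taken from Mailman's catalog.
sample = "日本語のメーリングリストへようこそ: langtest4"
pieces = split_on_characters(sample)
```

Each piece is a complete ISO-2022-JP byte string with its own escape
sequences, so every piece can be Base64-encoded and decoded on its own.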

The body was also encoded correctly in ISO-2022-JP, but I didn't want
to include it and confuse other folks' possibly-not-Japanese-aware
mail readers.

I'm using the wonderful Japanese Unicode codec for Python from
http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ to convert to
and from EUC-JP/ISO-2022-JP, through Unicode.  It works splendidly,
and is the perfect solution to split multibyte character sets on
character (not byte) boundaries for wrapping long header lines.
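
The conversion through Unicode looks roughly like this. The sketch
assumes a current Python 3, where euc_jp and iso2022_jp codecs ship with
the interpreter, so the add-on codec package named above is not needed
to run it. The function name and sample text are made up.

```python
def euc_jp_to_iso2022_jp(data):
    """Re-encode EUC-JP bytes as ISO-2022-JP by way of a Unicode string."""
    return data.decode("euc_jp").encode("iso2022_jp")

# Stand-in for a string from the Japanese message catalog.
sample = "こんにちは".encode("euc_jp")
converted = euc_jp_to_iso2022_jp(sample)
```

The result contains only 7-bit bytes, which is why the message above
can be sent with Content-Transfer-Encoding: 7bit.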

My code deals with ASCII-like character sets by using
quoted-printable for them, but I noticed that a few of the
translations wuss out from using 8-bit characters!  The Spanish
translation uses HTML escapes, which makes it useless for emails:

[ben@nausicaa:~/src/mailman/mailman/messages/es/LC_MESSAGES]% grep eacute mailman.po
"        tambi&eacute;n puede <a href=\"%(creatorurl)s\">crear una lista de\n"
"    podr&iacute;a hacer que otras pantallas est&eacute;n desincronizadas. \n"
"    Asegurese de recargar cualquier otra p&aacute;gina que est&eacute; "
"Tambi&eacute;n puede "

(snip)

These should be looked at.
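
One possible way to deal with such strings, sketched in current Python 3
and not part of this patch: turn the HTML entities back into real
characters before the text goes into a mail, so it can then be encoded
as iso-8859-1. The helper name is hypothetical.

```python
import html

def unescape_for_mail(text):
    """Replace HTML entities such as &eacute; with the actual characters."""
    return html.unescape(text)

fixed = unescape_for_mail("Tambi&eacute;n puede")
as_latin1 = fixed.encode("iso-8859-1")
```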

This code is mostly-tested, but I had to change a *lot* of
UserNotification calls, so I'm not sure if I guessed the lang
parameter right in all cases.  

The language detection code worked great when I ran "newlist" from
the command line, but when I tried the CGI interface, I got a bounce
in the logs and no traceback anywhere, so it needs more debugging.

Below is the patch to email-0.95, to enable the charset related
code we need to get Mailman to do all this cool stuff.  The next
email will include the Mailman patch to use this oh-so-required
feature.

diff -ruN email.orig/Charset.py email/Charset.py

--- email.orig/Charset.py	Thu Jan  1 09:00:00 1970
+++ email/Charset.py	Mon Nov 19 19:20:49 2001
@@ -0,0 +1,165 @@
+# Copyright (C) 2001 Python Software Foundation
+# Author: che@debian.org (Ben Gertzfield)
+
+import types
+import codecs
+
+# Flags for types of header encodings
+QP     = 1  # Quoted-Printable
+BASE64 = 2  # Base64
+
+class Charset:
+    """Map charsets to their email characteristics.
+
+    This module deals with the nitty-gritty of email character set
+    internals, and has a bit of "do-what-I-mean" functionality.  Given
+    a character set, it will do its best to provide information
+    on how to use that character set in an email.
+    
+    Certain character sets must be encoded with quoted-printable or
+    Base64 when used in email headers or bodies, and certain character
+    sets must be converted outright, and are not allowed in email.
+    Instances of this class expose the following information about
+    a character set:
+
+    input_charset: The initial character set specified.  Common aliases
+      are converted to their "official" email names (like latin_1 to
+      iso-8859-1).
+
+    header_encoding: If the character set must be encoded before using
+      it in an email header, set to Charset.QP (for quoted-printable)
+      or Charset.BASE64 (for base64 encoding).  Otherwise, it will be
+      None.
+
+    body_encoding: Same as header_encoding, but for the email's body.
+
+    output_charset: Some character sets must be converted before using
+      them in email headers or bodies.  If the input_charset is one
+      of them, set to another character set.
+
+    input_codec: The name of the Python codec used to convert the
+      input_charset to Unicode.
+
+    output_codec: The name of the Python codec used to convert Unicode
+      to the output_charset.
+    """
+    charset_map = {
+        # input       header enc  body enc output conv
+        'iso-8859-1': [QP,        QP,      None], 
+        'iso-8859-2': [QP,        QP,      None],
+        'us-ascii':   [None,      None,    None],
+        'big5':       [BASE64,    BASE64,  None],
+        'gb2312':     [BASE64,    BASE64,  None], 
+        'euc-jp':     [BASE64,    None,    'iso-2022-jp'],
+        'shift_jis':  [BASE64,    None,    'iso-2022-jp'],
+        'iso-2022-jp':[BASE64,    None,    None],
+        'koi8-r':     [BASE64,    BASE64,  None],
+        }
+
+    # Aliases for other commonly-used names for character sets.  Map
+    # them to the real ones used in email.
+    alias_map = {
+        'latin_1': 'iso-8859-1',
+        'ascii': 'us-ascii'
+        }
+    
+    # Map charsets to their unicode codec strings.
+    # Note that the japanese examples included below do not (yet) come with
+    # Python!  They are available from:
+    # http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/
+    
+    unicode_map = {
+        'euc-jp': 'japanese.euc-jp',
+        'iso-2022-jp': 'japanese.iso-2022-jp',
+        'shift_jis': 'japanese.shift_jis',
+        }
+
+
+    def __init__(self, input_charset='iso-8859-1', header_encoding=None,
+                 body_encoding=None, output_charset=None, input_codec=None,
+                 output_codec=None):
+        if self.alias_map.has_key(input_charset):
+            self.input_charset = input_charset = self.alias_map[input_charset]
+        else:
+            self.input_charset = input_charset
+
+        # We can try to guess which encoding and conversion to use by the
+        # charset_map dictionary.  Try that first, but let the user
+        # override it.
+        if self.charset_map.has_key(input_charset):
+            self.header_encoding = (header_encoding or
+                                    self.charset_map[input_charset][0])
+            self.body_encoding = (body_encoding or
+                                  self.charset_map[input_charset][1])
+            self.output_charset = (output_charset or
+                                   self.charset_map[input_charset][2] or
+                                   input_charset)
+            
+            if self.alias_map.has_key(self.output_charset):
+                self.output_charset = self.alias_map[self.output_charset]
+            if self.output_charset is not None:
+                if self.unicode_map.has_key(self.input_charset):
+                    self.input_codec = (input_codec or
+                                        self.unicode_map[self.input_charset])
+                else:
+                    self.input_codec = input_codec
+                if self.unicode_map.has_key(self.output_charset):
+                    self.output_codec = (output_codec or
+                                         self.unicode_map[self.output_charset] or
+                                         self.input_codec)
+                else:
+                    self.output_codec = output_codec or self.input_codec
+            else:
+                self.input_codec = self.output_codec = None
+        # Otherwise, just go with what the user said.
+        else:
+            self.header_encoding = header_encoding
+            self.body_encoding = body_encoding
+            self.output_charset = output_charset
+            self.input_codec = input_codec
+            self.output_codec = output_codec or self.input_codec
+
+    def to_output(self, str):
+        """Convert the string from the input_codec to the output_codec."""
+        if self.input_codec <> self.output_codec:
+            return unicode(str, self.input_codec).encode(self.output_codec)
+        else:
+            return str
+
+    def to_splittable(self, str):
+        """Convert a possibly multibyte string to a safely splittable format.
+
+        Uses the input_codec to try and convert the string to Unicode, so
+        it can be safely split on character boundaries (even for double-byte
+        characters).
+
+        Returns the string untouched if it doesn't know how to convert it
+        to Unicode with the input_charset.
+
+        Will raise LookupError if the input_codec is not installed on the system.
+        """
+        if type(str) is types.UnicodeType:
+            return str
+        
+        if self.input_codec is not None:
+            return unicode(str, self.input_codec)
+        else:
+            return str
+
+    def from_splittable(self, uni_string):
+        """Convert a splittable string back into an encoded string.
+
+        Uses the output_codec to try and convert the string from Unicode
+        back into its output encoded format.  Returns the string as-is if
+        it is not Unicode, or if it could not be encoded from Unicode.
+
+        Will raise LookupError if the output_codec is not installed on the system.
+        """
+        if type(uni_string) is not types.UnicodeType:
+            return uni_string
+        
+        if self.output_codec is not None:
+            return uni_string.encode(self.output_codec)
+        else:
+            return uni_string
+
diff -ruN email.orig/Encoders.py email/Encoders.py
--- email.orig/Encoders.py	Tue Oct  2 04:29:38 2001
+++ email/Encoders.py	Mon Nov 19 20:36:53 2001
@@ -6,8 +6,13 @@
 
 import base64
 import quopri
+import re
+from binascii import b2a_base64
 from cStringIO import StringIO
 
+from Charset import Charset, QP, BASE64
+
+CRLFSPACE = "\015\012 "
 
 
 # Helpers
@@ -24,6 +29,15 @@
         return value[:-1]
     return value
 
+def _max_append(list, str, maxlen):
+    if len(list) == 0:
+        list.append(str)
+        return
+    
+    if len(list[-1] + str) < maxlen:
+        list[-1] += str
+    else:
+        list.append(str)
 
 def _bencode(s):
     # We can't quite use base64.encodestring() since it tacks on a "courtesy
@@ -36,6 +50,17 @@
         return value[:-1]
     return value
 
+def _chunk_append(chunks, header, goodlinelen=75):
+    if len(chunks) == 0:
+        chunks.append(header)
+        return
+    
+    for chunk in header.split(CRLFSPACE):
+        if len(chunks[-1] + chunk) < goodlinelen:
+            chunks[-1] += " " + chunk
+        else:
+            chunks.append(chunk)
+
 
 
 def encode_base64(msg):
@@ -78,3 +103,276 @@
 
 def encode_noop(msg):
     """Do nothing."""
+
+
+def qencode_len(str):
+    """Return the length of str when it is encoded with header q-p."""
+    count = 0
+    
+    for c in str:
+        if ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
+            (c >= '0' and c <= '9') or (c in ('!', '*', '+', '-', '/', ' '))):
+            count += 1
+        else:
+            count += 3
+
+    return count
+
+def bencode_len(str):
+    """Return the length of str when it is encoded with base64."""
+    return ((len(str) + 2) / 3) * 4
+
+def header_qencode(header, charset="iso-8859-1", maxlinelen=75):
+    """Encode a header line with quoted-printable (like) encoding.
+
+    Defined in RFC 2047, this "Q" encoding is similar to
+    quoted-printable, but used specifically for email header fields to
+    allow charsets with mostly 7 bit characters (and some 8 bit) to
+    remain more or less readable in non-RFC 2047 aware mail clients.
+
+    The resulting string will be in the form:
+
+    "=?charset?q?I_f=E2rt_in_your_g=E8n=E8ral_dire=E7tion?=\r\n
+      =?charset?q?Silly_=C8nglish_Kn=EEghts?="
+
+    with each line wrapped safely at, at most, maxlinelen characters.
+    It is safe to use verbatim in any email header field, as the
+    wrapping is performed in a quoted-printable aware way and each
+    linefeed is a \r\n.
+
+    charset defaults to "iso-8859-1", and maxlinelen defaults to 75
+    characters.
+    """
+    quoted = []
+
+    # =? plus ?q? plus ?= is 7 characters
+    maxlen = maxlinelen - len(charset) - 7
+    
+    for c in header:
+        # Space may be represented as _ instead of =20 for readability
+        if c == ' ':
+            _max_append(quoted, "_", maxlen)
+        # These characters can be included verbatim
+        elif ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
+              (c >= '0' and c <= '9') or (c in ('!', '*', '+', '-', '/'))):
+            _max_append(quoted, c, maxlen)
+        # Otherwise, replace with hex value like =E2
+        else:
+            _max_append(quoted, "=%02X" % (ord(c)), maxlen)
+
+    encoded = ""
+
+    for q in quoted:
+        # Any chunk past the first must start with "\r\n "
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        encoded += "=?%s?q?%s?=" % (charset, q)
+
+    return encoded
+
+def header_bencode(header, charset="iso-8859-1", maxlinelen=75):
+    """Encode a header line with Base64 encoding and a charset specification.
+    
+    Defined in RFC 2047, this "B" encoding is identical to normal
+    Base64 encoding, except that each line must be intelligently
+    wrapped (respecting the Base64 encoding), and subsequent lines must
+    start with a space.
+
+    The resulting string will be in the form:
+
+    "=?charset?b?WW/5ciBtYXp66XLrIHf8eiBhIGhhbXBzdGHuciBBIFlv+XIgbWF6euly?=\r\n
+      =?charset?b?6yB3/HogYSBoYW1wc3Rh7nIgQkMgWW/5ciBtYXp66XLrIHf8eiBhIGhh?="
+      
+    with each line wrapped at, at most, maxlinelen characters. It is
+    safe to use verbatim in any email header field, as the wrapping is
+    performed in a Base64-aware way and each linefeed is a
+    \r\n.
+
+    charset defaults to "iso-8859-1", and maxlinelen defaults to 75
+    characters.
+    """
+
+    if len(header) == 0:
+        return header
+    
+    base64ed = []
+
+    max_encoded = maxlinelen - len(charset) - 7
+    max_unencoded = (max_encoded / 4) * 3
+
+    for i in xrange(0, len(header), max_unencoded):
+        base64ed.append(b2a_base64(header[i:i+max_unencoded]))
+
+    encoded = ""
+
+    for b in base64ed:
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        # We ignore the last character of each line if it is a \n.
+        if b[-1] == '\n':
+            b = b[:-1]
+
+        encoded += "=?%s?b?%s?=" % (charset, b)
+
+    return encoded
+
+def encode_header_chunks(header_chunks):
+    """MIME-encode a header with many different charsets and/or encodings.
+
+    Given a list of pairs [string, charset], return a MIME-encoded
+    string suitable for use in a header field.  Each pair may have
+    a different charset and/or encoding, and the resulting header will
+    accurately reflect each setting.
+
+    Each charset's header_encoding can be Charset.QP (quoted-printable,
+    for ASCII-like character sets like iso-8859-1), Charset.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+    
+    Each pair will be represented on a separate line; the resulting
+    string will be in the format:
+
+    "=?charset1?q?Mar=EDa_Gonz=E1lez_Alonso?=\r\n
+      =?charset2?b?SvxyZ2VuIEL2aW5n?="
+    """
+    chunks = []
+
+    for header, charset in header_chunks:
+        if charset is None or charset.header_encoding is None:
+            _chunk_append(chunks, header)
+        else:
+            if charset.header_encoding is QP:
+                _chunk_append(chunks,
+                              header_qencode(header, charset.output_charset))
+                    
+            elif charset.header_encoding is BASE64:
+                _chunk_append(chunks,
+                              header_bencode(header, charset.output_charset))
+
+    return CRLFSPACE.join(chunks)
+
+def encode_address(real_name, address, charset=Charset("iso-8859-1")):
+    """MIME-encode a header field intended for an address (from, to, cc, etc.)
+    
+    Given a real name, an email address, and optionally the real
+    name's character set (as a Charset object, defaulting to
+    iso-8859-1), return a 7-bit MIME-encoded string suitable for use
+    in a From, To, Cc, or other email header field.
+    
+    The resulting string will be in the format:
+    
+    "=?charset?q?Kevin_Phillips_B=F6ng?= <philips@slightly.silly.party.go.uk>"
+    
+    and can be included verbatim in an email header field.  Even
+    very long addresses are handled properly with this method:
+    
+    "=?charset?q?T=E4rquin_Fintimlinbinhinbimlim_Bus_St=F6p_Poontang_Poont?=\r\n
+      =?charset?q?ang_Ol=E9_Biscuit-Barrel?=\r\n
+      <tarquin@very.silly.party.go.uk>"
+    """       
+
+    if not real_name:
+        return address
+
+    header_chunks = header_split(real_name, charset)
+    header_chunks.append(["<%s>" % address, Charset("us-ascii")])
+    return encode_header_chunks(header_chunks)
+
+def encode_header(header, charset=Charset("iso-8859-1")):
+    """Encode a message header, possibly converting charset and encoding.
+
+    There are many issues involved in converting a given string for
+    use in an email header.  Only certain character sets are readable
+    in most email clients, and as header strings can only contain a
+    subset of 7-bit ASCII, care must be taken to properly convert and
+    encode (with Base64 or quoted-printable) header strings.  In
+    addition, there is a 75-character length limit on any given
+    encoded header field, so line-wrapping must be performed, even
+    with double-byte character sets.
+
+    This function, given a header string and its character set, will
+    do its best to convert the string to the correct character set
+    used in email, and encode and line wrap it safely with the
+    appropriate scheme for that character set.  If the given
+    input_charset is not known or an error occurs during conversion,
+    this function will return the header untouched.
+    """
+
+    if not header:
+        return header
+
+    header = fix_eols(header)
+
+    header_chunks = header_split(header, charset)
+
+    return encode_header_chunks(header_chunks)
+
+def encode_body(body, charset=Charset("iso-8859-1")):
+    """Encode a message body, possibly converting charset and encoding."""
+
+    if not body:
+        return body
+
+    if charset.body_encoding is QP:
+        infp = StringIO(charset.to_output(body))
+        outfp = StringIO()
+        quopri.encode(infp, outfp, quotetabs=0)
+        return fix_eols(outfp.getvalue())
+    elif charset.body_encoding is BASE64:
+        encoded = ""
+        for s in body_base64_split(body, charset):
+            encoded += b2a_base64(s)
+        return encoded
+    else:
+        return charset.to_output(body)
+    
+def body_base64_split(body, charset, maxlinelen=75):
+    """Split up a body safely, respecting encoding and multibyte charsets."""
+
+    splittable = charset.to_splittable(body)
+    encoded = charset.from_splittable(splittable)
+
+    length = bencode_len(encoded)
+
+    if length < maxlinelen:
+        return [encoded]
+    else:
+        # divide and conquer
+        first = splittable[:len(splittable)/2]
+        last = splittable[len(splittable)/2:]
+        
+        return (body_base64_split(first, charset) +
+                body_base64_split(last, charset))
+
+def header_split(header, charset, maxlinelen=75):
+    """Split up a header safely for use with encode_header_chunks."""
+
+    splittable = charset.to_splittable(header)
+    encoded = charset.from_splittable(splittable)
+
+    length = len(encoded)
+    
+    if charset.header_encoding is QP:
+        length = qencode_len(encoded)
+    elif charset.header_encoding is BASE64:
+        length = bencode_len(encoded)
+
+    if length < maxlinelen - len(charset.output_charset) - 7:
+        return [(encoded, charset)]
+    else:
+        # divide and conquer
+        first = splittable[:len(splittable)/2]
+        last = splittable[len(splittable)/2:]
+        return (header_split(first, charset) +
+                header_split(last, charset))
+                    
+def fix_eols(str):
+    """Replace all line-ending characters with \r\n."""
+
+    # Fix newlines with no preceding carriage return
+    str = re.sub(r"(?<!\r)\n", r"\r\n", str)
+    # Fix carriage returns with no following newline
+    str = re.sub(r"\r(?!\n)", r"\r\n", str)
+
+    return str
+


-- 
Brought to you by the letters M and G and the number 13.
"If you turn both processors off, you will have to reboot."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/