[PATCH] Header q-p/base64 RFC 2047 encoding for email module
The following patch to the email module implements the Base64 and quoted-printable encodings specified by RFC 2047 (called "B" and "Q" encoding by the RFC) for header-safe encoding of 8-bit strings in From:, To:, Subject:, and other fields. It includes charset information within the encoded strings themselves, which, along with the special line-wrapping algorithm needed for B and Q encoding, makes this a very useful general feature for internationalized Python email programs.

Most MIME-aware mail readers in use today understand the RFC 2047 convention, and in the East Asian world it's absolutely necessary to send subject and address fields in Base64 encoding. Mailman needs this functionality in order to send out localized emails from the virgin queue; without it, 8-bit characters can easily end up placed blindly into the Subject: and To: fields. This also allows localized List-Id fields, as a bonus!

This patch adds the following functions to email.Utils:

  encode_address(real_name, address, charset="iso-8859-1", encoding=QP):
    MIME-encode a header field intended for an address (From, To, Cc, etc.).

  encode_header(header, charset="iso-8859-1", encoding=QP):
    MIME-encode a general email header field (e.g. Subject).

  encode_header_chunks(header_chunks):
    MIME-encode a header with many different charsets and/or encodings.

It also adds the following support functions to email.Encoders:

  header_qencode(header, charset="iso-8859-1", maxlinelen=75):
    Encode a header line with quoted-printable (like) encoding.

  header_bencode(header, charset, maxlinelen=75):
    Encode a header line with Base64 encoding and a charset specification.

I needed to re-implement the quoted-printable algorithm in header_qencode because the "Q" encoding specified by RFC 2047 differs in a few key areas from the one implemented in quopri.py, and the line-wrapping at 75 characters got too hairy with just quopri.py.

Patch follows, against email 0.95.
(Sorry, I tried CVS, but I didn't want to install Python 2.2 beta just yet.) I will work on integrating this into Mailman tomorrow.

diff -ruN email.orig/Encoders.py email/Encoders.py
--- email.orig/Encoders.py Tue Oct 2 04:29:38 2001
+++ email/Encoders.py Wed Nov 14 19:07:24 2001
@@ -6,8 +6,10 @@
 import base64
 import quopri
+from binascii import b2a_base64
 from cStringIO import StringIO
+CRLFSPACE = "\015\012 "
 # Helpers
@@ -24,6 +26,15 @@
         return value[:-1]
     return value
+def _max_append(list, str, maxlen):
+    if len(list) == 0:
+        list.append(str)
+        return
+
+    if len(list[-1] + str) < maxlen:
+        list[-1] += str
+    else:
+        list.append(str)
 def _bencode(s):
     # We can't quite use base64.encodestring() since it tacks on a "courtesy
@@ -78,3 +89,91 @@
 def encode_noop(msg):
     """Do nothing."""
+
+
+def header_qencode(header, charset="iso-8859-1", maxlinelen=75):
+    """Encode a header line with quoted-printable (like) encoding.
+
+    Defined in RFC 2045, this "Q" encoding is similar to
+    quoted-printable, but used specifically for email header fields to
+    allow charsets with mostly 7 bit characters (and some 8 bit) to
+    remain more or less readable in non-RFC 2045 aware mail clients.
+
+    The resulting string will be in the form:
+
+    "=?charset?q?I_f=E2rt_in_your_g=E8n=E8ral_dire=E7tion?=\r\n
+      =?charset?q?Silly_=C8nglish_Kn=EEghts?="
+
+    with each line wrapped safely at, at most, maxlinelen characters.
+    It is safe to use verbatim in any email header field, as the
+    wrapping is performed in a quoted-printable aware way and each
+    linefeed is a \r\n.
+
+    charset defaults to "iso-8859-1", and maxlinelen defaults to 75
+    characters.
+    """
+    quoted = []
+
+    # =? plus ?q? plus ?= is 7 characters
+    maxlen = maxlinelen - len(charset) - 7
+
+    for c in header:
+        # Space may be represented as _ instead of =20 for readability
+        if c == ' ':
+            _max_append(quoted, "_", maxlen)
+        # These characters can be included verbatim
+        elif ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
+              (c >= '0' and c <= '9') or (c in ('!', '*', '+', '-', '/'))):
+            _max_append(quoted, c, maxlen)
+        # Otherwise, replace with hex value like =E2
+        else:
+            _max_append(quoted, "=%02X" % (ord(c)), maxlen)
+
+    encoded = ""
+
+    for q in quoted:
+        # Any chunk past the first must start with "\r\n "
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        encoded += "=?%s?q?%s?=" % (charset, q)
+
+    return encoded
+
+def header_bencode(header, charset, maxlinelen=75):
+    """Encode a header line with Base64 encoding and a charset specification.
+
+    Defined in RFC 2045, this Base64 encoding is identical to normal
+    Base64 encoding, except that each line must be intelligently
+    wrapped (respecting the Base64 encoding), and subsequent lines must
+    start with a space.
+
+    The resulting string will be in the form:
+
+    "=?charset?b?WW/5ciBtYXp66XLrIHf8eiBhIGhhbXBzdGHuciBBIFlv+XIgbWF6euly?=\r\n
+      =?charset?b?6yB3/HogYSBoYW1wc3Rh7nIgQkMgWW/5ciBtYXp66XLrIHf8eiBhIGhh?="
+
+    with each line wrapped at, at most, maxlinelen characters. It is
+    safe to use verbatim in any email header field, as the wrapping is
+    performed in a quoted-printable aware way and each linefeed is a
+    \r\n.
+
+    charset defaults to "iso-8859-1", and maxlinelen defaults to 75
+    characters.
+    """
+    base64ed = []
+
+    maxlen = ((maxlinelen - len(charset) - 7) / 4) * 3
+    num_lines = (len(header) / maxlen) + 1
+
+    for i in xrange(0, num_lines):
+        base64ed.append(b2a_base64(header[i*maxlen:(i+1)*maxlen]))
+
+    encoded = ""
+
+    for b in base64ed:
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        # We ignore the last character of each line, which is a \n.
+        encoded += "=?%s?b?%s?=" % (charset, b[:-1])
+
+    return encoded
diff -ruN email.orig/Utils.py email/Utils.py
--- email.orig/Utils.py Sat Nov 10 02:07:44 2001
+++ email/Utils.py Wed Nov 14 19:16:00 2001
@@ -17,11 +17,16 @@
 import base64
 # Intrapackage imports
-from Encoders import _bencode, _qencode
+from Encoders import _bencode, _qencode, header_qencode, header_bencode
 COMMASPACE = ', '
 UEMPTYSTRING = u''
+CRLFSPACE = "\015\012 "
+
+# Flags for types of header encodings
+QP = 1 # Quoted-Printable
+BASE64 = 2 # Base64
 # Helpers
@@ -56,6 +61,16 @@
         return value[:-1]
     return value
+def _chunk_append(chunks, header, goodlinelen=75):
+    if len(chunks) == 0:
+        chunks.append(header)
+        return
+
+    for chunk in header.split(CRLFSPACE):
+        if len(chunks[-1] + chunk) < goodlinelen:
+            chunks[-1] += " " + chunk
+        else:
+            chunks.append(chunk)
 def getaddresses(fieldvalues):
@@ -156,3 +171,90 @@
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'][now[1] - 1],
         now[0], now[3], now[4], now[5],
         zone)
+
+def encode_address(real_name, address, charset="iso-8859-1", encoding=QP):
+    """MIME-encode a header field intended for an address (from, to, cc, etc.)
+
+    Given an 8-bit string containing a real name, an email address,
+    and optionally the real name's character set, and the encoding
+    you wish to use with it, return a 7-bit MIME-encoded string
+    suitable for use in a From, To, Cc, or other email header
+    field.
+
+    The encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+
+    The charset defaults to "iso-8859-1", and the encoding defaults
+    to email.Utils.QP.
+
+    The resulting string will be in the format:
+
+    "=?charset?q?Kevin_Phillips_B=F6ng?= <philips@slightly.silly.party.go.uk>"
+
+    and can be included verbatim in an email header field. Even
+    very long addresses are handled properly with this method:
+
+    "=?charset?q?T=E4rquin_Fintimlinbinhinbimlim_Bus_St=F6p_Poontang_Poont?=\r\n
+      =?charset?q?ang_Ol=E9_Biscuit-Barrel?=\r\n
+      <tarquin@very.silly.party.go.uk>"
+    """
+
+    return encode_header_chunks([ [real_name, charset, encoding],
+                                  ["<%s>" % address, None, None] ])
+
+def encode_header(header, charset="iso-8859-1", encoding=QP):
+    """MIME-encode a general email header field (eg. Subject).
+
+    Given an 8-bit header string, and optionally its charset and the
+    encoding you wish to use, return a 7-bit MIME-encoded string
+    suitable for use in a general email header (but most useful for
+    the Subject: line).
+
+    The encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+
+    The charset defaults to "iso-8859-1", and the encoding defaults
+    to email.Utils.QP.
+    """
+    return encode_header_chunks([[header, charset, encoding]])
+
+def encode_header_chunks(header_chunks):
+    """MIME-encode a header with many different charsets and/or encodings.
+
+    Given a list of triplets [ [string, charset, encoding] ], return a
+    MIME-encoded string suitable for use in a header field. Each triplet
+    may have different charsets and/or encodings, and the resulting header
+    will accurately reflect each setting.
+
+    Each encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+
+    Each triplet will be represented on a separate line; the resulting
+    string will be in the format:
+
+    "=?charset1?q?Mar=EDa_Gonz=E1lez_Alonso?=\r\n
+      =?charset2?b?SvxyZ2VuIEL2aW5n?="
+    """
+    chunks = []
+
+    for header, charset, encoding in header_chunks:
+        encoded = ""
+        encoding_char = ""
+
+        if encoding is None:
+            _chunk_append(chunks, header)
+        else:
+            if encoding is QP:
+                _chunk_append(chunks, header_qencode(header, charset))
+
+            elif encoding is BASE64:
+                _chunk_append(chunks, header_bencode(header, charset))
+
+    return CRLFSPACE.join(chunks)
+

--
Brought to you by the letters A and H and the number 10.
"Wuzzle means to mix."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/
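
For anyone who wants to try the Q-encoding rules from the patch without applying it, here is a minimal standalone sketch. It is written for Python 3 rather than the Python 2 of the patch, and the name q_encode_header is hypothetical (it is not part of the patch or of the email package). It uses the same safe-character set and the same "charset name plus 7" overhead calculation as header_qencode, though its wrap test is slightly different, and it assumes a single-byte charset such as iso-8859-1; a multi-byte charset would need extra care so that a character is never split across two encoded-words. The round trip at the end relies on the standard library's email.header.decode_header.

```python
from email.header import decode_header

CRLFSPACE = "\r\n "
# Bytes the patch's header_qencode() passes through unencoded.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789!*+-/"
)


def q_encode_header(text, charset="iso-8859-1", maxlinelen=75):
    """Hypothetical helper: Q-encode text as one or more encoded-words."""
    raw = text.encode(charset)
    # "=?" + "?q?" + "?=" add 7 characters around the charset name.
    room = maxlinelen - len(charset) - 7
    pieces = [""]
    for byte in raw:
        if byte == 0x20:
            token = "_"              # space is written as an underscore
        elif byte in SAFE:
            token = chr(byte)        # safe ASCII stays readable
        else:
            token = "=%02X" % byte   # everything else becomes =XX
        # Start a new encoded-word rather than split a token in two.
        if pieces[-1] and len(pieces[-1]) + len(token) > room:
            pieces.append("")
        pieces[-1] += token
    return CRLFSPACE.join("=?%s?q?%s?=" % (charset, p) for p in pieces)


encoded = q_encode_header("Kevin Phillips B\u00f6ng")
print(encoded)  # =?iso-8859-1?q?Kevin_Phillips_B=F6ng?=

# Round trip through the standard library's decoder.
decoded = b"".join(part for part, _ in decode_header(encoded))
print(decoded.decode("iso-8859-1"))
```

Breaking only between tokens means an =XX escape is never cut in half, which is the "quoted-printable aware" wrapping the docstrings describe.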
participants (1): Ben Gertzfield