[Python-bugs-list] [ python-Bugs-552957 ] email.Utils.encode doesn't obey rfc2047

noreply@sourceforge.net noreply@sourceforge.net
Mon, 17 Jun 2002 10:36:55 -0700


Bugs item #552957, was opened at 2002-05-06 14:13
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=552957&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Ty Sarna (tsarna)
>Assigned to: Barry A. Warsaw (bwarsaw)
Summary: email.Utils.encode doesn't obey rfc2047

Initial Comment:
The email.Utils.encode function has two bugs, which
are somewhat related -- it fails to deal with long
input strings in two different ways.

First, newlines are not allowed in the middle of
rfc2047 encoded-words (per section 2: "[...] white
space characters MUST NOT appear between components of
an 'encoded-word'"). The _bencode and _qencode routines
that the encode function uses include newlines (or
"=\n"'s for quopri) in their output, and the encode
function doesn't remove them. Try encoding a long
string with 'q' for example. The resulting output will
contain one or more "=\n"'s, and the
email.Utils.decode function will not be able to parse it.
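
The wrapping is easy to reproduce. The sketch below is only an illustration: it uses the base64 and quopri modules from the current standard library as stand-ins, on the assumption that they wrap long output the same way _bencode and _qencode do. The 100-byte input is arbitrary.

```python
import base64
import quopri

# 100 high-bit bytes: long enough to force both encoders to wrap their output.
raw = b"\xe9" * 100

b_text = base64.encodebytes(raw)   # inserts "\n" after every 76 characters
q_text = quopri.encodestring(raw)  # inserts soft line breaks, "=\n"

print(b"\n" in b_text, b"=\n" in q_text)

# Stripping the line breaks, as the patch does, leaves text that is
# safe to place inside a single encoded-word and still decodes.
b_clean = b_text.replace(b"\n", b"")
q_clean = q_text.replace(b"=\n", b"")
```

Both cleaned values decode back to the original bytes, so removing the breaks loses nothing.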

Patch:

*** Utils.py.orig       Mon May  6 13:17:05 2002
--- Utils.py    Mon May  6 13:18:16 2002
***************
*** 98,106 ****
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s)
      elif encoding == 'b':
!         estr = _bencode(s)
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)
--- 98,106 ----
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s).replace('=\n','')
      elif encoding == 'b':
!         estr = _bencode(s).replace('\n','')
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)

NOTE: The .replace()-ing should NOT be done in _bencode
and _qencode, because they're used other places where
their current behaviour is fine/expected.


Second problem: rfc2047 specifies that an encoded-word
may be no longer than 75 characters (see section 2).
Also, in the case of, say, a From: header with high-bit
characters in the sender's name, you really want to
encode only the name, not the whole line, so that dumb
mail programs are able to recognize the email address
in the line without having to understand rfc2047.
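
For comparison, email.utils.formataddr in the current standard library behaves this way: it encodes only the display name and leaves the address in plain ASCII. The name and address below are made up for the illustration, and this is the present-day API, not the email.Utils module this report is about.

```python
from email.utils import formataddr

# Hypothetical sender: only the display name contains a non-ASCII character.
header_value = formataddr(("Andr\xe9 Dupont", "andre@example.com"))

# The name becomes an encoded-word; the address stays readable as-is.
print(header_value)
```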

Proposed solution: rename existing encode function
(with above patch applied) to encode_word. Add a new
encode function that splits the input string into a
list of words and whitespace runs.  Words are encoded
individually using encode_word() iff they are not pure
ascii. The results are then concatenated back with
original whitespace.

This still leaves the possibility that a single word,
when encoded, is longer than 75 characters. The
recommended practice in rfc2047 is to use multiple
encoded words separated by CRLF SPACE (or in our case,
"\n ").


Here is code that implements the above:

import re

wsplit = re.compile('([ \n\t]+)').split


def encode(s, charset='iso-8859-1', encoding='q'):
    i = wsplit(s)
    o = []

    # max encoded-word length per rfc2047 section 2 is 75
    # 75 - len("=?" + "?" + "?" + "?=") == 69
    max_enc_text = 69 - len(charset) - len(encoding)
    if encoding == 'q':
        # 3 bytes per character worst case
        safe_wlen = max_enc_text // 3
    elif encoding == 'b':
        # base64 emits 4 characters per 3 input bytes, padding included
        safe_wlen = (max_enc_text // 4) * 3
    else:
        safe_wlen = max_enc_text # ?

    for w in i:
        # the split can yield empty strings at either end of the list
        if not w or w[0] in " \n\t":
            o.append(w)
        else:
            try:
                o.append(w.encode('ascii'))
            except UnicodeError:
                ew = encode_word(w, charset, encoding)
                while len(ew) > 75:
                    o.append(encode_word(w[:safe_wlen], charset, encoding) + "\n ")
                    w = w[safe_wlen:]
                    ew = encode_word(w, charset, encoding)
                o.append(ew)

    return ''.join(o)
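
For anyone trying the idea on a current Python, here is a self-contained sketch of the same approach. It is not the email package's implementation. The encode_word helper is a hypothetical stand-in for the renamed function described above, its 'q' branch escapes more bytes than rfc2047 strictly requires, and the chunk sizes assume a single-byte charset such as iso-8859-1 (one character per byte).

```python
import base64
import re


def encode_word(text, charset="iso-8859-1", encoding="q"):
    """Hypothetical helper: wrap text in a single encoded-word."""
    raw = text.encode(charset)
    if encoding == "b":
        # b64encode never inserts newlines, so nothing needs stripping
        payload = base64.b64encode(raw).decode("ascii")
    elif encoding == "q":
        # Escape everything except ASCII letters and digits (over-cautious)
        payload = "".join(
            chr(b) if b < 128 and chr(b).isalnum() else "=%02X" % b
            for b in raw
        )
    else:
        raise ValueError("Illegal encoding code: " + encoding)
    return "=?%s?%s?%s?=" % (charset.lower(), encoding, payload)


def encode(s, charset="iso-8859-1", encoding="q"):
    """Encode only the non-ASCII words of s, keeping whitespace as-is."""
    # 75 minus the six delimiter characters "=?", "?", "?", "?="
    max_text = 75 - 6 - len(charset) - len(encoding)
    if encoding == "q":
        chunk = max_text // 3        # worst case: 3 characters per byte
    else:
        chunk = (max_text // 4) * 3  # base64: 4 characters per 3 bytes
    out = []
    for token in re.split(r"([ \n\t]+)", s):
        if not token or token.isspace() or token.isascii():
            out.append(token)
            continue
        # Split over-long words into pieces that each fit in 75 characters
        pieces = [token[i:i + chunk] for i in range(0, len(token), chunk)]
        out.append("\n ".join(encode_word(p, charset, encoding) for p in pieces))
    return "".join(out)


print(encode("Andr\xe9 Dupont <andre@example.com>"))
```

Splitting every non-ASCII word up front, rather than encoding first and retrying while the result is too long, avoids encoding the same text twice. The cost is that a word may be split slightly earlier than strictly necessary.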

