[Mailman-Developers] About Mailman's unicode-enabled Message subclass

Mon Dec 1 17:45:58 CET 2014

Hi * !

I'm trying to fix bug 1060951 (that I reported), and no matter how I poke
at it, I can't find the proper way to solve it, so I'd like to share my
thoughts with you.

https://bugs.launchpad.net/mailman/+bug/1060951

The issue is with our unicode-enabled subclass of Python's
email.message.Message class. Our subclass converts headers to unicode in
the get() and __getitem__() methods. However, when an email has an
attachment, one of the ways of representing the filename is to urlencode
(percent-encode) it (RFC 2231). The urldecoding is thus done on a unicode
string, and the result is then passed to
email.utils.collapse_rfc2231_value, which tries to decode it to unicode
again. Of course, this fails: a unicode string can't be decoded twice.
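
For reference, here is a minimal sketch of the decoding order that RFC 2231
implies for a single extended value: split off the charset and language,
percent-decode to raw octets, and only then decode those octets with the
declared charset. It is written in Python 3 syntax (the sessions in this
message are Python 2), and decode_extended_value is a hypothetical helper,
not something from Mailman or the email package.

```python
from urllib.parse import unquote_to_bytes


def decode_extended_value(raw):
    """Decode one RFC 2231 extended value: charset'language'percent-encoded.

    Hypothetical helper for illustration only; it ignores parameter
    continuations (filename*0*, filename*1*, ...).
    """
    charset, language, encoded = raw.split("'", 2)
    # Percent-decoding yields octets, not text.
    octets = unquote_to_bytes(encoded)
    # The octets are decoded exactly once, with the declared charset.
    return octets.decode(charset or "us-ascii")


filename = decode_extended_value("UTF-8''d%C3%A9jeuner.txt")
```

The point of the sketch is the ordering: as soon as the percent-decoding
step produces text instead of octets, the charset decode that follows has
nothing valid to work on.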

I tried re-implementing more of the original Message class to avoid this
second decoding, but then I get another problem: the filename has already
been urldecoded. Here's an example to make it clearer:

>>> m1 = email.message.Message()
>>> m1["content-disposition"] = "attachment; filename*=UTF-8''d%C3%A9jeuner.txt"
>>> m1.get_param("filename", header="content-disposition")
('UTF-8', '', 'd\xc3\xa9jeuner.txt')

If I use Mailman's Message implementation, I get:

>>> m2 = mailman.email.message.Message()
>>> m2["content-disposition"] = "attachment; filename*=UTF-8''d%C3%A9jeuner.txt"
>>> m2.get_param("filename", header="content-disposition")
(u'UTF-8', u'', u'd\xc3\xa9jeuner.txt')

The only difference is that the resulting tuple contains unicode strings.
Now when I try to pass this result to email.utils.collapse_rfc2231_value,
like Message.get_filename() does:

>>> p1 = m1.get_param("filename", header="content-disposition")
>>> email.utils.collapse_rfc2231_value(p1)
u'd\xe9jeuner.txt'

>>> p2 = m2.get_param("filename", header="content-disposition")
>>> email.utils.collapse_rfc2231_value(p2)
TypeError: decoding Unicode is not supported

And if I suppress the final decoding of what's already a unicode string,
I get:
u'd\xc3\xa9jeuner.txt'

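And, as you can see, the original u'd\xe9jeuner.txt' is not the same as
u'd\xc3\xa9jeuner.txt'. The second one is effectively encoded twice: the
two bytes of the UTF-8 encoding of "é" (0xC3 0xA9) have each become a
separate code point.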
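
The relationship between the two values can be checked directly. The sketch
below (Python 3 syntax, whereas the sessions above are Python 2) shows that
the second string is what you get when the UTF-8 octets are read one code
point per byte, as Latin-1 would read them, and that encoding it back to
Latin-1 recovers the octets. repair_double_decoded is a hypothetical name,
and the round trip only works while every code point is below U+0100.

```python
expected = "d\xe9jeuner.txt"      # what the stdlib Message yields
mangled = "d\xc3\xa9jeuner.txt"   # what we get once the last decode is skipped

# The mangled value is the UTF-8 encoding of the expected one,
# with each octet turned into its own code point.
assert expected.encode("utf-8").decode("latin-1") == mangled


def repair_double_decoded(text, charset):
    """Turn code points back into octets, then decode with the real charset.

    Hypothetical helper; raises UnicodeEncodeError if text contains a
    code point above U+00FF.
    """
    return text.encode("latin-1").decode(charset)


repaired = repair_double_decoded(mangled, "utf-8")
```

This is only a way to see what went wrong. Whether such a round trip
belongs in our Message subclass is exactly the open question.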

In the end, our subclass of Message can't extract non-ASCII filenames.
Bug 1060951 contains a testcase for this; it's a replacement for
src/mailman/email/tests/test_message.py:
https://bugs.launchpad.net/mailman/+bug/1060951

I'm really interested in any insight on this issue. Thanks for reading all
that :-)

Aurélien