About Mailman's unicode-enabled Message subclass

Hi * !
I'm trying to fix bug 1060951 (that I reported), and no matter how I poke at it, I can't find the proper way to solve it, so I'd like to share my thoughts with you.
https://bugs.launchpad.net/mailman/+bug/1060951
The issue is with our unicode-enabled subclass of Python's email.message.Message class. Our subclass converts headers to unicode in the get() and __getitem__() methods. However, when an email has an attachment, one way of representing the filename is to percent-encode (urlencode) it, as described in RFC 2231. The urldecoding is therefore done on a unicode string, and the result is then passed to email.utils.collapse_rfc2231_value, which tries to decode it to unicode a second time. This fails, because a string that is already unicode can't be decoded again.
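The ordering problem can be sketched as follows. This is only a Python 3 illustration (the code under discussion is Python 2), and the percent-encoded value is an assumed example chosen to match the strings quoted later in this message; nothing here is taken from Mailman's source.

```python
from urllib.parse import unquote, unquote_to_bytes

# Assumed example: the percent-encoded part of an RFC 2231 parameter such as
#   filename*=utf-8''d%C3%A9jeuner.txt
encoded = "d%C3%A9jeuner.txt"

# Intended order: percent-decode to raw bytes, then decode the charset once.
raw = unquote_to_bytes(encoded)
decoded_once = raw.decode("utf-8")

# Percent-decoding straight into text maps every byte to one code point.
# latin-1 is used here to mimic what unquoting a Python 2 unicode string
# yields: the two UTF-8 bytes survive as two separate characters.
unquoted_as_text = unquote(encoded, encoding="latin-1")

print(repr(decoded_once))
print(repr(unquoted_as_text))
```

The two results differ only in the accented character: one code point in the first case, two in the second.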
I tried re-implementing more of the original Message class to avoid this second decoding, but then I get another problem: the filename has already been urldecoded. Here's an example to make it clearer:
If I use Mailman's Message implementation, I get:
The only difference is that the resulting tuple contains unicode strings. Now when I try to pass this result to email.utils.collapse_rfc2231_value, like Message.get_filename() does:
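For reference, the same call sequence can be reproduced with the standard library alone. This is a Python 3 sketch, not the original Python 2 session; the message text is an assumed minimal example.

```python
import email
from email.utils import collapse_rfc2231_value

# Assumed minimal message carrying an RFC 2231 encoded filename.
raw_message = (
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment; filename*=utf-8''d%C3%A9jeuner.txt\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw_message)

# For an RFC 2231 parameter, get_param() returns a
# (charset, language, value) tuple rather than a plain string.
param = msg.get_param("filename", header="content-disposition")

# collapse_rfc2231_value() turns that tuple into the final text; this is
# the step Message.get_filename() performs internally.
filename = collapse_rfc2231_value(param)

print(param)
print(repr(filename))
```

The third tuple element still holds the percent-decoded bytes one per character; the charset decoding only happens inside collapse_rfc2231_value.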
And if I suppress the final decoding of what's already a unicode string, I get u'd\xc3\xa9jeuner.txt'.
And, as you can see, the original u'd\xe9jeuner.txt' is not the same as u'd\xc3\xa9jeuner.txt' (the second one is encoded twice).
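The relationship between the two strings can be checked directly. The sketch below is Python 3, and undo_double_decoding is a hypothetical helper written for illustration; it is not a proposed fix for the bug and not part of Mailman or the email package.

```python
def undo_double_decoding(text, charset="utf-8"):
    """Hypothetical helper: recover text whose charset bytes were stored
    one per code point (as if they had been decoded with latin-1)."""
    try:
        return text.encode("latin-1").decode(charset)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look mangled in this way; leave it unchanged.
        return text


expected = "d\xe9jeuner.txt"

# Reproduce the mangled form: UTF-8 bytes read back one byte per character.
mangled = expected.encode("utf-8").decode("latin-1")

print(repr(mangled))
print(repr(undo_double_decoding(mangled)))
```

Such a round trip is a guess, since a legitimate filename could happen to contain the same character sequence. That is why avoiding the premature conversion to unicode seems preferable to repairing the string afterwards.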
In the end, our subclass of Message can't extract non-ASCII filenames. Bug 1060951 contains a test case for this; it's a replacement for src/mailman/email/tests/test_message.py: https://bugs.launchpad.net/mailman/+bug/1060951
I'm really interested in any insight on this issue. Thanks for reading all that :-)
Aurélien Bompard