Proposal: option for UTF-8 emails without base64 encoding
Hi all,
due to Python defaults, Mailman behaves strangely when processing UTF-8 emails. When no header or footer is configured, Mailman passes UTF-8 emails through in their original form, i.e. 8bit. However, when either a header or a footer is configured, Mailman uses Python's email libraries to add it, and as a side effect converts 8bit emails into 7bit base64-encoded ones.
This is highly undesirable in some cases. For instance, a mailing list might be used to distribute trouble tickets or other content that is expected to be easily parsable by automated text-based utilities. With base64, emails grow in size by about 33% and tend to get much higher spam scores, since base64 is commonly used by spammers to obfuscate the payload. There are, of course, many more reasons for not using base64 as the primary encoding method for UTF-8 email.
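To make the size and parsability points concrete, here is a small sketch using only Python's standard `base64` module. The ticket text is a made-up sample, and the code is written for Python 3 rather than the Python 2 that Mailman 2.1 runs on.

```python
import base64

# Hypothetical sample body: a short UTF-8 ticket line repeated 50 times.
body = ("Ticket: stav změněn na 'vyřešeno'\n" * 50).encode("utf-8")

# encodebytes() wraps its output at 76 characters per line, as MIME bodies do.
encoded = base64.encodebytes(body)

overhead = len(encoded) / len(body) - 1
print(f"raw: {len(body)} bytes, base64: {len(encoded)} bytes, "
      f"overhead: {overhead:.0%}")

# A plain byte search finds the word in the 8bit body, but not in the
# base64 form, which has to be decoded before any text tool can use it.
needle = "vyřešeno".encode("utf-8")
print(needle in body, needle in encoded)
```

The overhead comes out slightly above one third because of the line breaks that the wrapping adds.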
The fix is quite simple and is already widely used by other projects. All that needs to be done is to redefine Python's UTF-8 charset properties, i.e. in every place where you have
from email.Charset import Charset
you need to add:
email.Charset.add_charset('utf-8',email.Charset.SHORTEST, None, None)
With this setting, Mailman will keep the 8bit encoding even when it adds a header or footer, and won't down-convert to 7bit+base64. So I'd like to propose the above addition, at least as a configurable option if there is any concern that enabling it by default could cause problems.
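The effect of the override can be checked outside Mailman. The sketch below assumes Python 3, where the module is spelled `email.charset` (the `email.Charset` spelling above is the Python 2 one). The sample text is arbitrary. It is not Mailman code, and `add_charset` changes the registry for the whole process.

```python
import email.charset
from email.mime.text import MIMEText

text = "Příliš žluťoučký kůň"  # arbitrary non-ASCII sample body

# With Python's default registry entry, UTF-8 bodies are base64-encoded.
before = MIMEText(text, "plain", "utf-8")

# The proposed override: shortest encoding for headers, none for bodies.
email.charset.add_charset("utf-8", email.charset.SHORTEST, None, None)

# Messages built after the override keep the body as 8bit.
after = MIMEText(text, "plain", "utf-8")

print(before["Content-Transfer-Encoding"])
print(after["Content-Transfer-Encoding"])
```

Only messages created after the call are affected, which is why the override has to run before Mailman starts building messages.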
Thanks, Petr
On Fri, May 01, 2009 at 09:01:32AM +0200, Petr Hroudný wrote:
> With base64, emails grow in size by about 33% and tend to get much higher spam scores, since base64 is commonly used by spammers to obfuscate the payload. There are, of course, many more reasons for not using base64 as the primary encoding method for UTF-8 email.
+1 for the proposed switch away from base64. This will be a boon for ibiblio lists.
Thanks,
Cristóbal Palmer
ibiblio.org systems administrator
cdla.unc.edu research assistant
On May 1, 2009, at 3:01 AM, Petr Hroudný wrote:
> due to Python defaults, Mailman behaves strangely when processing UTF-8 emails. When no header or footer is configured, Mailman passes UTF-8 emails through in their original form, i.e. 8bit. However, when either a header or a footer is configured, Mailman uses Python's email libraries to add it, and as a side effect converts 8bit emails into 7bit base64-encoded ones.
Just to be clear, we're talking about headers and footers added by
appending the text to the main body, right? When headers and footers
are attached via MIME, Mailman always does the right thing?
> This is highly undesirable in some cases. For instance, a mailing list might be used to distribute trouble tickets or other content that is expected to be easily parsable by automated text-based utilities. With base64, emails grow in size by about 33% and tend to get much higher spam scores, since base64 is commonly used by spammers to obfuscate the payload. There are, of course, many more reasons for not using base64 as the primary encoding method for UTF-8 email.
One option of course would be to simply not add headers and footers
for these types of mailing lists. Another option, which I proposed in
bug 373083, is to add an option to force a mailing list to always
attach headers and footers as MIME. You risk some users not seeing
them because their MUA doesn't display them inline and they don't know
how to click on the attachments, but I think that's an acceptable risk
for the minority of mailing lists that care about this. At least for
MM2.1.
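As a rough illustration of the MIME-attachment alternative (not Mailman's actual implementation), the sketch below wraps an 8bit message and a footer in a multipart/mixed container using Python 3's standard `email` package. The footer text and sample body are hypothetical. A real implementation would also have to move the original message's top-level headers onto the wrapper.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical incoming message body part, already 8bit UTF-8.
raw = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    + "Příliš žluťoučký kůň\n".encode("utf-8")
)
original = message_from_bytes(raw)

# Hypothetical ASCII footer, marked inline so MUAs may display it directly.
footer = MIMEText("-- \nHypothetical list footer\n", "plain", "us-ascii")
footer["Content-Disposition"] = "inline"

# The original part is attached as-is, so its body is never re-encoded.
wrapper = MIMEMultipart("mixed")
wrapper.attach(original)
wrapper.attach(footer)

print(wrapper.get_content_type())
print(original["Content-Transfer-Encoding"])
```

Because the footer lives in its own part, the original body's transfer encoding is left alone, at the cost of the display risk described above.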
> The fix is quite simple and is already widely used by other projects. All that needs to be done is to redefine Python's UTF-8 charset properties, i.e. in every place where you have
> from email.Charset import Charset
> you need to add:
> email.Charset.add_charset('utf-8',email.Charset.SHORTEST, None, None)
> With this setting, Mailman will keep the 8bit encoding even when it adds a header or footer, and won't down-convert to 7bit+base64. So I'd like to propose the above addition, at least as a configurable option if there is any concern that enabling it by default could cause problems.
I'll note that you don't need to change Mailman at all to do this.
You simply need to add this to your mm_cfg.py file and I'll bet it
will just work for you.
-Barry
2009/5/15 Barry Warsaw <barry@list.org>:
>> The fix is quite simple and is already widely used by other projects. All that needs to be done is to redefine Python's UTF-8 charset properties, i.e. in every place where you have
>> from email.Charset import Charset
>> you need to add:
>> email.Charset.add_charset('utf-8',email.Charset.SHORTEST, None, None)
>> With this setting, Mailman will keep the 8bit encoding even when it adds a header or footer, and won't down-convert to 7bit+base64. So I'd like to propose the above addition, at least as a configurable option if there is any concern that enabling it by default could cause problems.
> I'll note that you don't need to change Mailman at all to do this. You simply need to add this to your mm_cfg.py file and I'll bet it will just work for you.
It works indeed! Just a slightly different syntax needs to be used:
import email.Charset
email.Charset.add_charset('utf-8', email.Charset.SHORTEST, None, None)
It would probably make sense to document this somewhere, or perhaps to introduce a new config option for MM2.2 that does this. Almost all MTAs today are 8bit clean, so an option to work in 8bit mode will surely be attractive to many.
Thanks, Petr
P.S. I responded to your other comments in the notes on bug #373083.
participants (3)
- Barry Warsaw
- Cristóbal Palmer
- Petr Hroudný