Mailman 3 Encoding problems - Mailman-Users

Encoding problems


Allan Odgaard

17 Jun 2008

2:53 a.m.

Some of my subscribers have accents and similar in their name and I
had to do the following post install to have Mailman properly work
with these:

## CLI

In order to get list_members -f «list» to properly output non-ASCII
user names I had to put the following:

 import sys
 sys.setdefaultencoding('utf-8')

Into /etc/python2.5/sitecustomize.py. This is despite proper setup
of LC_CTYPE on the system. Seems to me Mailman should use the
encoding of the current locale, not this site-wide Python default
encoding (settable by root only).
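A minimal sketch of the locale-based approach suggested above, written for current Python 3 rather than the Python 2.5 in use here. The helper name `encode_for_terminal` is hypothetical and is not part of Mailman; it only illustrates taking the output encoding from the locale instead of from a site-wide default.

```python
import locale


def encode_for_terminal(name, encoding=None):
    """Encode a member name for CLI output.

    Hypothetical helper, not Mailman code. If no encoding is given,
    use the one implied by the current locale (LC_CTYPE and friends).
    Characters the encoding cannot represent are replaced with '?'.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return name.encode(encoding, errors="replace")
```

With a UTF-8 locale the name passes through unchanged; with an ASCII-only locale the accented characters degrade to `?` instead of raising an error.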

## Web

For the web page forms to accept non-ASCII I had to put this:

 add_language('en', 'English', 'utf-8')

Into /etc/mailman/mm_cfg.py. I think utf-8 should be the default
because even on an English list, you can use non-ASCII punctuation,
glyphs, and many European subscribers will have non-ASCII in their
names.

## Mailing List

The mailing list letters are correct _except_ that the body now
contains this:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64

And yes, each letter sent to the list is converted into base64.

I tried disabling the above utf-8 changes, but it did not seem to fix
it. But it might be that the list language (containing utf-8) was
copied at list creation time, so I will effectively have to recreate
the list (or write Python code) to change this?

Using Mailman 2.1.9 (Ubuntu installation).


Brad Knowles

17 Jun 2008

4:08 a.m.

On 6/17/08, Allan Odgaard wrote:

...

For the web page forms to accept non-ASCII I had to put this:

 add_language('en', 'English', 'utf-8')

Into /etc/mailman/mm_cfg.py. I think utf-8 should be the default because even on an English list, you can use non-ASCII punctuation, glyphs, and many European subscribers will have non-ASCII in their names.

True enough, but the ISO Latin-1 encoding should be sufficient for most European subscribers. And many of the developers for Python in general and Mailman in particular are from European countries, so internationalization is an issue that we pay a lot of attention to.

In fact, waiting for i18n to catch up is what tends to delay our major releases by at least a month or two -- the basic code is there and ready to go, but the localization strings in the various languages aren't.

...

The mailing list letters are correct _except_ that the body now contains this:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64

The charset is exactly what you previously created. The content-transfer-encoding is standard MIME encoding for non-ASCII charsets that cannot be represented in quoted-printable, or where more than a certain percentage of characters in a given message would have to be encoded.

There's absolutely nothing abnormal about that.

...

I tried disabling the above utf-8 changes, but it did not seem to fix it. But it might be that the list language (containing utf-8) was copied at list creation time, so I will effectively have to recreate the list (or write Python code) to change this?

The charset would have been set at list creation time, yes. But the content-transfer-encoding setting follows from standard practice for MIME encoding.

If you want to go fix the MIME encoding routines to work some other way, you're going to have a lot more heavy lifting to do. Among other things, all your changes to this code will get wiped out with the next upgrade, so you'll need to make sure you keep a copy to the side which you can use to re-patch the code to implement things the way you want them.

Alternatively, go talk to the authors of the MIME encoding routines and see if you can get them to change their minds.

-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>

Mark Sapiro

18 Jun 2008

12:49 a.m.

Allan Odgaard wrote:

...

Some of my subscribers have accents and similar in their name and I
had to do the following post install to have Mailman properly work
with these:

## CLI

In order to get list_members -f «list» to properly output non-ASCII
user names I had to put the following:

 import sys
 sys.setdefaultencoding('utf-8')

Into /etc/python2.5/sitecustomize.py. This is despite proper setup
of LC_CTYPE on the system. Seems to me Mailman should use the
encoding of the current locale, not this site-wide Python default
encoding (settable by root only).

I am aware of this issue, but I think the place to fix it is Python, not Mailman.

...

## Web

For the web page forms to accept non-ASCII I had to put this:

 add_language('en', 'English', 'utf-8')

Into /etc/mailman/mm_cfg.py. I think utf-8 should be the default
because even on an English list, you can use non-ASCII punctuation,
glyphs, and many European subscribers will have non-ASCII in their
names.

The web UI is scheduled for overhaul. This will probably be addressed.

...

## Mailing List

The mailing list letters are correct _except_ that the body now
contains this:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64

And yes, each letter sent to the list is converted into base64.

Again, this is the Python email library. If you prefer quoted-printable, set the language's charset to iso-8859-1.
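The charset-to-encoding mapping described here can be inspected directly in the Python email library. The sketch below uses the current Python 3 `email.charset` module; the helper name `body_encoding_name` is hypothetical and only wraps the lookup.

```python
from email.charset import BASE64, QP, Charset


def body_encoding_name(charset_name):
    """Report which body encoding the email library assigns to a charset.

    Hypothetical helper: it only reads Charset.body_encoding, which the
    library fills in from its built-in charset registry.
    """
    encoding = Charset(charset_name).body_encoding
    names = {BASE64: "base64", QP: "quoted-printable"}
    # None means the body is sent without base64 or quoted-printable.
    return names.get(encoding, "none (7bit/8bit)")
```

In the library's default registry, utf-8 bodies map to base64 and iso-8859-1 bodies to quoted-printable, which matches the behaviour seen on the list.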

...

I tried disabling the above utf-8 changes, but it did not seem to fix
it. But it might be that the list language (containing utf-8) was
copied at list creation time, so I will effectively have to recreate
the list (or write Python code) to change this?

The list's preferred language was set at list create time and can be changed at any time thereafter, but nothing about that language's charset is in the list config.

Did you restart Mailman after removing

 add_language('en', 'English', 'utf-8')

from mm_cfg.py? If so, and you're still getting utf-8 encoded list mail, the incoming posts are probably utf-8 encoded.

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Allan Odgaard

9:51 a.m.

On 18 Jun 2008, at 02:49, Mark Sapiro wrote:

...

...
[...] Into /etc/python2.5/sitecustomize.py. This is despite proper setup of LC_CTYPE on the system. Seems to me Mailman should use the encoding of the current locale, not this site-wide Python default encoding (settable by root only).

I am aware of this issue, but I think the place to fix it is Python, not Mailman.

I don’t think Mailman should use sys.getdefaultencoding(). See <http://wiki.python.org/moin/DefaultEncoding>. I think instead locale.getdefaultlocale() should be used for the CLI commands.

...

...
## Mailing List

The mailing list letters are correct _except_ that the body now contains this:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64

And yes, each letter sent to the list is converted into base64.

Again, this is the Python email library. If you prefer quoted-printable, set the language's charset to iso-8859-1.

But I occasionally use stuff that cannot be represented in latin-1…
(like this line)
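Whether a given text fits a charset is easy to check. A minimal sketch follows; the helper name `can_encode` is hypothetical.

```python
def can_encode(text, charset):
    """Return True if every character in text is representable in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True
```

Accented Western European letters such as "é" fit iso-8859-1, but the ellipsis character "…" (U+2026) and the typographic apostrophe "’" (U+2019) do not, so a post containing them needs another charset such as utf-8.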

Anyway, point taken, Python MIME library is at fault and I will direct
my concern to the maintainers of this.

Though I don’t understand why Mailman has to re-encode all letters to
the list encoding. Hopefully it will only do so when the list encoding
can actually represent the letter!?!

...

[...] Did you restart Mailman after removing

 add_language('en', 'English', 'utf-8')

from mm_cfg.py? If so, and you're still getting utf-8 encoded list mail, the incoming posts are probably utf-8 encoded.

I must not have restarted at the proper time while testing, because now
I can get it to work, i.e. messages now arrive without being base64-encoded.

Mark Sapiro

2:09 p.m.

Allan Odgaard wrote:

...

I don’t think Mailman should use sys.getdefaultencoding(). See <http://wiki.python.org/moin/DefaultEncoding>. I think instead locale.getdefaultlocale() should be used for the CLI commands.

Since sys.getdefaultencoding() is going away, we'll have to do something ;)

[...]

...

Though I don’t understand why Mailman has to re-encode all letters to
the list encoding. Hopefully it will only do so when the list encoding
can actually represent the letter!?!

The process of adding the list header and/or footer to the message attempts to add these to a text/plain body by coercing the body and the header/footer to unicode, concatenating them and then coercing back to the original body charset. If the last step doesn't work, it will try to coerce to the charset of the list's preferred language.

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Allan Odgaard

5:14 p.m.

On 18 Jun 2008, at 16:09, Mark Sapiro wrote:

...

[...] The process of adding the list header and/or footer to the message attempts to add these to a text/plain body by coercing the body and the header/footer to unicode, concatenating them and then coercing back to the original body charset. If the last step doesn't work, it will try to coerce to the charset of the list's preferred language.

My list header/footer is pure ASCII. So there should never be a
problem going back to the original body encoding.

So should I consider it a bug that setting list encoding to utf-8 will
(in my experience) _always_ produce (base 64 encoded) utf-8 letters,
when both header/footer and letter itself sent to list is ASCII?

Here is what I did to test: Set list encoding to utf-8 (in mm_cfg.py).
Created a new list (called Test) and subscribed to it. Even the
welcome letter contained:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64
 Subject: =?utf-8?q?Welcome_to_the_=22Test=22_mailing_list?=

Notice here the subject was sent as utf-8 even though it is pure
ASCII, the body was likewise encoded.

I then sent one letter to this list, subject set to Test and body
likewise. I ensured that the produced letter was sent with encoding
set to ASCII. The letter I received as a subscriber was one giant
block of base64 encoded text with charset set to utf-8.

Mark Sapiro

10:24 p.m.

Allan Odgaard wrote:

...

On 18 Jun 2008, at 16:09, Mark Sapiro wrote:

...
[...] The process of adding the list header and/or footer to the message attempts to add these to a text/plain body by coercing the body and the header/footer to unicode, concatenating them and then coercing back to the original body charset. If the last step doesn't work, it will try to coerce to the charset of the list's preferred language.

My list header/footer is pure ASCII. So there should never be a
problem going back to the original body encoding.

Actually, I misspoke above. The preferred encoding is that of the list's preferred language. The incoming message encoding is the fallback.
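The corrected order can be sketched as follows. This is an illustration of the description in this thread, not Mailman's actual code; the function name `decorate_body` and its parameters are hypothetical, and it is written for Python 3.

```python
def decorate_body(body, body_charset, header, footer, list_charset):
    """Add a list header/footer to a text/plain body.

    Hypothetical sketch: body is the raw bytes of the incoming message,
    header and footer are already unicode strings.
    Returns (encoded_bytes, charset_used).
    """
    # Coerce everything to unicode and concatenate.
    text = header + body.decode(body_charset) + footer
    # Prefer the charset of the list's preferred language; the charset
    # of the incoming message is only the fallback.
    for charset in (list_charset, body_charset):
        try:
            return text.encode(charset), charset
        except UnicodeEncodeError:
            continue
    raise ValueError("decorated body fits neither charset")
```

Under this order, a pure-ASCII post to a list whose language charset is utf-8 comes out labelled utf-8, because ASCII text always encodes cleanly as utf-8 and the fallback is never reached.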

...

So should I consider it a bug that setting list encoding to utf-8 will
(in my experience) _always_ produce (base 64 encoded) utf-8 letters,
when both header/footer and letter itself sent to list is ASCII?

You may consider it a bug if you wish. It is intentional (but still perhaps wrong) that the message is coerced to the character set of the list's preferred language when msg_header and/or msg_footer are added.

The base64 encoding for a utf-8 message is a separate issue and is done by the Python email library.
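For reference, the email library allows its default to be overridden per message. The sketch below assumes a current Python 3 (3.5 or later, where `MIMEText` accepts a `Charset` instance) and is not something Mailman 2.1.9 does; the helper name `utf8_text_part` is hypothetical.

```python
from email.charset import QP, Charset
from email.mime.text import MIMEText


def utf8_text_part(text, prefer_qp=False):
    """Build a utf-8 text/plain part, optionally quoted-printable.

    Hypothetical helper. By default the library picks base64 for utf-8
    bodies; setting body_encoding on a private Charset object changes
    that for this message only, without touching the global registry.
    """
    charset = Charset("utf-8")
    if prefer_qp:
        charset.body_encoding = QP
    return MIMEText(text, "plain", charset)
```

The resulting part carries `Content-Transfer-Encoding: quoted-printable` when `prefer_qp` is set and `base64` otherwise, so the choice of encoding sits in the library's charset settings rather than in the list code.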

...

Here is what I did to test: Set list encoding to utf-8 (in mm_cfg.py).
Created a new list (called Test) and subscribed to it. Even the
welcome letter contained:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64
 Subject: =?utf-8?q?Welcome_to_the_=22Test=22_mailing_list?=

This is a 'virgin' message from Mailman which will always be in the charset of the list's or user's preferred language, so no surprise here.

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Allan Odgaard

19 Jun 2008

4:21 a.m.

On 19 Jun 2008, at 00:24, Mark Sapiro wrote:

...

...
So should I consider it a bug that setting list encoding to utf-8
will (in my experience) _always_ produce (base 64 encoded) utf-8 letters, when both header/footer and letter itself sent to list is ASCII?

You may consider it a bug if you wish. It is intentional (but still perhaps wrong) that the message is coerced to the character set of the list's preferred language when msg_header and/or msg_footer are added.

I meant in the sense that I should report it / submit a patch, given
that what I saw was contrary to what you said the behavior should be.
Though given your correction, I assume I should not submit a patch for
this, if it actually is how it is supposed to work.

...

The base64 encoding for a utf-8 message is a separate issue and is
done by the Python email library.

...
Here is what I did to test: Set list encoding to utf-8 (in mm_cfg.py). Created a new list (called Test) and subscribed to it. Even the welcome letter contained:

 Content-Type: text/plain; charset="utf-8"
 Content-Transfer-Encoding: base64
 Subject: =?utf-8?q?Welcome_to_the_=22Test=22_mailing_list?=

This is a 'virgin' message from Mailman which will always be in the charset of the list's or user's preferred language, so no surprise here.

But encoding an ASCII subject? Though maybe this is also the Python
MIME library? I will do some tests with the lib.
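A test along those lines might look like the sketch below, using the current Python 3 `email.header` module rather than the Python 2.5 library in use here. The helper name `encode_subject` is hypothetical.

```python
from email.header import Header, decode_header


def encode_subject(subject, charset):
    """Return the header value the email library emits for a subject.

    Hypothetical helper: it only wraps Header(...).encode().
    """
    return Header(subject, charset).encode()


subject = 'Welcome to the "Test" mailing list'
as_utf8 = encode_subject(subject, "utf-8")
as_ascii = encode_subject(subject, "us-ascii")

# Decoding the RFC 2047 encoded-word gives back the original text.
raw, used_charset = decode_header(as_utf8)[0]
```

When the header is tagged utf-8, the library wraps even pure-ASCII text in an RFC 2047 encoded-word; tagged us-ascii, the same text is emitted unchanged. That points at the charset label, not the content, as the trigger for the encoding.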

Mark Sapiro

2:03 p.m.

Allan Odgaard wrote:

...

On 19 Jun 2008, at 00:24, Mark Sapiro wrote:

...
You may consider it a bug if you wish. It is intentional (but still perhaps wrong) that the message is coerced to the character set of the list's preferred language when msg_header and/or msg_footer are added.

I meant in the sense that I should report it / submit a patch, given
that what I saw was contrary to what you said the behavior should be.
Though given your correction, I assume I should not submit a patch for
this, if it actually is how it is supposed to work.

I'm open to the idea of changing it, but not without input from people with experience with Asian-language lists. The original code comes from someone with experience with Japanese-language lists, and I think there is likely good reason for doing it the way it's done.

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
