Mailman 3 ISO-8859-1/Latin1 vs UTF-8 - Mailman-Users

ISO-8859-1/Latin1 vs UTF-8

Bernd Petrovitsch

Oct. 23, 2005

2:58 p.m.

Apparently all of the German translation of Mailman is in ISO-8859-1 (or ISO-8859-15) - at least in the standard Debian mailman package.

Is there a special reason for not moving to UTF-8 in general?

Bernd

-- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services


Hannah Schroeter

October 2005

3:52 a.m.

Hello!

On Sun, Oct 23, 2005 at 11:58:37PM +0200, Bernd Petrovitsch wrote:

...

Apparently all of the German translation of Mailman is in ISO-8859-1 (or ISO-8859-15) - at least in the standard Debian mailman package.

...

Is there a special reason for not moving to UTF-8 in general?

I'd say: YAGNI (Ya ain't gonna need it). If the charset is declared correctly, it isn't worse for German text. And it has less overhead then.

...

Bernd

Kind regards,

Hannah.

Bernd Petrovitsch

4:04 a.m.

On Mon, 2005-10-24 at 12:52 +0200, Hannah Schroeter wrote: [...]

...

If I don't need it, I do not care.

...

correctly, it isn't worse for German text. And it has less overhead then.

The problem is that I enter text in the web interface on a default UTF-8 system and it is apparently stored as UTF-8. The pages are delivered as ISO-8859-1 according to the HTTP header and the header in the file. So the CGI scripts should actually convert the list admin's data from the given charset into ISO-8859-1 (which is not 100% possible, but for German it will be good enough in practice IMO). The other solution is to use &ouml; and friends and blame the ML admin if he doesn't do so (and put that somewhere into the docs).

There is BTW a similar problem: If I enter " (quotes) in the web interface, they are converted to &quot; (probably by the browser - I didn't check) and then sent out in a text/plain email.

So in both cases I actually want to know what the way to go is.

Bernd

-- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services

Brad Knowles

7:19 a.m.

At 1:04 PM +0200 2005-10-24, Bernd Petrovitsch wrote:

...

These two are actually a result of the same thing -- the Mailman code scrubs input text from the web interface, to try to prevent cross-site scripting attacks. If you want this stuff to show up correctly, you will need to edit the template files directly as opposed to using the web interface.

If you go back and use the web interface to edit that text again, it will get re-scrubbed, so once you go the template route, you need to stick with editing those files directly.

-- Brad Knowles, <brad@stop.mail-abuse.org>

"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."

 -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
 Assembly to the Governor, November 11, 1755

SAGE member since 1995. See <http://www.sage.org/> for more info.

Mark Sapiro

9:52 a.m.

Bernd Petrovitsch wrote:

...

As Brad points out in another reply, some of this problem is because all text entered in the web interface (except for General Options->info which is a special case) is HTML escaped to prevent XSS attacks. Mailman arguably goes overboard on this, but the 4 characters '&' '<' '>' and '"' are changed respectively to &amp;, &lt;, &gt; and &quot; by Python's cgi.escape() function.

Thus, you can't even enter &ouml; and have it work in HTML or plain text.
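To make the escaping concrete, here is a minimal sketch of what happens to text typed into the web interface. It is not Mailman's actual code: cgi.escape() was removed in Python 3.8, so the sketch assumes html.escape(..., quote=True) as a stand-in (which additionally escapes the single quote), and the function name escape_like_web_ui is made up for illustration.

```python
import html


def escape_like_web_ui(text: str) -> str:
    """Escape &, <, > and " roughly as described in the thread.

    Assumption: html.escape() stands in for the old cgi.escape();
    unlike cgi.escape() it also escapes the single quote.
    """
    return html.escape(text, quote=True)


# The four characters are replaced by their entities.
escaped = escape_like_web_ui('Tom & "Jerry" <cat>')
print(escaped)  # Tom &amp; &quot;Jerry&quot; &lt;cat&gt;

# An entity typed by the admin does not survive either:
# its leading '&' is itself escaped.
entity = escape_like_web_ui("Gr&ouml;&szlig;e")
print(entity)  # Gr&amp;ouml;&amp;szlig;e
```

The second call shows why typing an entity such as &ouml; does not help: the browser later displays the literal text "&ouml;" instead of the umlaut.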

You can convert Mailman to use utf-8 for German language, but this will not solve the html escaping issue. If you are interested in converting to utf-8, there is relevant information in the archives of this list. See the thread beginning at http://mail.python.org/pipermail/mailman-users/2005-September/046467.html and continued at http://mail.python.org/pipermail/mailman-users/2005-September/046480.html and the thread beginning at http://mail.python.org/pipermail/mailman-users/2005-October/046850.html and continued at http://mail.python.org/pipermail/mailman-users/2005-October/046883.html and http://mail.python.org/pipermail/mailman-users/2005-October/046938.html

-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Bernd Petrovitsch

11:05 a.m.

On Mon, 2005-10-24 at 09:52 -0700, Mark Sapiro wrote: [...]

...

As Brad points out in another reply, some of this problem is because all text entered in the web interface (except for General Options->info which is a special case) is HTML escaped to prevent XSS attacks. Mailman arguably goes overboard on this, but the 4 characters

Which is a good thing.

...

'&' '<' '>' and '"' are changed respectively to &amp;, &lt;, &gt; and &quot; by Python's cgi.escape() method.

Makes sense. Hmm, mailman could replace those four entities with the ASCII chars just for the text/plain parts of sent out emails. That should not open any security hole and would yield real plain text.
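The suggestion above could look roughly like the following sketch. This is only an illustration of the idea, not something Mailman does; the function name unescape_four is hypothetical. Only the four entities named in the thread are reversed, and &amp; is handled last so that text which was escaped twice is not unescaped twice.

```python
def unescape_four(stored: str) -> str:
    """Turn the four escaped entities back into plain ASCII characters.

    Intended only for text that ends up in a text/plain mail part,
    never for text that is rendered as HTML.
    """
    # "&amp;" must come last, otherwise "&amp;lt;" would become "<".
    replacements = (
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&quot;", '"'),
        ("&amp;", "&"),
    )
    for entity, char in replacements:
        stored = stored.replace(entity, char)
    return stored


plain = unescape_four("&quot;A &amp; B&quot; &lt;x&gt;")
print(plain)  # "A & B" <x>
```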

...

Thus, you can't even enter &ouml; and have it work in HTML or plain text.

Yes, of course, these are two distinct issues. Sorry for confusion.

...

You can convert Mailman to use utf-8 for German language, but this will not solve the html escaping issue. If you are interested in converting to utf-8, there is relevant information in the archives of this list.

I actually reported a bug (though it may not sound so): I enter (apparently) UTF-8 text (with Firefox, if that is important) and it comes back disguised as (and as part of) ISO-8859-1 text. The question is: Which part is doing something wrong and how to fix it?

Bernd

-- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services

Mark Sapiro

2:05 p.m.

Bernd Petrovitsch wrote:

...

What happens here is that Mailman creates the web page with the META tag in the header

<META http-equiv="Content-Type" content="text/html; charset=xxxx">

where xxxx is the encoding of the language of the list (default iso-8859-1 for German), but the web server sends its own HTTP Content-Type: header specifying charset=utf-8. For reasons I don't understand, the HTML standard says the server-provided Content-Type: charset takes priority over that specified by an HTML META tag.

Thus your browser sets its encoding as utf-8, but mailman thinks what it gets back is iso-8859-1 and thus garbles the multibyte unicode sequences.
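The garbling is easy to reproduce outside of Mailman. A small sketch (the sample string is made up): the browser submits UTF-8 bytes, and the receiving side decodes them as ISO-8859-1, so every umlaut turns into two characters.

```python
# What the admin types into the form.
typed = "Grüne Bäume, schön"

# The browser believes the page is UTF-8 and submits UTF-8 bytes.
sent_bytes = typed.encode("utf-8")

# The receiving side assumes ISO-8859-1 and decodes byte by byte.
misread = sent_bytes.decode("iso-8859-1")
print(misread)  # GrÃ¼ne BÃ¤ume, schÃ¶n

# Nothing is lost yet: reversing the wrong step restores the text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == typed)  # True
```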

It can be fixed by setting the 'German' character set to utf-8 and recoding the German language templates, messages and list archives in utf-8 as discussed in the archive threads I mentioned previously.
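The recoding step itself is mechanical. A minimal sketch, assuming the existing files really are ISO-8859-1 (the helper name and the sample text are made up, and the archive threads describe the actual Mailman-specific steps): the guard leaves data alone if it already decodes as UTF-8, so running the conversion twice does not double-encode.

```python
def recode_latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8, leaving valid UTF-8 untouched."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume ISO-8859-1, which maps every byte.
        return data.decode("iso-8859-1").encode("utf-8")
    # Already valid UTF-8 (or plain ASCII): nothing to do.
    return data


latin1_template = "Willkommen auf der Liste für Österreich".encode("iso-8859-1")
utf8_template = recode_latin1_to_utf8(latin1_template)
print(utf8_template.decode("utf-8"))  # Willkommen auf der Liste für Österreich
```

The guard is a heuristic: German ISO-8859-1 text containing umlauts is practically never valid UTF-8, but it is not a proof of the original encoding.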

Alternatively, it can be addressed in the web server by configuring it so it doesn't specify these documents as utf-8.

-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Bernd Petrovitsch

2:45 p.m.

On Mon, 2005-10-24 at 14:05 -0700, Mark Sapiro wrote:

...

Bernd Petrovitsch wrote:

...
I actually reported a bug (though it may not sound so): I enter (apparently) UTF-8 text (with Firefox, if that is important) and it comes back disguised as (and as part of) ISO-8859-1 text. The question is: Which part is doing something wrong and how to fix it?

What happens here is that Mailman creates the web page with the META tag in the header

<META http-equiv="Content-Type" content="text/html; charset=xxxx">

where xxxx is the encoding of the language of the list (default iso-8859-1 for German), but the web server sends its own http Content-Type: header specifying charset=utf-8. For reasons I don't understand, the HTML standard says the server provided Content-Type: charset takes priority over that specified by an HTML META tag.

I don't understand it either but it is so. BTW I usually disable the feature in the webserver config.

...

Thus your browser sets its encoding as utf-8, but mailman thinks what it gets back is iso-8859-1 and thus garbles the multibyte unicode sequences.

It can be fixed by setting the 'German' character set to utf-8 and recoding the German language templates, messages and list archives in utf-8 as discussed in the archive threads I mentioned previously.

Done. I now have a German and an English template, both specifying UTF-8 as charset *and* containing UTF-8 text (especially the German one). But the crazy thing is that the English page is delivered as "UTF-8" and the German one as "ISO-8859-1", according to "Page Info" in Firefox and, on the shell, wget --post-data="language=de" -S https://lists.funkfeuer.at/mailman/listinfo/user, as you (and everybody else) can see on https://lists.funkfeuer.at/mailman/listinfo/user. The German summary on both pages was entered through the list administrator's web interface.

...

Alternatively, it can be addressed in the web server by configuring it so it doesn't specify these documents as utf-8.

This is IMHO the case.

---- snip ----
711#grep AddDef /etc/apache2/apache2.conf
AddDefaultCharset off
---- snip ----

Bernd

-- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services

Stephen J. Turnbull

November 2005

12:53 a.m.

...

"Bernd" == Bernd Petrovitsch <bernd@firmix.at> writes:

>> Content-Type: header specifying charset=utf-8. For reasons I
>> don't understand, the HTML standard says the server provided
>> Content-Type: charset takes priority over that specified by an
>> HTML META tag.

Bernd> I don't understand it either but it is so.

One reason is that the server may very well translate the encoding based on negotiation with the client. (I guess you could argue that it should remove the charset attribute from the META tag if it does.)

A second reason is that admins will occasionally translate encodings and not even be aware that some users who are too smart for their own good have used META tags.

Bernd> BTW I usually disable the feature in the webserver config.

While I think the standard got the precedence right, this feature should be disabled by default.

-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

Bernd Petrovitsch

4:32 a.m.

Sorry for wrong threading but I accidentally deleted the last email here:

...

One reason is that the server may very well translate the encoding based on negotiation with the client. (I guess you could argue that

Yes, but *if* the encoding is negotiated, a default value does not make much sense (apart from the situation where no negotiation takes place at all).

...

it should remove the charset attribute from the META tag if it does.)

Technically: *If* the webserver hands out a file and explicitly states its encoding, it must get it right. If that means "parse the .html file, find the appropriate headers, convert the file and rewrite all relevant headers etc.", then it absolutely must do it. At least in Apache (1 and 2) there is ATM no mechanism which tries to do that AFAIK.

And you are right with the unquoted part: A default makes absolutely no sense here.

...

A second reason is that admins will occasionally translate encodings and not even be aware that some users who are too smart for their own good have used META tags.

Well, but the real reason for the problem is somewhere else - namely in between the admin and the users. And we probably now have the situation that most of the .html producing users and tools are dumb enough to actually need such crazy options ....

BTW I wonder why there are no "check the files on a webspace" scripts out there which simply check this ...
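Such a check is not hard to sketch. The following is only an illustration under stated assumptions, not an existing tool: the function and class names are made up, the Content-Type header value and the page body are passed in by the caller (fetching them is left out), and only the http-equiv form of the META tag discussed in this thread is recognised.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional, Tuple


def header_charset(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first META http-equiv Content-Type tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.charset = header_charset(attr.get("content") or "")


def charset_mismatch(content_type_header: str, body: bytes) -> Optional[Tuple[str, str]]:
    """Return (served, declared) if header and META tag disagree, else None."""
    parser = MetaCharsetParser()
    # The META tag itself is ASCII, so a lossy decode is good enough here.
    parser.feed(body.decode("ascii", errors="replace"))
    served = header_charset(content_type_header)
    declared = parser.charset
    if served and declared and served != declared:
        return served, declared
    return None


page = (b'<html><head><META http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head><body></body></html>')
print(charset_mismatch("text/html; charset=utf-8", page))  # ('utf-8', 'iso-8859-1')
print(charset_mismatch("text/html", page))  # None
```

Comparing the names literally is a simplification: aliases such as latin1 and iso-8859-1 would be reported as a mismatch.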

Bernd

-- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services

