[Mailman-i18n] "Funny" characters in real names?

Martin v. Löwis loewis@informatik.hu-berlin.de
15 Sep 2002 19:38:05 +0200

barry@zope.com (Barry A. Warsaw) writes:

> I must be dense because I'm not quite seeing how this will work.
> This doesn't tell me enough either does it?

You are running into one of the most awful oddities of HTTP and
i18n. In short, the encoding of the page that contained the form is
used to encode the form contents :-(

The RFC says the browser SHOULD declare the encoding for each field in
the per-field MIME header of a multipart/form-data message. None of the
browsers does that. I filed bug reports for all of them, and the Mozilla
people responded that they can't do that because many CGI scripts
break when they get a charset= parameter (it won't fit their regexp).

The RFC says, as a fall-back, the browser should use the encoding of
the HTML page which contained the form. Mailman doesn't declare a
charset in the administrative pages, but it should.
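To illustrate why the declared page charset matters, here is a minimal
sketch in Python 3. The function names are made up for this example and
are not Mailman's API; the point is only that the same submitted bytes
mean different things depending on which charset the server assumes.

```python
def content_type_header(charset: str) -> str:
    """Build the HTTP header line that declares the page charset (hypothetical helper)."""
    return "Content-Type: text/html; charset=" + charset


def decode_field(raw: bytes, page_charset: str) -> str:
    """Decode a submitted form field, assuming the browser used the page's charset."""
    return raw.decode(page_charset)


# The same bytes, as they might arrive from a form submission.
raw = b"L\xf6wis"

as_latin1 = decode_field(raw, "iso-8859-1")  # the intended reading
as_koi8r = decode_field(raw, "koi8-r")       # a different, wrong reading

print(content_type_header("iso-8859-1"))
print(as_latin1, as_koi8r)
```

Without a declared charset, the server has no reliable way to tell which
of these readings the browser intended.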

It may happen that the user enters a character which cannot be
represented in the charset of the page. In this case, Mozilla sends a
'?' (question mark), so you can only tell that there was a character,
but not which one. Internet Exploder sends an HTML entity, which gives
you more information, but is indistinguishable from the case where the
user actually typed an ampersand-digits sequence.
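The two behaviours can be imitated with Python 3's codec error
handlers. This is only an analogy under the assumption that the page is
ISO-8859-1; it is not what the browsers run internally, and the sample
name is invented.

```python
# "José Łukasz": é exists in ISO-8859-1, Ł (U+0141) does not.
name = "Jos\u00e9 \u0141ukasz"
page_charset = "iso-8859-1"

# Mozilla-like: the unrepresentable character is replaced by '?'.
mozilla_like = name.encode(page_charset, errors="replace")

# IE-like: the unrepresentable character becomes a numeric character reference.
ie_like = name.encode(page_charset, errors="xmlcharrefreplace")

# A user who literally typed the seven characters "&#321;" produces the same bytes.
typed_literally = "Jos\u00e9 &#321;ukasz".encode(page_charset)

print(mozilla_like)                 # b'Jos\xe9 ?ukasz'
print(ie_like)                      # b'Jos\xe9 &#321;ukasz'
print(ie_like == typed_literally)   # True
```

The last comparison is the ambiguity: the server cannot tell whether
"&#321;" stands for a character or was entered verbatim.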

For Mailman, this gives two options:

1. Each administrative page should be encoded in the list's "native"
   charset. This will allow adding names in that charset.

2. Each page should be encoded in UTF-8. This will allow entering
   arbitrary names, but will require recoding to the list's charset
   later (or using UTF-8 in the To: fields as well).
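For option 2, the recoding step might look roughly like the following
Python 3 sketch. The function name recode_name is hypothetical, and
falling back to UTF-8 when the list charset cannot represent the name
is one possible policy, not something Mailman is known to do.

```python
from typing import Tuple


def recode_name(raw_utf8: bytes, list_charset: str) -> Tuple[bytes, str]:
    """Recode a UTF-8 form value to the list's charset.

    Returns (encoded bytes, charset actually used). If the name does not
    fit the list charset, keep UTF-8 rather than lose characters.
    """
    text = raw_utf8.decode("utf-8")
    try:
        return text.encode(list_charset), list_charset
    except UnicodeEncodeError:
        return raw_utf8, "utf-8"


# Fits ISO-8859-1, so it is recoded.
print(recode_name("L\u00f6wis".encode("utf-8"), "iso-8859-1"))

# Does not fit ISO-8859-1, so it stays UTF-8.
print(recode_name("\u0141ukasz".encode("utf-8"), "iso-8859-1"))
```

The returned charset would then be the one to declare when the name is
put into a To: header.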

Actually, it appears that Mailman already does 1, in the HTTP
header. Barry, what is the charset of your admin pages?