[Mailman-i18n] "Funny" characters in real names?
Barry A. Warsaw
barry@zope.com
Sun, 15 Sep 2002 18:34:36 -0400
Thanks Martin, and everyone, I think I know what to do now.
>>>>> "MvL" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:
MvL> The RFC says, as a fall-back, the browser should use the
MvL> encoding of the HTML page which contained the form. Mailman
MvL> doesn't declare a charset in the administrative pages, but it
MvL> should.
I think we'll make things simple and assume a list's preferred
language doesn't change between form presentation and submission. So
if the page has a `lang' key (i.e. the listinfo page, or options
page), we'll use that, otherwise we'll fall back to the list's
preferred language.
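
As a rough sketch of that fallback rule (the function name and the
dict-like `form` argument are hypothetical here, not Mailman's actual
API):

```python
def choose_language(form, preferred_language):
    """Pick the language whose charset will be used to interpret the form.

    `form` is assumed to be a plain mapping of submitted field names to
    values; `preferred_language` is the list's preferred language.
    """
    lang = form.get("lang")
    # Use the page's own `lang' key when present, else the list default.
    return lang if lang else preferred_language
```

So a submission from the options page carrying lang=de would be
interpreted as German, while an admin page without the key would use
whatever the list prefers.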
If that language's charset is ascii or there are only ascii characters
in the name, we'll simply store the name unencoded in the user
database. Otherwise, we'll convert the name to Unicode, and store that
along with the charset. Then for email headers, we'll use the Header
class to encode the name in an RFC conformant way. I don't think this
will be a huge amount of work, although it will require some changes
to the MemberAdaptor API.
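
A minimal sketch of the store-then-encode idea, written against the
standard library's email.header.Header. The two helper names and the
(name, charset) tuple are assumptions for illustration only; the real
MemberAdaptor changes may well look different. It also assumes that a
page served as ascii only ever yields ascii bytes, so trying ascii first
covers both "simple" cases.

```python
from email.header import Header


def prepare_realname(raw, charset):
    """Return a (name, charset) pair to store for a submitted real name.

    `raw` is the byte string received from the form and `charset` is the
    charset of the page's language. charset is None for all-ASCII names.
    """
    try:
        # Only ASCII characters: store the name as is, no charset needed.
        return raw.decode("ascii"), None
    except UnicodeDecodeError:
        # Otherwise convert to Unicode and remember the charset.
        return raw.decode(charset), charset


def header_realname(name, charset):
    """Return the name in a form that is safe to put in an email header."""
    if charset is None:
        return name
    # Header produces an RFC 2047 encoded-word for non-ASCII names.
    return Header(name, charset).encode()
```

For example, prepare_realname(b"L\xf6wis", "iso-8859-1") gives back the
Unicode name plus "iso-8859-1", and header_realname() on that pair
yields a pure-ascii encoded-word.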
(For command line, we'll still insist on ascii in the names, unless
there's a hue and cry -- or patch <wink> -- for something better.)
MvL> It may happen that the user enters a character which cannot
MvL> be represented in the charset of the page. In this case,
MvL> Mozilla sends a '?' (question mark), so you can only tell
MvL> that there was a character, but not which one. Internet
MvL> Exploder sends a HTML entity, which gives you more
MvL> information, but is undistinguishable from the case where the
MvL> user entered an ampersand-digits sequence.
We won't handle these specially. If that's what the browser gives us,
that's what we'll use.
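
To illustrate why we can't do much better, here is a small simulation of
the two browser behaviours Martin describes, using Python's codec error
handlers as stand-ins (the function names are made up for this sketch,
and real browsers may differ in detail):

```python
def mozilla_style(text, charset):
    """Unrepresentable characters are sent as a question mark."""
    return text.encode(charset, "replace")


def ie_style(text, charset):
    """Unrepresentable characters are sent as numeric HTML entities."""
    return text.encode(charset, "xmlcharrefreplace")


# U+0142 (LATIN SMALL LETTER L WITH STROKE) is not in iso-8859-1.
lost = mozilla_style("\u0142", "iso-8859-1")       # b'?'
entity = ie_style("\u0142", "iso-8859-1")          # b'&#322;'
# A user who literally types the six characters "&#322;" sends the
# same bytes, so the server cannot tell the two cases apart.
typed = ie_style("&#322;", "iso-8859-1")           # b'&#322;'
```

In the first case the original character is gone; in the second it is
recoverable only by guessing that the entity wasn't typed on purpose.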
MvL> For Mailman, this gives two options:
MvL> 1. Each administrative page should be encoded in the list's
MvL> "native" charset. This will allow to add names in that
MvL> charset.
MvL> 2. Each page should be encoded in UTF-8. This will allow to
MvL> enter arbitrary names, but will require recoding to the
MvL> list's charset later (or using UTF-8 in the To: fields as
MvL> well).
MvL> Actually, it appears that mailman already does 1, in the HTTP
MvL> header. Barry, what is the charset of your admin pages?
I had tried iso-8859-1 and us-ascii. In us-ascii I got the HTML
entity, but in iso-8859-1 I got the actual character.
Let's go with #1. Thanks.
-Barry