[Mailman-i18n] "Funny" characters in real names?

Barry A. Warsaw barry@zope.com
Sun, 15 Sep 2002 18:34:36 -0400


Thanks Martin, and everyone, I think I know what to do now.

>>>>> "MvL" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:

    MvL> The RFC says, as a fall-back, the browser should use the
    MvL> encoding of the HTML page which contained the form. Mailman
    MvL> doesn't declare a charset in the administrative pages, but it
    MvL> should.

I think we'll make things simple and assume a list's preferred
language doesn't change between form presentation and submission.  So
if the page has a `lang' key (i.e. the listinfo page, or options
page), we'll use that; otherwise we'll fall back to the list's
preferred language.
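
A rough sketch of that selection rule, in case it helps. The function
name and the plain-dict form data are made up for illustration; this
is not Mailman's actual CGI code.

```python
def form_language(form_fields, list_preferred_language):
    """Pick the language whose charset is used to interpret form input.

    form_fields: a dict of submitted form fields (hypothetical shape).
    list_preferred_language: the list's preferred language code.
    """
    # Use the page's own `lang' key when present and non-empty ...
    lang = form_fields.get('lang')
    if lang:
        return lang
    # ... otherwise fall back to the list's preferred language.
    return list_preferred_language
```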

If that language's charset is ascii or there are only ascii characters
in the name, we'll simply store the name unencoded in the user
database.  Otherwise, we'll convert the name to Unicode, and store that
along with the charset.  Then for email headers, we'll use the Header
class to encode the name in an RFC-conformant way.  I don't think this
will be a huge amount of work, although it will require some changes
to the MemberAdaptor API.
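
Something along these lines, as a minimal sketch only.  It assumes
the name arrives as raw bytes from the form; store_realname() and
realname_for_header() are hypothetical helpers, not part of the
MemberAdaptor API, and replacing stray bytes under an ascii charset
is my own assumption.  Only email.header.Header is the real class
mentioned above.

```python
from email.header import Header

def store_realname(raw, charset):
    """Return (name, charset_or_None) as it might go into the user db.

    raw: the submitted name as bytes (assumed input shape).
    charset: the charset of the language chosen for the form.
    """
    if charset.lower() in ('ascii', 'us-ascii'):
        # Assumption: under an ascii charset, stray bytes are replaced.
        return raw.decode('ascii', 'replace'), None
    text = raw.decode(charset)
    if text.isascii():
        # Only ascii characters: store the name with no charset.
        return text, None
    # Otherwise keep the Unicode name together with its charset.
    return text, charset

def realname_for_header(name, charset):
    """Encode a stored name for use in an email header."""
    if charset is None:
        return name
    # Header produces an RFC 2047 encoded-word for non-ascii names.
    return Header(name, charset).encode()

stored = store_realname(b'Martin v. L\xf6wis', 'iso-8859-1')
header_value = realname_for_header(*stored)
```

The charset is kept next to the name so that the header can later be
encoded in the same charset the name was entered in.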

(For the command line, we'll still insist on ascii in the names, unless
there's a hue and cry -- or patch <wink> -- for something better.)

    MvL> It may happen that the user enters a character which cannot
    MvL> be represented in the charset of the page. In this case,
    MvL> Mozilla sends a '?' (question mark), so you can only tell
    MvL> that there was a character, but not which one. Internet
    MvL> Exploder sends a HTML entity, which gives you more
    MvL> information, but is undistinguishable from the case where the
    MvL> user entered an ampersand-digits sequence.

We won't handle these specially.  If that's what the browser gives us,
that's what we'll use.
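
To illustrate why there's nothing sensible to do here, the two
behaviors Martin describes can be mimicked with Python's codec error
handlers.  This only imitates what the browsers send; it is not what
they actually run.

```python
typed = 'L\u00f6wis'  # what the user typed: o-umlaut is U+00F6

# Page charset can represent the character: it arrives as-is.
latin1_bytes = typed.encode('iso-8859-1')

# Page charset is ascii, Mozilla-style: the character becomes '?'.
mozilla_style = typed.encode('ascii', 'replace')

# Page charset is ascii, IE-style: a numeric HTML entity is sent.
ie_style = typed.encode('ascii', 'xmlcharrefreplace')

# A user who literally typed the ampersand-digits sequence sends
# exactly the same bytes, so the two cases can't be told apart.
typed_literally = 'L&#246;wis'.encode('ascii')
ambiguous = (ie_style == typed_literally)
```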

    MvL> For Mailman, this gives two options:

    MvL> 1. Each administrative page should be encoded in the list's
    MvL> "native" charset. This will allow to add names in that
    MvL> charset.

    MvL> 2. Each page should be encoded in UTF-8. This will allow to
    MvL> enter arbitrary names, but will require recoding to the
    MvL> list's charset later (or using UTF-8 in the To: fields as
    MvL> well).

    MvL> Actually, it appears that mailman already does 1, in the HTTP
    MvL> header. Barry, what is the charset of your admin pages?

I had tried iso-8859-1 and us-ascii.  In us-ascii I got the HTML
entity, but in iso-8859-1 I got the actual character.

Let's go with #1.  Thanks.
-Barry