[Mailman-Developers] I18n proposal

Mikhail Zabaluev mhz@alt-linux.org
Fri, 23 Nov 2001 02:13:54 +0300


Hello Ben,

On Wed, Nov 21, 2001 at 06:29:41PM +0900, Ben Gertzfield wrote:
>
>     Mikhail> The most serious bug I see here is that messages encoded
>     Mikhail> in base64 still get decorated with plaintext.
> 
> Headers or bodies? 

Headers hurt too: when someone replies and the mailing list label in
the subject is hidden inside a base64 encoded word, Mailman does not
see it and slaps on another label, ad infinitum. The subject contents
should be decoded prior to searching for the label.
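
A minimal sketch of the decode-then-search idea, assuming the
`email.header` module from current Python rather than Mailman's own
code. The helper names `subject_has_label` and `add_label` are
hypothetical.

```python
from email.header import decode_header, make_header


def subject_has_label(raw_subject, label):
    """Return True if `label` occurs in the decoded subject text."""
    # decode_header() splits the header into (bytes, charset) chunks;
    # make_header() joins them back into one Unicode string.
    decoded = str(make_header(decode_header(raw_subject)))
    return label.lower() in decoded.lower()


def add_label(raw_subject, label):
    """Prefix the subject with `label` unless it is already there."""
    if subject_has_label(raw_subject, label):
        return raw_subject
    return "%s %s" % (label, raw_subject)
```

Searching the raw header would miss a label that sits inside an
encoded word, which is how the labels pile up.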

> Are you talking about the footer tacked on to the
> end of messages?  If so, it would be simple with the new message
> structure to make the footer be a separate text part.  Though, I
> don't see how adding some plain text after the end of the boundary
> could be corrupted; could you put an example corrupted message up?

I cannot find one right now, but I see them every now and then on our
Russian lists. Base64 is not a robust encoding: plain text appended to
a base64 stream is not valid base64 and comes out as garbage when
decoded. Decorating such messages with a separate MIME part would be a
better solution than fiddling with decoding and re-encoding.
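
A rough sketch of the separate-part approach for a single-part
message, assuming the `email` package from current Python. The
function name `decorate` is hypothetical and this is not Mailman's
actual decoration code. The encoded body is copied as is, so nothing
is decoded or re-encoded.

```python
from email.message import Message
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Headers that describe the body and therefore move to the inner part.
BODY_HEADERS = ("content-type", "content-transfer-encoding")


def decorate(msg, footer):
    """Wrap a single-part message and attach the footer as its own part."""
    body = Message()
    for name in BODY_HEADERS:
        if msg[name] is not None:
            body[name] = msg[name]
    # The payload stays in its transfer encoding (e.g. base64), untouched.
    body.set_payload(msg.get_payload())

    outer = MIMEMultipart("mixed")  # sets its own Content-Type, MIME-Version
    for name, value in msg.items():
        if name.lower() not in BODY_HEADERS + ("mime-version",):
            outer[name] = value
    outer.attach(body)
    outer.attach(MIMEText(footer, "plain", "us-ascii"))
    return outer
```

Messages that are already multipart would need different handling,
which this sketch leaves out.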

>     Mikhail> Another problem is encoded messages in archives. Heck,
>     Mikhail> look at this list's archive to see what I'm talking
>     Mikhail> about. Those should also be decoded and have character
>     Mikhail> set converted to some uniform one. I'd suggest UTF-8, but
>     Mikhail> many browsers and text viewers still don't grok this
>     Mikhail> charset, so it'd better be selectable as well.
> 
> I talked with Barry about this today.  My solution is to "guess" the
> character set based on whichever is most common in the archives, and
> use that as the charset specified in the HTML.

Guessing is unreliable, the result can change over time, and it will
certainly cause problems. Give the administrator control over which
charset the list archives are served in. For storage, I'd still
convert everything to UTF-8; this makes the archives independent of
the target charset and resolves the problems with multi-language
messages.

> For any messages with
> multi-language subjects or bodies, the main language will be left
> in the normal character set, and the multi-language parts will be
> encoded with the UTF-8 HTML entity.

For starters, this could be done for all non-ASCII characters.
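
A small sketch of how UTF-8 storage and an administrator-selected
output charset could fit together, assuming current Python's built-in
codecs. The function names are hypothetical. Characters that the
target charset cannot represent become numeric HTML character
references, and with `"ascii"` as the target that covers every
non-ASCII character.

```python
import html


def store_as_utf8(raw_bytes, source_charset):
    """Convert a message body from its declared charset to UTF-8."""
    # Undecodable bytes are replaced rather than aborting the archiver.
    return raw_bytes.decode(source_charset, "replace").encode("utf-8")


def render_for_archive(stored_utf8, target_charset):
    """Encode stored text for an HTML page served in `target_charset`."""
    text = html.escape(stored_utf8.decode("utf-8"), quote=False)
    # xmlcharrefreplace turns unrepresentable characters into &#NNNN;
    return text.encode(target_charset, "xmlcharrefreplace")


# Example: Cyrillic fits KOI8-R, Japanese falls back to references.
page = render_for_archive("Привет 日本".encode("utf-8"), "koi8-r")
```

Here `target_charset` stands for whatever the list administrator has
configured; nothing is guessed from the archive contents.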

> This will require Python unicode codecs for all our languages, which
> do not exist for KOI-8, Big5, or GB, as far as I know.

iconv-based codecs should exist for these; I will have to check.
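
One way to check a given installation is to ask the codec registry
directly. This is only a sketch: the helper name `codec_available` is
hypothetical, and which codecs are present depends on the Python
version and on any extra codec packages installed.

```python
import codecs


def codec_available(name):
    """Return True if this Python can look up a codec called `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for name in ("koi8-r", "big5", "gb2312"):
    print(name, codec_available(name))
```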

-- 
Stay tuned,
  MhZ                                     JID: mookid@jabber.org
___________
The best audience is intelligent, well-educated and a little drunk.
		-- Maurice Baring