[Mailman-Developers] handling multi-byte characters in templates

Tokio Kikuchi tkikuchi@is.kochi-u.ac.jp
Fri, 20 Sep 2002 09:28:34 +0900


Jason,

Japanese is the most difficult language when you internationalize
applications. ;-)

1. it is multibyte
2. there are three encoding schemes (although there is only one standard: JIS)
     they are iso-2022-jp, shift-jis, and euc-jp
3. iso-2022-jp is used for mail and news messages.
     you will hear many complaints if you use any other encoding,
     even if it is properly declared via MIME.
4. because iso-2022-jp is 7-bit, the bytes of its Japanese characters
     coincide with many special ASCII characters like \,%,&,...
     (the Japanese text is only bracketed by ESC sequences)
5. of the three, euc-jp is the best to use in programming
     because every byte of a Japanese character has its MSB set to 1
     (like UTF-8).
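
As a rough illustration of points 4 and 5, here is a small sketch. It is
written for present-day Python 3 and its standard codecs (iso2022_jp,
shift_jis, euc_jp), not taken from Mailman; the helper name
ascii_range_bytes and the sample string are made up for the example.
Katakana sit in the JIS row whose lead byte is 0x25, which is the ASCII
'%', so they are a convenient test case.

```python
def ascii_range_bytes(text, codec):
    """Return the bytes of `text` in `codec` that fall in the 7-bit ASCII range."""
    return bytes(b for b in text.encode(codec) if b < 0x80)


# Hypothetical sample text (katakana), chosen only for illustration.
sample = "アイウ"

for codec in ("iso2022_jp", "shift_jis", "euc_jp"):
    encoded = sample.encode(codec)
    # Show the raw bytes and the subset a naive ASCII parser would react to.
    print(codec, encoded, ascii_range_bytes(sample, codec))
```

With euc_jp the second column should come out empty, while iso2022_jp
exposes '%' bytes to anything that scans the data as plain ASCII.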

Therefore, Japanese messages are best handled as follows:
1. use euc-jp for internal processing of messages and patterns.
2. convert the message charset from iso-2022-jp to euc-jp when it
    first enters the processing pipeline.
3. convert it back to iso-2022-jp when the message goes out.
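
The three steps above could be sketched roughly like this. It assumes
present-day Python 3 and its standard codecs; to_internal and to_wire are
hypothetical names for the example, not functions in Mailman.

```python
def to_internal(wire_bytes):
    """Step 2: convert an incoming iso-2022-jp payload to euc-jp."""
    return wire_bytes.decode("iso2022_jp").encode("euc_jp")


def to_wire(internal_bytes):
    """Step 3: convert euc-jp back to iso-2022-jp before sending."""
    return internal_bytes.decode("euc_jp").encode("iso2022_jp")


# Hypothetical incoming message body, as it would arrive by mail.
incoming = "エラー".encode("iso2022_jp")

internal = to_internal(incoming)   # step 1: process this form internally
outgoing = to_wire(internal)

print(internal)
print(outgoing == incoming)
```

All pattern matching and string substitution would then happen on the
euc-jp form, between the two conversions.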

Jason R. Mastaler wrote:
> When Mailman.Utils.maketext() does string substitution in a template
> containing multi-byte characters (such as in templates/ja/), how does
> it avoid errors during dictionary interpolation?

euc-jp is used in the templates, so no byte of a Japanese character
can be mistaken for an ASCII %.
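
To make the difference concrete, here is a minimal sketch of dictionary
interpolation on byte strings. It is not Mailman's maketext(); it assumes
Python 3.5 or later, where bytes support %-formatting, and the template
text, the key "name" and the helper try_fill are invented for the example.
The katakana エ is encoded in iso-2022-jp as the two bytes "%(" , which
looks like the start of a format key.

```python
def try_fill(template_bytes, values):
    """%-interpolate a byte template; return None if the format is rejected."""
    try:
        return template_bytes % values
    except (ValueError, TypeError, KeyError) as exc:
        print("interpolation failed:", exc)
        return None


# Hypothetical template text and substitution value.
text = "%(name)s さんのエラー"
values = {b"name": "太郎".encode("euc_jp")}

# The iso-2022-jp bytes contain a stray "%(" from the katakana.
print(try_fill(text.encode("iso2022_jp"), values))

# The euc-jp bytes contain no ASCII-range bytes inside Japanese text.
filled = try_fill(text.encode("euc_jp"), values)
print(filled.decode("euc_jp"))
```

The same template interpolates cleanly once it is stored as euc-jp,
because the only '%' left in the data is the one in the real format key.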
> 
> TMDA is using a nearly identical function to make text from templates,
> but certain multi-byte characters (Japanese in particular) in the
> templates trigger the following exceptions:
> 
>           ValueError: incomplete format key
> 
>           TypeError: not enough arguments for format string
> 
> Someone suggested that the Japanese text probably has characters in it
> that include an ascii % as part of the multi-byte character.
> 
> I'm wondering how Mailman gets around this problem.
> 


-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/