[Mailman-Developers] Re: handling multi-byte characters in templates

Sat, 21 Sep 2002 10:05:28 +0900

Jason R. Mastaler wrote:
> Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
> 
> 
>>Therefore, Japanese messages are best handled as follows:
>>1. Use euc-jp internally when processing messages and patterns.
>>2. Convert the message charset from iso-2022-jp to euc-jp when it
>>    first enters the processing pipeline.
>>3. Convert back to iso-2022-jp when the message goes out.
> 
> 
> Thank-you for the thorough explanation.  I have a few more questions
> about this.
> 
> Do you know what people use for Japanese character code conversion
> these days in Python?  I see that Mailman seems to be converting the
> templates from euc-jp to iso-2022-jp when sending out mail messages,
> but can't figure out where this is being done in the code.

The conversion happens when an internally crafted message is generated
through the Mailman.Message class, which uses the email package.
email/Charset.py defines the following:

# Defaults
CHARSETS = {
     # input        header enc  body enc output conv
     'iso-8859-1':  (QP,        QP,      None),
     'iso-8859-2':  (QP,        QP,      None),
     'us-ascii':    (None,      None,    None),
     'big5':        (BASE64,    BASE64,  None),
     'gb2312':      (BASE64,    BASE64,  None),
     'euc-jp':      (BASE64,    None,    'iso-2022-jp'),
     'shift_jis':   (BASE64,    None,    'iso-2022-jp'),
     'iso-2022-jp': (BASE64,    None,    None),
     'koi8-r':      (BASE64,    BASE64,  None),
     'utf-8':       (BASE64,    BASE64,  'utf-8'),
     }

Read the source for more. This implementation was mostly done
by Ben Gertzfield, who I believe is on this list.
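
As a rough illustration of how the table above is consulted, here is a
minimal sketch. It assumes the Python 3 standard library, where the module
is spelled email.charset (lowercase) rather than email/Charset.py, and
where the Japanese codecs are built in. The sample text is made up.

```python
from email.charset import Charset, BASE64

# Look up the table entry for euc-jp.
cs = Charset('euc-jp')

# The three columns of the table: header encoding, body encoding,
# and the charset the text is converted to on output.
uses_base64_headers = (cs.header_encoding == BASE64)
body_encoding = cs.body_encoding        # None means no transfer encoding
output_charset = cs.output_charset      # the "output conv" column

# With no body encoding, body_encode() only converts the text to the
# output charset (iso-2022-jp is 7-bit, so the result is an ASCII str).
body = cs.body_encode('日本語')
```

So text tagged as euc-jp leaves as iso-2022-jp without any further body
encoding, while its headers are base64 encoded.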

Unfortunately, this automatic conversion applies only to internally
generated messages (I suppose), so I'm writing conversion modules for
steps 2 and 3. They are in http://mm.tkikuchi.net/mailman-2.1.ja/Mailman/Handlers/
but they are written in the pre-email style and need rewriting.
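
To give an idea of what steps 2 and 3 amount to, here is a minimal sketch.
It is not the code of the handlers at the URL above; it assumes Python 3
and its built-in codecs, and the function names body_as_internal and
body_for_wire are hypothetical.

```python
from email import message_from_bytes

def body_as_internal(raw_message, internal='euc-jp'):
    """Step 2: return the message body re-encoded in the internal charset."""
    msg = message_from_bytes(raw_message)
    wire = msg.get_content_charset('us-ascii')   # charset declared by the sender
    payload = msg.get_payload(decode=True)       # bytes, transfer encoding undone
    return payload.decode(wire).encode(internal)

def body_for_wire(internal_bytes, internal='euc-jp', wire='iso-2022-jp'):
    """Step 3: convert the internal bytes back to the outgoing charset."""
    return internal_bytes.decode(internal).encode(wire)

# A made-up incoming message for illustration.
raw = (b'Content-Type: text/plain; charset="iso-2022-jp"\n'
       b'Content-Transfer-Encoding: 7bit\n'
       b'\n' + 'テスト\n'.encode('iso-2022-jp'))

internal = body_as_internal(raw)
wire = body_for_wire(internal)
```

A real handler would also have to deal with multipart messages and with
headers, which this sketch ignores.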

> 
> Also, I found the JapaneseCodecs package for Python.  The README says:
> 
>   "By using this package, Japanese characters can be treated as a
>   character string instead of a byte sequence."

It looks like they mean "a 'unicode' character string."
> 
> This makes it seem like if I used JapaneseCodecs, no conversion would
> be necessary -- I just could store templates in iso-2022-jp and the
> special characters like `%' wouldn't interfere.  Does this sound
> right?

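No. iso-2022-jp is a 7-bit encoding, so the bytes of a multi-byte
character can include '%' itself. You must either escape '%' as '%%',
or convert to 'euc-jp' or 'unicode' before template substitution, and
then un-escape or convert back to 'iso-2022-jp' afterwards, of course.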
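
The '%' problem can be seen in a small sketch like the following. It
assumes Python 3's built-in codecs rather than the JapaneseCodecs package,
and the template text and the render() helper are made up for
illustration.

```python
# Hypothetical template: katakana for "list" plus one real placeholder.
template = 'リスト: %(listname)s'

# In iso-2022-jp every katakana is two 7-bit bytes and the first one is
# 0x25, which is '%'. In euc-jp the same characters use high-bit bytes.
as_jis = template.encode('iso-2022-jp')
as_euc = template.encode('euc-jp')
jis_percent_count = as_jis.count(b'%')
euc_percent_count = as_euc.count(b'%')

def render(template_bytes, values, stored='euc-jp', wire='iso-2022-jp'):
    """Decode the stored template, substitute in unicode, encode for output."""
    text = template_bytes.decode(stored)
    return (text % values).encode(wire)

out = render(as_euc, {'listname': 'mailman-developers'})
```

Only the euc-jp (or unicode) form has exactly one '%', the one that belongs
to the placeholder, so only that form is safe to feed to the substitution.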

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/