[Python-Dev] Re: [I18n-sig] Planned updates for cjkcodecs before 2.4a1

Wed Jun 16 05:33:59 EDT 2004

Hye-Shik Chang wrote:
> I have planned a few things to update in cjkcodecs before 2.4 alpha1
> is out.  If you have any opinions or objections, please tell me.
> 
> 1. Update JIS X 0213 to its first amendment (a.k.a. JIS X 0213:2004)
>    This will introduce three new encodings: euc-jis-2004, shift_jis-2004
>    and iso-2022-jp-2004.  They are not very different from their
>    predecessors, but we may need to keep both versions because of
>    incompatibilities and the encoding name changes.  (This won't bloat
>    code size much. I expect around 3~5K.)

+1
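
For illustration only, a minimal sketch of how the three names from item 1 could be exercised. It assumes a current CPython 3 in which codecs under these names are available (the proposal above predates them); the sample string and the roundtrip() helper are made up for the example.

```python
import codecs

# Encoding names as spelled in the proposal. codecs.lookup() normalizes
# case and hyphens, so these spellings are assumed to resolve.
NAMES = ["euc-jis-2004", "shift_jis-2004", "iso-2022-jp-2004"]


def roundtrip(text, encoding):
    """Encode text, decode it again, and return (bytes, decoded text)."""
    data = text.encode(encoding)
    return data, data.decode(encoding)


if __name__ == "__main__":
    sample = "日本語"  # hypothetical sample; plain JIS X 0208 characters
    for name in NAMES:
        info = codecs.lookup(name)
        data, back = roundtrip(sample, name)
        print(info.name, len(data), "bytes, round trip ok:", back == sample)
```

The same string encoded with the older codecs (euc-jisx0213 and friends) can be compared byte for byte to see where the amendment actually changes the output.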

> 2. Merge two or three similar C codecs into one.  We currently have
>    one C codec for each Python codec.  My idea is to merge them into
>    several groups of similar codecs, so that the code the .so binaries
>    have in common is no longer duplicated:
> 
>      _codecs_jacodecs_1.so: euc-jp, shift-jis, iso-2022-jp,
>                             iso-2022-jp-1, iso-2022-jp-ext
>      _codecs_jacodecs_2.so: euc-jisx0213, shift-jisx0213, iso-2022-jp-3,
>                             euc-jis-2004, shift-jis-2004,
>                             iso-2022-jp-2004
>      _codecs_jacodecs_3.so: iso-2022-jp-2
>      _codecs_kocodecs_1.so: euc-kr, johab, iso-2022-kr
>      _codecs_kocodecs_2.so: cp949
>      _codecs_zhcodecs_1.so: gb2312, gbk, gb18030, hz
>      _codecs_zhcodecs_2.so: big5, cp950

+1, but why not put all Japanese codecs into one module and
ditto for the Korean and Chinese ones?

Note that today's OS linkers will only mmap those pieces
of code into the process memory that are actually needed
by the application, so even though the size of the modules
increases, the application's process memory footprint is
not likely to increase.

> 3. Split some of the mapping data modules into a few group-based
>    modules. This will save memory and disk space for those who need
>    only legacy codecs, e.g. "euc-kr only".
> 
>      _codecs_mapdata_ko_KR ->
>          _codecs_komapdata_1.so: KS X 1001
>          _codecs_komapdata_2.so: cp949
> 
>      _codecs_mapdata_ja_JP ->
>          _codecs_jamapdata_1.so: JIS X 0208, JIS X 0212
>          _codecs_jamapdata_2.so: JIS X 0213:2000 and :2004
> 
>      _codecs_mapdata_zh_CN ->
>          _codecs_zhmapdata_1.so: gb2312, gbk, gb18030
> 
>      _codecs_mapdata_zh_TW ->
>          _codecs_zhmapdata_2.so: big5, cp950
> 

-1

See above: this is static C data, so splitting these won't
really buy the user anything.

If you don't believe this, compare the resident size of
Python with and without unicodedata loaded. The difference
on my machine is a measly 30kB, not the 250kB of the complete
module.
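
A rough way to run that comparison, as a sketch only: it assumes a Unix-like system (the resource module is not available on Windows) and a build where unicodedata is a separate extension module. The CHILD script and measure_import_cost() are names made up for the example, and it reports peak rather than current resident size, so read the result as an upper bound on the growth.

```python
import subprocess
import sys

# Script run in a fresh interpreter so that nothing is imported beforehand.
# ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS.
CHILD = """
import resource
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
import unicodedata
unicodedata.name("A")  # touch the tables so some pages are faulted in
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(before, after)
"""


def measure_import_cost():
    """Return (before, after) peak RSS of a child interpreter, in platform units."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        check=True,
    )
    before, after = (int(field) for field in result.stdout.split())
    return before, after


if __name__ == "__main__":
    before, after = measure_import_cost()
    print("peak RSS grew by", after - before, "(platform units)")
```

The growth can then be set against the on-disk size of the unicodedata shared object to see how much of the static data was actually paged in.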

> If these sound acceptable for python-dev people, they will be
> implemented as CJKCodecs 1.1 first and imported into python later
> (before 2.4a1).

-- 
Marc-Andre Lemburg
eGenix.com
