I have planned a few things to update in cjkcodecs before 2.4 alpha1 is out. If you have any opinions or objections, please tell me.

1. Update JIS X 0213 to its first amendment (a.k.a. JIS X 0213:2004). This will introduce three new encodings: euc-jis-2004, shift_jis-2004 and iso-2022-jp-2004. They are not very different from their preceding encodings, but we may need to keep both versions due to incompatibilities and the encoding name change. (This won't bloat code size a lot. I expect around 3~5K.)

2. Merge two or three similar C codecs into one. We currently have one C codec for each Python codec. I have an idea to merge them into several groups of similar codecs, so that many common parts of the .so binaries will be saved:

_codecs_jacodecs_1.so: euc-jp, shift-jis, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-ext
_codecs_jacodecs_2.so: euc-jisx0213, shift-jisx0213, iso-2022-jp-3, euc-jis-2004, shift-jis-2004, iso-2022-jp-2004
_codecs_jacodecs_3.so: iso-2022-jp-2
_codecs_kocodecs_1.so: euc-kr, johab, iso-2022-kr
_codecs_kocodecs_2.so: cp949
_codecs_zhcodecs_1.so: gb2312, gbk, gb18030, hz
_codecs_zhcodecs_2.so: big5, cp950

3. Split some mapping keeper modules into a few group-based modules. This will save memory and space for those who need only legacy codecs, like "euc-kr only".

_codecs_mapdata_ko_KR -> _codecs_komapdata_1.so: KS X 1001
                         _codecs_komapdata_2.so: cp949
_codecs_mapdata_ja_JP -> _codecs_jamapdata_1.so: JIS X 0208, JIS X 0212
                         _codecs_jamapdata_2.so: JIS X 0213:2000 and :2004
_codecs_mapdata_zh_CN -> _codecs_zhmapdata_1.so: gb2312, gbk, gb18030
_codecs_mapdata_zh_TW -> _codecs_zhmapdata_2.so: big5, cp950

If these sound acceptable to python-dev people, they will be implemented as CJKCodecs 1.1 first and imported into Python later (before 2.4a1).

Hye-Shik
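For readers who want to try the three encodings named in item 1, here is a minimal sketch. It assumes a current CPython in which they are available under the Python spellings `euc_jis_2004`, `shift_jis_2004` and `iso2022_jp_2004`; the helper `round_trip` and the sample string are illustrative only and are not part of cjkcodecs.

```python
import codecs

# Python spellings of the three proposed encodings (assumed to be available,
# as they are in current CPython releases).
NEW_CODECS = ["euc_jis_2004", "shift_jis_2004", "iso2022_jp_2004"]


def round_trip(text, encoding):
    """Encode and decode `text`, returning the encoded bytes and the result."""
    info = codecs.lookup(encoding)  # raises LookupError if the codec is missing
    encoded = text.encode(info.name)
    return encoded, encoded.decode(info.name)


sample = "日本語"
results = {name: round_trip(sample, name) for name in NEW_CODECS}
for name, (encoded, decoded) in results.items():
    print(name, encoded, decoded == sample)
```

The same loop can be pointed at the older names (for example `euc_jisx0213`) to compare the byte sequences the two generations produce for a given string.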
Hye-Shik Chang wrote:
> I have planned a few things to update in cjkcodecs before 2.4 alpha1 is out. If you have any opinions or objections, please tell me.
>
> 1. Update JIS X 0213 to its first amendment (a.k.a. JIS X 0213:2004). This will introduce three new encodings: euc-jis-2004, shift_jis-2004 and iso-2022-jp-2004. They are not very different from their preceding encodings, but we may need to keep both versions due to incompatibilities and the encoding name change. (This won't bloat code size a lot. I expect around 3~5K.)

+1

> 2. Merge two or three similar C codecs into one. We currently have one C codec for each Python codec. I have an idea to merge them into several groups of similar codecs, so that many common parts of the .so binaries will be saved:
>
> _codecs_jacodecs_1.so: euc-jp, shift-jis, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-ext
> _codecs_jacodecs_2.so: euc-jisx0213, shift-jisx0213, iso-2022-jp-3, euc-jis-2004, shift-jis-2004, iso-2022-jp-2004
> _codecs_jacodecs_3.so: iso-2022-jp-2
> _codecs_kocodecs_1.so: euc-kr, johab, iso-2022-kr
> _codecs_kocodecs_2.so: cp949
> _codecs_zhcodecs_1.so: gb2312, gbk, gb18030, hz
> _codecs_zhcodecs_2.so: big5, cp950

+1, but why not put all Japanese codecs into one module, and ditto for the Korean and Chinese ones?

Note that today's OS linkers will only mmap those pieces of code into the process memory that are actually needed by the application, so even though the size of the modules increases, the application's process memory footprint is not likely to increase.

> 3. Split some mapping keeper modules into a few group-based modules. This will save memory and space for those who need only legacy codecs, like "euc-kr only".
>
> _codecs_mapdata_ko_KR -> _codecs_komapdata_1.so: KS X 1001
>                          _codecs_komapdata_2.so: cp949
> _codecs_mapdata_ja_JP -> _codecs_jamapdata_1.so: JIS X 0208, JIS X 0212
>                          _codecs_jamapdata_2.so: JIS X 0213:2000 and :2004
> _codecs_mapdata_zh_CN -> _codecs_zhmapdata_1.so: gb2312, gbk, gb18030
> _codecs_mapdata_zh_TW -> _codecs_zhmapdata_2.so: big5, cp950

-1

See above: this is static C data, so splitting these won't really buy the user anything. If you don't believe this, compare the resident size of Python with and without unicodedata loaded. The difference on my machine is a measly 30kB, not the 250kB of the complete module.

> If these sound acceptable to python-dev people, they will be implemented as CJKCodecs 1.1 first and imported into Python later (before 2.4a1).
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 16 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
On Wed, Jun 16, 2004 at 11:33:59AM +0200, M.-A. Lemburg wrote:
> Hye-Shik Chang wrote:
[snip]
>> 2. Merge two or three similar C codecs into one. We currently have one C codec for each Python codec. I have an idea to merge them into several groups of similar codecs, so that many common parts of the .so binaries will be saved:
>>
>> _codecs_jacodecs_1.so: euc-jp, shift-jis, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-ext
>> _codecs_jacodecs_2.so: euc-jisx0213, shift-jisx0213, iso-2022-jp-3, euc-jis-2004, shift-jis-2004, iso-2022-jp-2004
>> _codecs_jacodecs_3.so: iso-2022-jp-2
>> _codecs_kocodecs_1.so: euc-kr, johab, iso-2022-kr
>> _codecs_kocodecs_2.so: cp949
>> _codecs_zhcodecs_1.so: gb2312, gbk, gb18030, hz
>> _codecs_zhcodecs_2.so: big5, cp950
>
> +1, but why not put all Japanese codecs into one module, and ditto for the Korean and Chinese ones?
>
> Note that today's OS linkers will only mmap those pieces of code into the process memory that are actually needed by the application, so even though the size of the modules increases, the application's process memory footprint is not likely to increase.

Okay. But what about embedded or frozen environments, or codecs compiled statically into Python by uncommenting them in Modules/Setup? If somebody needs to support only legacy Japanese encodings, he will want to include the legacy mapping (70K) but will not want the JIS X 0213 mapping (85K) or the KS X 1001 and GB2312 mappings (200K, for iso-2022-jp-2). And he may want to save space by just erasing files. In fact, I don't know how real Japanese developers use them; I just guessed. :)

[snip]
> If you don't believe this, compare the resident size of Python with and without unicodedata loaded. The difference on my machine is a measly 30kB, not the 250kB of the complete module.

I do believe this. This is also why I wrote cjkcodecs as C extensions rather than in pure Python.

Hye-Shik
On Wed, Jun 16, 2004 at 08:16:52PM +0900, Hye-Shik Chang wrote:
> On Wed, Jun 16, 2004 at 11:33:59AM +0200, M.-A. Lemburg wrote:
>> Hye-Shik Chang wrote:
> [snip]
>>> 2. Merge two or three similar C codecs into one. We currently have one C codec for each Python codec. I have an idea to merge them into several groups of similar codecs, so that many common parts of the .so binaries will be saved:
>>
>> +1, but why not put all Japanese codecs into one module, and ditto for the Korean and Chinese ones?
>>
>> Note that today's OS linkers will only mmap those pieces of code into the process memory that are actually needed by the application, so even though the size of the modules increases, the application's process memory footprint is not likely to increase.
>
> Okay. But what about embedded or frozen environments, or codecs compiled statically into Python by uncommenting them in Modules/Setup? If somebody needs to support only legacy Japanese encodings, he will want to include the legacy mapping (70K) but will not want the JIS X 0213 mapping (85K) or the KS X 1001 and GB2312 mappings (200K, for iso-2022-jp-2). And he may want to save space by just erasing files. In fact, I don't know how real Japanese developers use them; I just guessed. :)

Aah. While taking a shower, I realized that the iso-2022-jp-2 problem can be resolved by making the codecs load mapping tables on demand. (They currently load the mappings in their init functions.) I agree with incorporating all CJK codecs into per-language codec collection modules. Thanks for the comments!

Hye-Shik
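The load-on-demand idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the real codecs are C extensions, and `LazyTable`, `TABLES` and `decode_char` are hypothetical names with tiny stand-in tables, not cjkcodecs APIs.

```python
class LazyTable:
    """Hypothetical holder that builds its mapping only on first use."""

    def __init__(self, loader):
        self._loader = loader
        self._table = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self):
        if self._table is None:
            self._table = self._loader()
            self.load_count += 1
        return self._table


# Stand-in tables with one entry each; real mappings hold thousands of entries.
TABLES = {
    "jisx0208": LazyTable(lambda: {0x3021: "\u4e9c"}),
    "ksx1001": LazyTable(lambda: {0x3021: "\uac00"}),
}


def decode_char(charset, code):
    """Look up one code in a charset, loading that charset's table if needed."""
    return TABLES[charset].get()[code]


first = decode_char("jisx0208", 0x3021)
second = decode_char("jisx0208", 0x3021)  # reuses the already loaded table
print(first, TABLES["jisx0208"].load_count, TABLES["ksx1001"].load_count)
```

With this shape, a program that only ever sees JIS X 0208 text never triggers the loader for the KS X 1001 table, which is the property the on-demand approach is after for iso-2022-jp-2.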
Hye-Shik Chang wrote:
> On Wed, Jun 16, 2004 at 11:33:59AM +0200, M.-A. Lemburg wrote:
>> Hye-Shik Chang wrote:
> [snip]
>>> 2. Merge two or three similar C codecs into one. We currently have one C codec for each Python codec. I have an idea to merge them into several groups of similar codecs, so that many common parts of the .so binaries will be saved:
>>>
>>> _codecs_jacodecs_1.so: euc-jp, shift-jis, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-ext
>>> _codecs_jacodecs_2.so: euc-jisx0213, shift-jisx0213, iso-2022-jp-3, euc-jis-2004, shift-jis-2004, iso-2022-jp-2004
>>> _codecs_jacodecs_3.so: iso-2022-jp-2
>>> _codecs_kocodecs_1.so: euc-kr, johab, iso-2022-kr
>>> _codecs_kocodecs_2.so: cp949
>>> _codecs_zhcodecs_1.so: gb2312, gbk, gb18030, hz
>>> _codecs_zhcodecs_2.so: big5, cp950
>>
>> +1, but why not put all Japanese codecs into one module, and ditto for the Korean and Chinese ones?
>>
>> Note that today's OS linkers will only mmap those pieces of code into the process memory that are actually needed by the application, so even though the size of the modules increases, the application's process memory footprint is not likely to increase.
>
> Okay. But what about embedded or frozen environments, or codecs compiled statically into Python by uncommenting them in Modules/Setup?

Same thing: the OS will only load those parts that are actually needed into memory. The only downside of having e.g. all modules statically linked into the python binary is the file size. OTOH, using static linking improves performance.

> If somebody needs to support only legacy Japanese encodings, he will want to include the legacy mapping (70K) but will not want the JIS X 0213 mapping (85K) or the KS X 1001 and GB2312 mappings (200K, for iso-2022-jp-2). And he may want to save space by just erasing files. In fact, I don't know how real Japanese developers use them; I just guessed. :)

Is this a common enough use case to warrant the added complexity of having to find the right _[123] mapping for the codec in question?

--
Marc-Andre Lemburg
eGenix.com
Hye-Shik Chang wrote:
> Okay. But what about embedded or frozen environments, or codecs compiled statically into Python by uncommenting them in Modules/Setup? If somebody needs to support only legacy Japanese encodings, he will want to include the legacy mapping (70K) but will not want the JIS X 0213 mapping (85K) or the KS X 1001 and GB2312 mappings (200K, for iso-2022-jp-2).

People who want that have many options: they could go back to an older version of CJKCodecs, they could use Japanese codecs, they could write their own codecs based on libraries that are only available to the embedded Python, or they could break down your modules again.

For the average user, it does not matter much. For packaging and maintenance, I believe it is slightly simpler to have fewer files. So if people have an actual need for non-standard customization, they can contribute a patch.

Regards,
Martin
On Sat, Jun 19, 2004 at 02:04:11PM +0200, "Martin v. Löwis" wrote:
> Hye-Shik Chang wrote:
>> Okay. But what about embedded or frozen environments, or codecs compiled statically into Python by uncommenting them in Modules/Setup? If somebody needs to support only legacy Japanese encodings, he will want to include the legacy mapping (70K) but will not want the JIS X 0213 mapping (85K) or the KS X 1001 and GB2312 mappings (200K, for iso-2022-jp-2).
>
> People who want that have many options: they could go back to an older version of CJKCodecs, they could use Japanese codecs, they could write their own codecs based on libraries that are only available to the embedded Python, or they could break down your modules again.
>
> For the average user, it does not matter much. For packaging and maintenance, I believe it is slightly simpler to have fewer files.
Yeah. I just finished merging the various codecs into a few per-locale modules.

                        before          after
                        (codecs+maps)
_codecs_cn.so           159851          130769
_codecs_jp.so           340350          241307
_codecs_kr.so           150269          125508
_codecs_tw.so           110057           97567
_codecs_unicode.so       17050           12332
_multibytecodec.so       24438           24439
                        802015          631922

As a result, 166KB is saved by this unification. And I guess that builtin codec initialization time on Windows may be reduced as well. :)

Hye-Shik
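The totals and the 166KB figure can be rechecked from the per-module numbers above. This is a small arithmetic sketch; the sizes are copied from the message (in bytes), and the dictionary name `sizes` is just an illustrative choice.

```python
# (before, after) sizes in bytes, copied from the table in the message.
sizes = {
    "_codecs_cn.so": (159851, 130769),
    "_codecs_jp.so": (340350, 241307),
    "_codecs_kr.so": (150269, 125508),
    "_codecs_tw.so": (110057, 97567),
    "_codecs_unicode.so": (17050, 12332),
    "_multibytecodec.so": (24438, 24439),
}

before_total = sum(before for before, _ in sizes.values())
after_total = sum(after for _, after in sizes.values())
saved_bytes = before_total - after_total
saved_kib = saved_bytes / 1024  # treating "KB" as 1024 bytes

print(before_total, after_total, saved_bytes, int(saved_kib))
```

The sums match the totals row, and the difference of 170093 bytes is about 166KB when a kilobyte is taken as 1024 bytes.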
participants (3): "Martin v. Löwis", Hye-Shik Chang, M.-A. Lemburg