
Hisao SUZUKI has just uploaded a patch to SF which includes codecs for the Japanese encodings EUC-JP, Shift_JIS and ISO-2022-JP, and wants to contribute the code to the PSF. The advantage of his codecs over the ones written by Tamito KAJIYAMA (http://www.asahi-net.or.jp/~rd6t-kjym/python/) lies in the fact that Hisao's codecs are small (88kB) and written in pure Python. This makes it much easier to adapt the codecs to special needs or to correct errors.

Provided Hisao volunteers to maintain these codecs, I'd like to suggest adding them to Python's encodings package and making them the default implementations for the above encodings. Ideal would be if we could get Hisao and Tamito to team up to support these codecs (I put him on CC).

Adding the codecs to the distribution would give Python a very good argument in the Japanese world and also help people working with XML or HTML targeting these locales.

Thoughts ?

-- Marc-Andre Lemburg, eGenix.com Software GmbH

"M.-A. Lemburg" <mal@lemburg.com> writes:
Thoughts ?
I'm in favour of adding support for Japanese codecs, but I wonder whether we shouldn't incorporate the C version of the Japanese codecs package instead, despite its size. I would also suggest that it might be more worthwhile to expose platform codecs, which would give us all CJK codecs on a number of major platforms, with a minimum increase in the size of the Python distribution, and with very good performance.

*If* Suzuki's code is incorporated, I'd like to get independent confirmation that it is actually correct. I know it took Tamito many iterations until his codecs were correct, where "correct" is a somewhat fuzzy term, since there are some really tricky issues for which there is no single correct solution (like whether \x5c is a backslash or a Yen sign, in these encodings). I notice (with surprise) that the actual mapping tables are extracted from Java, through Jython.

I also dislike the absence of the cp932 encoding in Suzuki's codecs. The suggestion to equate this to "mbcs" on Windows is not convincing, as a) "mbcs" does not mean cp932 on all Windows installations, and b) cp932 needs to be processed on other systems, too. I *think* cp932 could be implemented as a delta to shift-jis, as shown in http://hp.vector.co.jp/authors/VA003720/lpproj/test/cp932sj.htm (although I wonder why they don't list the backslash issue as a difference between shift-jis and cp932).

Regards, Martin

Hello from Japan, On 16 Jan 2003 11:05:55 +0100 martin@v.loewis.de (Martin v. Löwis) wrote:
I also vote for JapaneseCodecs. Talking about its size: the JapaneseCodecs package is much larger because it contains both the C version and the pure Python version. The C part of JapaneseCodecs is about 160kB (compiled on the Windows platform), and I don't think that makes much of a difference.
Yes, Tamito's JapaneseCodecs has been used for years by many Japanese users, while I had never heard of Suzuki's before.
Agreed.
http://www.ingrid.org/java/i18n/unicode-utf8.html may be a better reference. The page is written in English and encoded in UTF-8. -- Atsuo Ishimoto <ishimoto@gembook.org>

Martin v. Löwis wrote:
I was suggesting to make Suzuki's codecs the default. That doesn't prevent Tamito's codecs from working, since these are inside a package. If someone wants the C codecs, we should provide them as a separate download right alongside the standard distro (as discussed several times before). Note that the C codecs are not as easy to modify for special needs as the Python ones. While this may seem unnecessary, I've heard from a few people that companies in particular tend to extend the mappings with their own set of company-specific code points.
+1 We already have this on Windows (via the mbcs codec). If you could contribute your iconv codecs under the PSF license we'd go a long way in that direction on Unix as well.
*If* Suzuki's code is incorporated, I'd like to get independent confirmation that it is actually correct.
Since he built the codecs on the mappings in Java, this looks like enough third party confirmation already.
Indeed. I think that this kind of approach is a good one in the light of the "correctness" problems you mention above. It also helps with the compatibility side.
As always: contributions are welcome :-) -- Marc-Andre Lemburg, eGenix.com Software GmbH

"M.-A. Lemburg" <mal@lemburg.com> writes:
I wonder who will be helped by adding these codecs, if anybody who needs to process Japanese data on a regular basis will have to install that other package, anyway.
I still fail to see the rationale for that (or, rather, the rationale seems to vanish more and more). AFAIR, "size" was brought up as an argument against the code. However, the code base already contains huge amounts of code that not everybody needs, and the size increase on a binary distribution is rather minimal.
The Python codecs are not easy to modify, either: there is a large generated table, and you actually have to understand the generation algorithm, augment it, and run it through Jython. After that, you get a new mapping table, which you need to carry around *instead* of the one shipped with Python. So any user who wants to extend the mapping needs the generator more than the generated output.

If you want to augment the codec as-is, i.e. by wrapping it, you'd best install a PEP 293 error handler. This works nicely both with C codecs and pure Python codecs (out of the box, it probably works with neither of the candidate packages, but that would have to be fixed). Or, if you don't go the PEP 293 route, you can still use a plain wrapper around either codec.
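For illustration, a minimal sketch of the PEP 293 route in its modern Python 3 form (the handler name and the vendor mapping below are hypothetical, and this is not code from either candidate package):

    import codecs

    # Hypothetical mapping of company-specific code points to text the
    # base codec can already encode (a bytes replacement would also be
    # possible in Python 3 if raw vendor byte values were needed).
    VENDOR_FALLBACK = {'\ue000': '[GAIJI-1]'}

    def vendor_ext(exc):
        if isinstance(exc, UnicodeEncodeError):
            repl = ''.join(VENDOR_FALLBACK.get(ch, '?')
                           for ch in exc.object[exc.start:exc.end])
            return repl, exc.end       # resume encoding after the bad span
        raise exc                      # decoding errors are not handled here

    codecs.register_error('vendor-ext', vendor_ext)

    data = 'abc\ue000def'.encode('shift_jis', errors='vendor-ext')
    # -> b'abc[GAIJI-1]def'

The point is the one made above: the base codec's tables stay untouched, and the company-specific behaviour lives entirely in the registered handler.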
We already have this on Windows (via the mbcs codec).
That is insufficient, though, since it gives access to a single platform codec only. I have some code sitting around that exposes the codecs from inet.dll (or some such); this is the codec library that IE6 uses.
If you could contribute your iconv codecs under the PSF license we'd go a long way in that direction on Unix as well.
Ok, will do. There are still some issues with the code itself that need to be fixed, then I'll contribute it.
Not really. I *think* Sun has, when confronted with a popularity-or-correctness issue, taken the popularity side, leaving correctness alone. Furthermore, the code doesn't use the Java tables throughout, but short-cuts them. E.g. in shift_jis.py, we find

    if i < 0x80: # C0, ASCII
        buf.append(chr(i))

where i is a Unicode codepoint. I believe this is incorrect: in Shift-JIS, 0x5c is YEN SIGN, and indeed, the codec goes on with

    elif i == 0xA5: # Yen
        buf.append('\\')

So it maps both REVERSE SOLIDUS and YEN SIGN to 0x5c; this is an error (if it were a CP932 codec, it might (*) have been correct). See http://rf.net/~james/Japanese_Encodings.txt

Regards, Martin

(*) I'm not sure here; it might also be that Microsoft maps YEN SIGN to the full-width yen sign in CP 932.
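To make the policy question concrete, here is a minimal sketch of the single-byte portion of a strict Shift_JIS encoder under one possible reading of JIS X 0201 (reject U+005C, map U+00A5 to 0x5C); it illustrates the distinction Martin describes and is not the behaviour of any of the packages discussed:

    def encode_single_byte(ch):
        # Returns bytes for characters handled without a table, or None
        # so the caller falls through to the JIS X 0208 double-byte tables.
        # (A fully strict encoder would also special-case U+007E TILDE,
        # since 0x7E is OVERLINE in JIS X 0201.)
        cp = ord(ch)
        if cp == 0x5C:        # REVERSE SOLIDUS: no slot in JIS X 0201
            raise UnicodeEncodeError('shift_jis-sketch', ch, 0, 1,
                                     '0x5C is YEN SIGN, not backslash')
        if cp == 0xA5:        # YEN SIGN -> 0x5C
            return b'\x5c'
        if cp == 0x203E:      # OVERLINE -> 0x7E
            return b'\x7e'
        if cp < 0x80:         # remaining ASCII passes through unchanged
            return bytes([cp])
        return None

A CP932-style codec would make the opposite choice and map U+005C straight to 0x5C, which is exactly the kind of divergence that keeps "correct" fuzzy here.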

On Thu, Jan 16, 2003 at 11:05:55AM +0100, Martin v. Löwis wrote:
And the most important merit that the C version has but the pure Python version doesn't is sharing the library text between processes. Most modern OSes can share it, and the C version is even smaller than the Python version in the case of KoreanCodecs 2.1.x (on CVS). Here's the process status on a FreeBSD 5.0/i386 system with Python 2.3a1 (of 2003-01-15):

    USER   PID    %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
    perky  56713   0.0  1.2  3740  3056  p3  S+   8:11PM  0:00.08 python : python without any codecs
    perky  56739   6.3  5.7 15376 14728  p3  S+   8:17PM  0:04.02 python : python with python.cp949 codec
    perky  56749   0.0  1.2  3884  3196  p3  S+   8:20PM  0:00.06 python : python with c.cp949 codec

    alice(perky):/usr/pkg/lib/python2.3/site-packages/korean% size _koco.so
    text    data  bss  dec     hex    filename
    122861  1844  32   124737  1e741  _koco.so

With the C codec, processes share the 122861 bytes system-wide and consume only 1844 bytes each, whereas the pure Python codec consumes 12 megabytes in each process. This must be considered very seriously for the startup time of scripts that have "# encoding: euc-jp" or some other CJK encoding.
KoreanCodecs is tested on {Free,Net,Open}BSD, Linux, Solaris, HP-UX, Windows {95,98,NT,2000,XP} and Cygwin without any platform #ifdef's. I'm sure that any CJK codec can be ported to any platform that Python is ported to. Regards, Hye-Shik =)

Assuming the code is good, this seems the right thing from a technical perspective, but I'm worried what Tamito will think about it. Also, are there (apart from implementation technology) differences in features between the two? Do they always produce the same results? Would this kill Tamito's codecs, or are those still preferred for people doing a lot of Japanese? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
Also, are there (apart from implementation technology) differences in features between the two? Do they always produce the same results?
The JapaneseCodecs package comes with both Python and C versions of the codecs. It includes more encodings, in particular the cp932 codec, which is used on Windows (cp932 used to be understood as a synonym for shift-jis, but that understanding is incorrect, so these are considered as two different encodings these days). I believe they produce different output, but haven't tested. Hisao complains that Tamito's codecs don't include the full source for the generated files, but I believe (without testing) that you just need a few files from the Unicode consortium to generate all source code.
Would this kill Tamito's codecs, or are those still preferred for people doing a lot of Japanese?
As long as Python doesn't provide cp932, people will still install the JapaneseCodecs. Regards, Martin

"MAL" == M <mal@lemburg.com> writes:
MAL> Adding the codecs to the distribution would give Python a
MAL> very good argument in the Japanese world and also help people
MAL> working with XML or HTML targeting these locales.

+1

MAL> Ideal would be if we could get Hisao and Tamito to team up
MAL> to support these codecs (I put him on CC).

Yes, please. I've been using Tamito's codecs in Mailman and the Japanese users on my list have never complained, so I take it that they're doing their job well. Let's please try to get consensus before we choose one or the other, but I agree, I'd love to see them in Python. What about other Asian codecs? -Barry

Barry A. Warsaw wrote:
I'm not biased in any direction here. Again, I'd love to see the two sets be merged into one, e.g. take the Python ones from Hisao and use the C ones from Tamito if they are installed instead.
Let's please try to get consensus before we choose one or the other, but I agree, I'd love to see them in Python.
Sure.
What about other Asian codecs?
The other codecs in the SF Python Codecs project have license and maintenance problems. Most of these stem from Tamito's original codec, which was under the GPL. There are plenty of other encodings we'd need in order to cover most of the Asian scripts. However, in order for them to be usable we'll have to find people willing to maintain them, or at least make sure that they fit the need and are correct in their operation (where "correct" means usable in real life). -- Marc-Andre Lemburg, eGenix.com Software GmbH

On Thu, Jan 16, 2003 at 05:05:21PM +0100, M.-A. Lemburg wrote:
KoreanCodecs in the SF Korean Python Codecs (http://sf.net/projects/koco) a.k.a KoCo is changed from PSF License to LGPL on Barry's request in early 2002. Because it has many fancy codecs that aren't used in the Korean real world, I'd like to make an essential subset of KoreanCodecs under the PSF License if Python needs it. The KoCo implementation is the only widely used codec set for Korean encodings, and I can maintain it if needed. Regards, Hye-Shik =)

"HC" == Hye-Shik Chang <perky@fallin.lv> writes:
HC> KoreanCodecs in the SF Korean Python Codecs
HC> (http://sf.net/projects/koco) a.k.a KoCo is changed from PSF
HC> License to LGPL on Barry's request in early 2002.

Which I really appreciated because it made it easier to include the codecs in Mailman. But I won't have to worry about that if the codecs are part of Python, and a PSF license probably makes most sense for that purpose. You own the copyrights, correct? If so, there should be no problem re-licensing it under the PSF license for use in Python. Thanks! -Barry

On Thu, 16 Jan 2003 17:05:21 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
Here again, I entreat you to add Tamito's C version as the standard Japanese codec. It's fast and has proven quite stable and correct. If people need to customize the mapping table (this may happen in some cases, but it is not a common task; I believe almost 100% of Japanese programmers have never written such a special mapping table), and if they think Tamito's codec is too difficult to customize (while I don't think so), and only if they are satisfied with the performance of a codec written in Python, they will download and install the Python version of the codec. -- Atsuo Ishimoto <ishimoto@gembook.org>

Atsuo Ishimoto wrote:
Wouldn't it be better to use Hisao's codec per default and revert to Tamito's in case that's installed in the system ? We also need active maintainers for the codecs. I think ideal would be to get Hisao to share this load -- Hisao for the Python version and Tamito for the C one. -- Marc-Andre Lemburg, eGenix.com Software GmbH

"M.-A. Lemburg" <mal@lemburg.com> writes:
Wouldn't it be better to use Hisao's codec per default and revert to Tamito's in case that's installed in the system ?
I still don't see the rationale for incorporating code that has fewer functions and lower performance instead of incorporating code that has more functions and higher performance.
That is a valid point. Has Hisao volunteered to maintain his code? Regards, Martin

"MvL" == Martin v Löwis <martin@v.loewis.de> writes:
>> We also need active maintainers for the codecs. I think ideal
>> would be to get Hisao to share this load -- Hisao for the Python
>> version and Tamito for the C one.

MvL> That is a valid point. Has Hisao volunteered to maintain his
MvL> code?

AFAICT, Tamito is pretty good about maintaining his codec. I see new versions announced fairly regularly. -Barry

On Tue, 21 Jan 2003 16:40:19 -0500 barry@python.org (Barry A. Warsaw) wrote:
(cc'ing another of Tamito's mail addresses. Tamito, are you awake?) I believe he will continue to maintain it. Of course, I and people in the Japanese Python community will help him. I don't expect that kind of community effort for Hisao's codec. Active users in Japan will continue to use Tamito's, and won't care whether the Python version is broken or not. -- Atsuo Ishimoto <ishimoto@gembook.org>

Atsuo Ishimoto wrote:
Yes.
AFAICT, Tamito is pretty good about maintaining his codec. I see new versions announced fairly regularly.
I wasn't saying that he doesn't maintain the code. Indeed, he does a very good job at it.
Hmm, there seems to be a very strong feeling towards Tamito's codecs in the Japanese community. The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB. That's why I was suggesting to use Hisao's codecs as the default and to revert to Tamito's in case they are installed (much like you'd use cStringIO instead of StringIO if it's installed). -- Marc-Andre Lemburg, eGenix.com Software GmbH

"M.-A. Lemburg" <mal@lemburg.com> writes:
The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB.
It isn't quite that bad: You need to count the "c" directory only, which is 690kB on my system.
The analogy isn't that good here: it would be more similar if StringIO were incomplete, e.g. lacking a .readlines() function, so you would have no choice but to use cStringIO if you happen to need .readlines(). Regards, Martin

Martin v. Löwis wrote:
I was looking at the directory which gets installed to site-packages. That's 1790kB on my system.
But you get the picture ... ;-) -- Marc-Andre Lemburg, eGenix.com Software GmbH

On Wed, 22 Jan 2003 10:29:54 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB.
You cannot compare the sizes of the untarred files here. Tamito's codecs package contains the sources of both the C version and the Python version. About 1 MB of the 1790kB is the size of the C sources. So I'm proposing to add only the C version of the codecs from the JapaneseCodecs package. As I mentioned, the size of the C version is about 160 KB in Win32 binary form, excluding tests and documentation. I don't see a significant difference between them. If the size of the C sources (about 1 MB) is an issue, we may be able to reduce it.
Hmm, I assume cStringIO is always installed. I use StringIO only if I want to subclass the StringIO class. -- Atsuo Ishimoto <ishimoto@gembook.org>

Atsuo Ishimoto wrote:
I was talking about the *installed* size, ie. the size of the package in site-packages:

    degas site-packages/japanese# du
     337    ./c
    1252    ./mappings
      88    ./python
       8    ./aliases
    1790    .

Hisao's Python codec is only 85kB in size. Now, if we took only the C version of Tamito's codec, we'd end up with around 1790 - 1252 - 88 = 450 kB. Still a factor of 5... I wonder whether it wouldn't be possible to use the same tricks Hisao used in his codec for a C version.
The source code size is not that important. The install size is and even more the memory footprint. Hisao's approach uses a single table which fits into 58kB of Python source code. Boil that down to a static C table and you'll end up with something around 10-20kB of static C data. Hisao does still build a dictionary using this data, but perhaps that step could be avoided using the same techniques that Fredrik used in boiling down the size of the unicodedata module (which holds the Unicode Database). -- Marc-Andre Lemburg, eGenix.com Software GmbH
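For reference, a minimal sketch of that two-level decomposition idea (in the spirit of the unicodedata generator, not taken from it): the flat table is split into an index array plus a list of unique blocks, so identical blocks are stored only once and the dictionary can be skipped entirely.

    def split_table(table, block_size=128):
        # table: a flat sequence indexed by code point (None where unmapped)
        index, blocks, seen = [], [], {}
        for start in range(0, len(table), block_size):
            block = tuple(table[start:start + block_size])
            if block not in seen:
                seen[block] = len(blocks)
                blocks.append(block)
            index.append(seen[block])
        return index, blocks

    def lookup(index, blocks, code, block_size=128):
        return blocks[index[code // block_size]][code % block_size]

The saving depends entirely on how many blocks repeat; as Martin argues later in the thread, CJK-to-Unicode tables offer little such repetition, which is why this trick pays off far better for unicodedata than it would here.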

On Wed, Jan 22, 2003 at 01:06:47PM +0100, M.-A. Lemburg wrote: [snip]
The trick can't be used in the C version, because the C codecs need to keep both the encoding and decoding maps as constants, so that they can be shared between processes and the data is loaded only once in the whole system. This matters especially for multi-process daemons. Hye-Shik =)

Hye-Shik Chang wrote:
Why not ? Anything that can trim down the memory footprint as well as the installation size is welcome :-)
Indeed, that's why the Unicode database is also stored this way. -- Marc-Andre Lemburg, eGenix.com Software GmbH

On Wed, 22 Jan 2003 13:06:47 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
I was talking about the *installed* size, ie. the size of the package in site-packages:
I'm sorry for my misunderstanding.
Please try "strip ./c/_japanese_codecs.so". On my Linux box, this reduces the size of _japanese_codecs.so from 530 KB to 135 KB. I think this is a reasonable size, because it contains more tables than Hisao's version.
Thank you for your advice. I will try it later, if you still think JapaneseCodecs is too large. -- Atsuo Ishimoto <ishimoto@gembook.org>

Atsuo Ishimoto wrote:
Ok, we're finally approaching a very reasonable size :-) BTW, why is it that Hisao can use one table for all supported encodings where Tamito uses 6 tables ?
That would be great, thanks ! -- Marc-Andre Lemburg, eGenix.com Software GmbH

On Wed, 22 Jan 2003 14:37:08 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
Ok, we're finally approaching a very reasonable size :-)
I'm really glad to hear that.
BTW, why is it that Hisao can use one table for all supported encodings where Tamito uses 6 tables ?
There are several kinds of character sets used in Japan. His codec supports only two character sets, called JIS X 0201 and 0208. Tamito's codec supports other character sets as well, such as JIS X 0212 or Microsoft's extended character set called cp932.
I'm not sure whether this would be effective or not, though. The mapping tables in the current implementation are already well condensed. Anyway, I'll try to reduce the size. -- Atsuo Ishimoto <ishimoto@gembook.org>

"M.-A. Lemburg" <mal@lemburg.com> writes:
I was talking about the *installed* size, ie. the size of the package in site-packages:
Right. And we are trying to tell you that this is irrelevant when talking about the size increase to be expected when JapaneseCodecs is incorporated into Python.
You should ignore mappings and python in your counting, they are not needed.
I wonder whether it wouldn't be possible to use the same tricks Hisao used in his codec for a C version.
I believe it does use the same tricks. It's just that the JapaneseCodecs package supports a number of widely-used encodings which Hisao's package does not support.
The source code size is not that important. The install size is and even more the memory footprint.
Computing the memory footprint is very difficult, of course.
How did you obtain this number?
Perhaps, yes. Have you studied the actual data to see whether these techniques might help or not? Regards, Martin

Martin v. Löwis wrote:
Why is it irrelevant ? If it were irrelevant, Fredrik wouldn't have invested so much time in trimming down the footprint of the Unicode database. What we need is a generic approach here which works for more than just the Japanese codecs. I believe that those codecs could provide a good basis for more codecs from the Asian locales, but before adding megabytes of mapping tables, I'd prefer to settle on a good design first.
By looking at the code. It uses Unicode literals to define the table.
It's just a hint: mapping tables are all about trading fast lookup against memory consumption, and that's what Fredrik's decomposition approach does rather well (Tamito already uses such an approach). cdb would provide an alternative approach, but there are licensing problems... -- Marc-Andre Lemburg, eGenix.com Software GmbH

"M.-A. Lemburg" <mal@lemburg.com> writes:
Because the size increase you have reported won't be the size increase observed if JapaneseCodecs is incorporated into Python.
The trie approach in unicodedata requires that many indices have equal entries and that, when grouping entries into blocks, repeated blocks can be found. This is not the case for CJK mappings, as there is no inherent correlation between the code points in some CJK encoding and the equivalent Unicode code points. In Unicode, the characters have seen Han Unification and are sorted according to the sorting principles of Han Unification. In the other encodings, other sorting principles have been applied, and no unification has taken place. Insofar as chunks of the encoding are more systematic, the JapaneseCodecs package already employs algorithmic mappings; see _japanese_codecs.c, e.g. for the mapping of ASCII, or the 0201 halfwidth characters. Regards, Martin
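As an example of such an algorithmic mapping (a sketch based on the usual JIS X 0201 layout, not an excerpt from _japanese_codecs.c): the halfwidth-katakana block U+FF61..U+FF9F lines up with the single bytes 0xA1..0xDF, so plain arithmetic replaces a table for that range.

    def encode_halfwidth_kana(ch):
        # U+FF61..U+FF9F -> 0xA1..0xDF (single bytes in Shift_JIS)
        cp = ord(ch)
        if 0xFF61 <= cp <= 0xFF9F:
            return bytes([cp - 0xFF61 + 0xA1])
        return None    # outside the algorithmic range; use the tables

    def decode_halfwidth_kana(byte):
        if 0xA1 <= byte <= 0xDF:
            return chr(byte - 0xA1 + 0xFF61)
        return None

For the bulk of JIS X 0208 and 0212, however, no such arithmetic exists, which is exactly why the big tables remain.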

Atsuo Ishimoto <ishimoto@gembook.org> writes:
| (cc'ing another of Tamito's mail addresses. Tamito, are you awake?)

Sorry for the late participation. Things go fast and my thought is very slow... I know the python-dev list is a highly technical place for discussions, but I'd like to explain my personal situation and related matters.

On my situation: I'm a doctoral candidate and my job has come to a very tough period. I do want to volunteer for the great task of incorporating JapaneseCodecs into the Python distro, but I'm not sure that I have enough spare time to do it. I don't want to admit I cannot do that, but it's very likely.

On the efficiency of my codecs: Honestly speaking, the priorities with regard to time and space efficiency during the development of JapaneseCodecs were very low. I believe there is much room for improvement. The set of mapping tables in the pure Python codecs would be the very first candidate.

On Suzuki-san's codecs: I had never imagined that JapaneseCodecs would have a competitor. I think my codecs package is a good product, but I don't have the kind of strong confidence that Suzuki-san has in his work. I believe his codecs package deserves to be *the* default Japanese codecs package, if only because of his positive commitment, among other advantages.

Anyway, I'm very glad that Atsuo has expressed his support for my codecs. And, thank you, Guido. I was really relieved by your thoughtful kindness.

Regards, -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA wrote:
Thanks for joining in. I had hoped to hear a word from you on the subject :-)
Ok, how about this: we include the C versions of your codecs in the distribution and you take over maintenance as soon as time permits. Still, I'd love to see some further improvement of the size and performance of the codecs (and maybe support for the new error callbacks; something which Hisao has integrated into his codecs). Would it be possible for you two to team up for the further development of the Japanese codecs ? Perhaps Hye-Shik Chang could join you in the effort, since he's the author of the KoreanCodecs package which has somewhat similar problem scope (that of stateful encodings with a huge number of mappings) ? Thanks, -- Marc-Andre Lemburg, eGenix.com Software GmbH

M.-A. Lemburg wrote:
I believe (without checking in detail) that the "statefulness" is also an issue in these codecs. Many of the CJK encodings aren't stateful beyond being multi-byte (except for the iso-2022 ones). IOW, there is a non-trivial state only if you process the input byte-for-byte: you have to know whether you are at the first or second byte (and what the first byte was if you are at the second byte). AFAICT, both Japanese codecs assume that you can always look at the second byte when you get the first byte.

Of course, this assumption is wrong if you operate in a stream mode and read the data in, say, chunks of 1024 bytes: such a chunk may split exactly between a first and second byte (*). In these cases, I believe, both codecs would give incorrect results. Please correct me if I'm wrong.

Regards, Martin

(*) The situation is worse for GB 18030, which also has 4-byte encodings.
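A minimal sketch of what a chunk-safe stream decoder has to do for a simple two-byte encoding (the single lead-byte test is a deliberate simplification of EUC-JP, which also has the 0x8E/0x8F single-shift forms, and it ignores the extra state that iso-2022 needs):

    class ChunkDecoder:
        # Carry an incomplete trailing character over to the next chunk.
        def __init__(self, encoding='euc_jp'):
            self.encoding = encoding
            self.pending = b''

        def feed(self, chunk):
            data = self.pending + chunk
            end = 0
            while end < len(data):
                width = 2 if data[end] >= 0x80 else 1   # simplified rule
                if end + width > len(data):
                    break                 # trailing lead byte: keep it
                end += width
            complete, self.pending = data[:end], data[end:]
            return complete.decode(self.encoding)

    d = ChunkDecoder()
    # A read boundary falls in the middle of the second character:
    text = d.feed(b'\xa4\xb3\xa4') + d.feed(b'\xf3')   # -> 'こん'

This is essentially what a StreamReader needs to track between read() calls; a one-shot decode of a complete string can reasonably skip it, as the follow-ups note.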

"Martin v. Löwis" <martin@v.loewis.de> writes: | | M.-A. Lemburg wrote: | > Perhaps Hye-Shik Chang could join you in the effort, since he's | > the author of the KoreanCodecs package which has somewhat | > similar problem scope (that of stateful encodings with a huge | > number of mappings) ? | | I believe (without checking in detail) that the "statefulness" is also | an issue in these codecs. | | Many of the CJK encodings aren't stateful beyond being multi-byte | (except for the iso-2022 ones). IOW, there is a non-trivial state only | if you process the input byte-for-byte: you have to know whether you are | a the first or second byte (and what the first byte was if you are at | the second byte). AFAICT, both Japanese codecs assume that you can | always look at the second byte when you get the first byte. Right, as far as my codecs are concerned. All decoders in the JapaneseCodecs package assume that the input byte sequence does not end in a middle of a multi-byte character. The iso-2022 decoders even assume that the input sequence is a "valid" text as defined in RFC1468 (i.e. the text must end in the US ASCII mode). However, AFAIK, these assumptions in the decoders seem well-accepted in the real world applications. The StreamReader/Writer classes in JapaneseCodecs can cope with the statefulness, BTW. -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:
The StreamReader/Writer classes in JapaneseCodecs can cope with the statefulness, BTW.
I see. I was really concerned about the stream reader only; I agree that it is perfectly reasonable to assume that an individual string is "complete" with regard to the encoding. Notice that your codec is thus more advanced than both the Python UTF-8 codec and Hisao's codec: neither of those cares about this issue; this is a bug in both. Regards, Martin

"M.-A. Lemburg" <mal@lemburg.com> writes: | | Ok, how about this: we include the C versions of your codecs | in the distribution and you take over maintenance as soon | as time permits. I agree. -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

On Wed, Jan 22, 2003 at 09:36:17PM +0100, M.-A. Lemburg wrote:
I just submitted a patch for incorporating KoreanCodecs (SF #684142). It's just around 55KB as a stripped 32-bit binary. I'm studying the Japanese and Chinese encodings nowadays. I'll try to make a patch for Tamito's JapaneseCodecs and Chen Chien-Hsun's ChineseCodecs for PEP 293 support and a smaller binary size if I can. :)
Regards, Hye-Shik =)

"M.-A. Lemburg" <mal@lemburg.com> writes:
Thoughts ?
I'm in favour of adding support for Japanese codecs, but I wonder whether we shouldn't incorporate the C version of the Japanese codecs package instead, despite its size. I would also suggest that it might be more worthwhile to expose platform codecs, which would give us all CJK codecs on a number of major platforms, with a minimum increase in the size of the Python distribution, and with very good performance. *If* Suzuki's code is incorporated, I'd like to get independent confirmation that it is actually correct. I know Tamito has taken many iterations until it was correct, where "correct" is a somewhat fuzzy term, since there are some really tricky issues for which there is no single one correct solution (like whether \x5c is a backslash or a Yen sign, in these encodings). I notice (with surprise) that the actual mapping tables are extracted from Java, through Jython. I also dislike absence of the cp932 encoding in Suzuki's codecs. The suggestion to equate this to "mbcs" on Windows is not convincing, as a) "mbcs" does not mean cp932 on all Windows installations, and b) cp932 needs to be processed on other systems, too. I *think* cp932 could be implemented as a delta to shift-jis, as shown in http://hp.vector.co.jp/authors/VA003720/lpproj/test/cp932sj.htm (although I wonder why they don't list the backslash issue as a difference between shift-jis and cp932) Regards, Martin

Hello from Japan, On 16 Jan 2003 11:05:55 +0100 martin@v.loewis.de (Martin v. Lvwis) wrote:
I also vote for JapaneseCodec. Talking about it's size, JapaneseCodec package is much lager because it contains both C version and pure Python version. Size of C version part of JapaneseCodec is about 160kb(compiled on Windows platform), and I don't think it makes difference.
Yes, Tamito's JapaneseCodec has been used for years by many Japanese users, while I've never heard about Suzuki's one.
Agreed.
http://www.ingrid.org/java/i18n/unicode-utf8.html may be better reference. This page is written in English with utf-8. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

Martin v. Löwis wrote:
I was suggesting to make Suzuki's codecs the default. That doesn't prevent Tamito's codecs from working, since these are inside a package. If someone wants the C codecs, we should provide them as separate download right alongside of the standard distro (as discussed several times before). Note that the C codecs are not as easy to modify to special needs as the Python ones. While this may seem unnecessary I've heard from a few people that especially companies tend to extend the mappings with their own set of company specific code points.
+1 We already have this on Windows (via the mbcs codec). If you could contribute your iconv codecs under the PSF license we'd go a long way in that direction on Unix as well.
*If* Suzuki's code is incorporated, I'd like to get independent confirmation that it is actually correct.
Since he built the codecs on the mappings in Java, this looks like enough third party confirmation already.
Indeed. I think that this kind of approach is a good one in the light of the "correctness" problems you mention above. It also helps with the compatibility side.
As always: contributions are welcome :-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
I wonder who will be helped by adding these codecs, if anybody who needs to process Japanese data on a regular basis will have to install that other package, anyway.
I still fail to see the rationale for that (or, rather, the rationale seems to vanish more and more). AFAIR, "size" was brought up as an argument against the code. However, the code base already contains huge amounts of code that not everybody needs, and the size increase on a binary distribution is rather minimal.
The Python codecs are not easy to modify, either: there is a large generated table, and you actually have to understand the generation algorithm, augment it, run it through Jython. After that, you get a new mapping table, which you need to carry around *instead* of the one shipped with Python. So any user who wants to extend the mapping needs the generator more than the generated output. If you want to augment the codec as-is, i.e. by wrapping it, you best install a PEP 293 error handler. This works nicely both with C codecs and pure Python codecs (out of the box, it probably works with neither of the candidate packages, but that would have to be fixed). Or, if you don't go the PEP 293, you can still use a plain wrapper around both codecs.
We already have this on Windows (via the mbcs codec).
That is insufficient, though, since it gives access to a single platform codec only. I have some code sitting around that exposes the codecs from inet.dll (or some such); this is the codec library that IE6 uses.
If you could contribute your iconv codecs under the PSF license we'd go a long way in that direction on Unix as well.
Ok, will do. There are still some issues with the code itself that need to be fixed, then I'll contribute it.
Not really. I *think* Sun has, when confronted with a popularity-or-correctness issue, taken the popularity side, leaving correctness alone. Furthermore, the code doesn't use the Java tables throughout, but short-cuts them. E.g. in shift_jis.py, we find if i < 0x80: # C0, ASCII buf.append(chr(i)) where i is a Unicode codepoint. I believe this is incorrect: In shift-jis, 0x5c is YEN SIGN, and indeed, the codec goes on with elif i == 0xA5: # Yen buf.append('\\') So it maps both REVERSE SOLIDUS and YEN SIGN to 0x5c; this is an error (if it was a CP932 codec, it might (*) have been correct). See http://rf.net/~james/Japanese_Encodings.txt Regards, Martin (*) I'm not sure here, it also might be that Microsoft maps YEN SIGN to the full-width yen sign, in CP 932.

On Thu, Jan 16, 2003 at 11:05:55AM +0100, Martin v. L?wis wrote:
And, the most important merit that C version have but Pure version doesn't is sharing library texts inter processes. Most modern OSes can share them and C version is even smaller than Python version in case of KoreanCodecs 2.1.x (on CVS) Here's process status on FreeBSD 5.0/i386 with Python 2.3a1(of 2003-01-15) system. USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND perky 56713 0.0 1.2 3740 3056 p3 S+ 8:11PM 0:00.08 python : python without any codecs perky 56739 6.3 5.7 15376 14728 p3 S+ 8:17PM 0:04.02 python : python with python.cp949 codec perky 56749 0.0 1.2 3884 3196 p3 S+ 8:20PM 0:00.06 python : python with c.cp949 codec alice(perky):/usr/pkg/lib/python2.3/site-packages/korean% size _koco.so text data bss dec hex filename 122861 1844 32 124737 1e741 _koco.so On C codec, processes shares 122861 bytes on system-wide and consumes only 1844 bytes each, besides on Pure codec consumes 12 Mega bytes each. This must concerned very seriously for launching time of have "# encoding: euc-jp" or something CJK encodings.
KoreanCodecs is tested on {Free,Net,Open}BSD, Linux, Solaris, HP-UX, Windows{95,98,NT,2000,XP}, Cygwin without any platform #ifdef's. I sure that any CJK codecs can be ported into any platforms that Python is ported. Regards, Hye-Shik =)

Assuming the code is good, this seems the right thing from a technical perspective, but I'm worried what Tamito will think about it. Also, are there (apart from implementation technology) differences in features between the two? Do they always produce the same results? Would this kill Tamito's codecs, or are those still preferred for people doing a lot of Japanese? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
Also, are there (apart from implementation technology) differences in features between the two? Do they always produce the same results?
The JapaneseCodecs package comes with both Python and C versions of the codecs. It includes more encodings, in particular the cp932 codec, which is used on Windows (cp932 used to be understood as a synonym for shift-jis, but that understanding is incorrect, so these are considered as two different encodings these days). I believe they produce different output, but haven't tested. Hisao complains that Tamito's codecs don't include the full source for the generated files, but I believe (without testing) that you just need a few files from the Unicode consortium to generate all source code.
Would this kill Tamito's codecs, or are those still preferred for people doing a lot of Japanese?
As long as Python doesn't provide cp932, people will still install the JapaneseCodecs. Regards, Martin

"MAL" == M <mal@lemburg.com> writes:
MAL> Adding the codecs to the distribution would give Python a MAL> very good argument in the Japanese world and also help people MAL> working with XML or HTML targetting these locales. +1 MAL> Ideal would be if we could get Hisao and Tamito to team up MAL> to support these codecs (I put him on CC). Yes, please. I've been using Tamito's codecs in Mailman and the Japanese users on my list have never complained, so I take that as that they're doing their job well. Let's please try to get consensus before we choose one or the other, but I agree, I'd love to see them in Python. What about other Asian codecs? -Barry

Barry A. Warsaw wrote:
I'm not biased in any direction here. Again, I'd love to see the two sets be merged into one, e.g. take the Python ones from Hisao and use the C ones from Tamito if they are installed instead.
Let's please try to get consensus before we choose one or the other, but I agree, I'd love to see them in Python.
Sure.
What about other Asian codecs?
The other codecs in the SF Python Codecs project have license and maintenance problems. Most of these stem from Tamito's original codec which was under GPL. There are plenty other encodings we'd need to cover most of the Asian scripts. However, in order for them to be usable we'll have to find people willing to maintain them or at least make sure that they fit the need and are correct in their operation (where "correct" means usable in real life). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

On Thu, Jan 16, 2003 at 05:05:21PM +0100, M.-A. Lemburg wrote:
KoreanCodecs in the SF Korean Python Codecs (http://sf.net/projects/koco) a.k.a KoCo is changed from PSF License to LGPL on Barry's request in early 2002. Because it has many fancy codecs that isn't used in korean real world, I'd like to make an essence of KoreanCodecs in PSF License if python needs it. KoCo implementation is the only widely-used codec set for korean encodings and I can maintain if it needs. Regards, Hye-Shik =)

"HC" == Hye-Shik Chang <perky@fallin.lv> writes:
HC> KoreanCodecs in the SF Korean Python Codecs HC> (http://sf.net/projects/koco) a.k.a KoCo is changed from PSF HC> License to LGPL on Barry's request in early 2002. Which I really appreciated because it made it easier to include the codecs in Mailman. But I won't have to worry about that if the codecs are part of Python, and a PSF license probably makes most sense for that purpose. You own the copyrights, correct? If so, there should be no problem re-licensing it under the PSF license for use in Python. Thanks! -Barry

On Thu, 16 Jan 2003 17:05:21 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
Here again, I entreat to add Tamito's C version as standard Japanese codec. It's fast, is proven quite stable and correct. If people need to customize mapping table(this may happen in some cases, but not common task. I believe almost 100% of Japanese programmers never had wrote such a special mapping table), and if they think Tamito's codec is too difficult to customize(while I don't think so), and only if they are satisfied with performance of codec written in Python, they will download and install the Python version of codec. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

Atsuo Ishimoto wrote:
Wouldn't it be better to use Hisao's codec per default and revert to Tamito's in case that's installed in the system ? We also need active maintainers for the codecs. I think ideal would be to get Hisao share this load -- Hisao for the Python version and Tamito for the C one. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
Wouldn't it be better to use Hisao's codec per default and revert to Tamito's in case that's installed in the system ?
I still don't see the rationale of incorporating code that has less functions and less performance instead of incorporating code that has more functions and more performance.
That is a valid point. Has Hisao volunteered to maintain his code? Regards, Martin

"MvL" == Martin v Löwis <martin@v.loewis.de> writes:
>> We also need active maintainers for the codecs. I think ideal >> would be to get Hisao share this load -- Hisao for the Python >> version and Tamito for the C one. MvL> That is a valid point. Has Hisao volunteered to maintain his MvL> code? AFAICT, Tamito is pretty good about maintaining his codec. I see new versions announced fairly regularly. -Barry

On Tue, 21 Jan 2003 16:40:19 -0500 barry@python.org (Barry A. Warsaw) wrote:
(cc'ing another Tamito's mail addr. Tamito, are you wake up?) I believe he will continue to maintain it. Of cource, I and people in the Japanese Python community will help him. I don't expect such kind of community effort for Hisao's codec. Active users in Japan will continue to use Tamio's one, and don't care Python version is broken or not. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

Atsuo Ishimoto wrote:
Yes.
AFAICT, Tamito is pretty good about maintaining his codec. I see new versions announced fairly regularly.
I wasn't saying that he doesn't maintain the code. Indeed, he does a very good job at it.
Hmm, there seems to be a very strong feeling towards Tamito's codecs in the Japanese community. The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB. That's why I was suggesting to use Hisao's codecs as default and to revert to Tamito's in case they are installed (much like you'd use cStringIO instead of StringIO if it's installed). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB.
It isn't quite that bad: You need to count the "c" directory only, which is 690kB on my system.
The analogy isn't that good here: it would be more similar if StringIO was incomplete, e.g. would be lacking a .readlines() function, so you would have no choice but to use cStringIO if you happen to need .readlines(). Regards, Martin

Martin v. Löwis wrote:
I was looking at the directory which gets installed to site-packages. That's 1790kB on my system.
But you get the picture ... ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

On Wed, 22 Jan 2003 10:29:54 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
The problem I see is size: Tamito's codecs have an installed size of 1790kB while Hisao's codecs are around 81kB.
You cannot compare size of untared files here. Tamito's codecs package contains source of C version and Python version. About 1 MB in 1790kB is size of C sources. So, I'm proposing to add only C version of codec from JapaneseCodecs package. As I mentioned, size of C version is about 160 KB in Win32 binary form, excluding tests and documentations. I don't see a significant difference between them. If size of C sources(about 1 MB) is matter, we may be able to reduce it.
Hmm, I assume cStringIO is installed always. I use StringIO only if I want to subclass StringIO class. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

Atsuo Ishimoto wrote:
I was talking about the *installed* size, ie. the size of the package in site-packages: degas site-packages/japanese# du 337 ./c 1252 ./mappings 88 ./python 8 ./aliases 1790 . Hisao's Python codec is only 85kB in size. Now, if we took the only the C version of Tamito's codec, we'd end up with around 1790 - 1252 - 88 = 450 kB. Still a factor of 5... I wonder whether it wouldn't be possible to use the same tricks Hisao used in his codec for a C version.
The source code size is not that important. The install size is and even more the memory footprint. Hisao's approach uses a single table which fits into 58kB Python source code. Boil that down to a static C table and you'll end up with something around 10-20kB for static C data. Hisao does still builds a dictionary using this data, but perhaps that step could be avoided using the same techniques that Fredrik used in boiling down the size of the unicodedata module (which holds the Unicode Database). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

On Wed, Jan 22, 2003 at 01:06:47PM +0100, M.-A. Lemburg wrote: [snip]
The trick must not be used in C version. Because C codecs need to keep both of encoding and decoding maps as constants so that share texts inter processes and load the data only once in the whole system. This does matter for multiprocess daemons especially. Hye-Shik =)

Hye-Shik Chang wrote:
Why not ? Anything that can trim down the memory footprint as well as the installation size is welcome :-)
Indeed, that's why the Unicode database is also stored this way. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

On Wed, 22 Jan 2003 13:06:47 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
I was talking about the *installed* size, ie. the size of the package in site-packages:
I'm sorry for my misunderstanding.
Please try strip ./c/_japanese_codecs.so In my linux box, this reduces size of _japanese_codecs.so from 530 KB into 135 KB. I think this is reasonable size because it contains more tables than Hisao's version.
Thank you for your advice. I will try it later, if you still think JapaneseCodec is too large. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

Atsuo Ishimoto wrote:
Ok, we're finally approaching a very reasonable size :-) BTW, why is it that Hisao can use one table for all supported encodings where Tamito uses 6 tables ?
That would be great, thanks ! -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

On Wed, 22 Jan 2003 14:37:08 +0100 "M.-A. Lemburg" <mal@lemburg.com> wrote:
Ok, we're finally approaching a very reasonable size :-)
I'm really grad to hear so.
BTW, why is it that Hisao can use one table for all supported encodings where Tamito uses 6 tables ?
There are several kind of character set used in Japan. His codec supports only two character set called JIS X 0201 and 0208. Tamito's codec supports other character sets such as JIS X 0212 or Microsoft's extended charactor set called cp932.
I'm not sure this is effective or not, though. Mappling tables under current implementation are well condensed. Anyway, I'll try to reduce size. -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.jp

"M.-A. Lemburg" <mal@lemburg.com> writes:
I was talking about the *installed* size, ie. the size of the package in site-packages:
Right. And we are trying to tell you that this is irrelevant when talking about the size increase to be expected when JapaneseCodecs is incorporated into Python.
You should ignore mappings and python in your counting, they are not needed.
I wonder whether it wouldn't be possible to use the same tricks Hisao used in his codec for a C version.
I believe it does use the same tricks. It's just that the JapaneseCodecs package supports a number of widely-used encodings which Hisao's package does not support.
The source code size is not that important. The install size is and even more the memory footprint.
Computing the memory footprint is very difficult, of course.
How did you obtain this number?
Perhaps, yes. Have you studied the actual data to see whether these techniques might help or not? Regards, Martin

Martin v. Löwis wrote:
Why is it irrelevant ? If it would be irrelevant Fredrik wouldn't have invested so much time in trimming down the footprint of the Unicode database. What we need is a generic approach here which works for more than just the Japanese codecs. I believe that those codecs could provide a good basis for more codecs from the Asian locale, but before adding megabytes of mapping tables, I'd prefer to settle for a good design first.
By looking at the code. It uses Unicode literals to define the table.
It's just a hint: mapping tables are all about fast lookup vs. memory consumption and that's what Fredrik's approach of decomposition does rather well (Tamito already uses such an approach). cdb would provide an alternative approach, but there are licensing problems... -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
Because the size increase you have reported won't be the size increase observed if JapaneseCodecs is incorporated into Python.
The trie approach in unicodedata requires that many indices have equal entries, and that, when grouping entries into blocks, multiple blocks can be found. This is not the case for CJK mappings, as there is no inherent correlation between the code points in some CJK encoding and the equivalent Unicode code point. In Unicode, the characters have seen Han Unification, and are sorted according to the sorting principles of Han Unification. In other encodings, other sorting principles have been applied, and no unification has taken place. Insofar chunks of the encoding are more systematic, the JapaneseCodecs package already employs algorithmic mappings, see _japanese_codecs.c, e.g. for the mapping of ASCII, or the 0201 halfwidth characters. Regards, Martin

Atsuo Ishimoto <ishimoto@gembook.org> writes: | | (cc'ing another Tamito's mail addr. Tamito, are you wake up?) Sorry for the late participation. Things go fast and my thought is very slow... I know the python-dev list is a highly technical place of discussions, but I'd like to explain my personal situation and related matters. On my situation: I'm a doctoral candidate and my job has come to a very tough period. I do want to volunteer for the great task of incorporating JapaneseCodecs into the Python distro, but I'm not sure that I have enough spare time to do it. I don't want to admit I cannot do that, but it's very likely. On the efficiency of my codecs: Honestly speaking, the priorities with regard to time and space efficiencies during the development of JapaneseCodecs were very low. I believe there is much room for improvements. The set of mapping tables in the pure Python codecs would be the very first candidate. On Suzuki-san's codecs: I had never imagined that JapaneseCodecs would have a competitor. I think my codecs package is a good product, but I don't have such strong confidence that Suzuki-san has on his work. I believe his codecs package deserves *the* default Japanese codecs package only because of his positive commitment among other advantages. Anyway, I'm very glad that Atsuo has expressed his favor on my codecs. And, thank you, Guido. I was really relieved with your thoughtful kindness. Regards, -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA wrote:
Thanks for joining in. I had hoped to hear a word from you on the subject :-)
Ok, how about this: we include the C versions of your codecs in the distribution and you take over maintenance as soon as time permits. Still, I'd would love to see some further improvement of the size and performance of the codecs (and maybe support for the new error callbacks; something which Hisao has integrated into his codecs). Would it be possible for you two to team up for the further developement of the Japanese codecs ? Perhaps Hye-Shik Chang could join you in the effort, since he's the author of the KoreanCodecs package which has somewhat similar problem scope (that of stateful encodings with a huge number of mappings) ? Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

M.-A. Lemburg wrote:
I believe (without checking in detail) that the "statefulness" is also an issue in these codecs. Many of the CJK encodings aren't stateful beyond being multi-byte (except for the iso-2022 ones). IOW, there is a non-trivial state only if you process the input byte-for-byte: you have to know whether you are a the first or second byte (and what the first byte was if you are at the second byte). AFAICT, both Japanese codecs assume that you can always look at the second byte when you get the first byte. Of course, this assumption is wrong if you operate in a stream mode, and read the data in, say, chunks of 1024 bytes: such a chunk may split exactly between a first and second byte (*). In these cases, I believe, both codecs would give incorrect results. Please correct me if I'm wrong. Regards, Martin (*) The situation is worse for GB 18030, which also has 4-byte encodings.

"Martin v. Löwis" <martin@v.loewis.de> writes: | | M.-A. Lemburg wrote: | > Perhaps Hye-Shik Chang could join you in the effort, since he's | > the author of the KoreanCodecs package which has somewhat | > similar problem scope (that of stateful encodings with a huge | > number of mappings) ? | | I believe (without checking in detail) that the "statefulness" is also | an issue in these codecs. | | Many of the CJK encodings aren't stateful beyond being multi-byte | (except for the iso-2022 ones). IOW, there is a non-trivial state only | if you process the input byte-for-byte: you have to know whether you are | a the first or second byte (and what the first byte was if you are at | the second byte). AFAICT, both Japanese codecs assume that you can | always look at the second byte when you get the first byte. Right, as far as my codecs are concerned. All decoders in the JapaneseCodecs package assume that the input byte sequence does not end in a middle of a multi-byte character. The iso-2022 decoders even assume that the input sequence is a "valid" text as defined in RFC1468 (i.e. the text must end in the US ASCII mode). However, AFAIK, these assumptions in the decoders seem well-accepted in the real world applications. The StreamReader/Writer classes in JapaneseCodecs can cope with the statefulness, BTW. -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:
The StreamReader/Writer classes in JapaneseCodecs can cope with the statefulness, BTW.
I see. I was really concerned about the stream reader only; I agree that it is perfectly reasonable to assume that an individual string is "complete" with regard to the encoding. Notice that your codec is thus advanced over both the Python UTF-8 codec, and Hisao's codec: neither of those care about this issue; this is a bug in both. Regards, Martin

"M.-A. Lemburg" <mal@lemburg.com> writes: | | Ok, how about this: we include the C versions of your codecs | in the distribution and you take over maintenance as soon | as time permits. I agree. -- KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>

On Wed, Jan 22, 2003 at 09:36:17PM +0100, M.-A. Lemburg wrote:
I just submitted a patch for incorporating Korean Codecs (SF #684142). It's just around 55KB as a stripped 32bit binary. I'm studying Japanese and Chinese encodings nowadays. I'll try to make a patch for Tamito's JapaneseCodecs and Chen Chien-Hsun's ChineseCodecs for PEP293 support and small binary size if I can do. :)
Regards, Hye-Shik =)
participants (10):
- "Martin v. Löwis"
- Atsuo Ishimoto
- Atsuo Ishimoto
- barry@python.org
- Guido van Rossum
- Hye-Shik Chang
- M.-A. Lemburg
- martin@v.loewis.de
- Tamito KAJIYAMA
- Walter Dörwald
Walter Dörwald