[Python-Dev] Adding Japanese Codecs to the distro

M.-A. Lemburg mal@lemburg.com
Thu, 16 Jan 2003 12:22:58 +0100


Martin v. Löwis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
>
>>Thoughts ?
>
> I'm in favour of adding support for Japanese codecs, but I wonder
> whether we shouldn't incorporate the C version of the Japanese codecs
> package instead, despite its size.

I was suggesting that we make Suzuki's codecs the default. That
doesn't prevent Tamito's codecs from working, since those
live inside a package.

If someone wants the C codecs, we should provide them as a
separate download right alongside the standard distro (as
discussed several times before).

Note that the C codecs are not as easy to adapt to special
needs as the Python ones. While this may seem unnecessary,
I've heard from a few people that companies in particular tend
to extend the mappings with their own set of company-specific
code points.
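
To illustrate why a table-driven Python codec is easy to extend, here
is a minimal sketch. It is not Suzuki's or Tamito's code: the decoder,
the table names and the company-specific byte sequences are all made up
for this example, and the base table holds only two sample entries.

```python
# Toy table-driven decoder for a Shift_JIS-like encoding.
# Only two real sample entries are included; a real codec has thousands.
BASE_TABLE = {
    b"\x82\xa0": "\u3042",  # HIRAGANA LETTER A
    b"\x82\xa2": "\u3044",  # HIRAGANA LETTER I
}

# Hypothetical company-specific additions, mapped to Unicode
# private-use code points. These entries are assumptions.
COMPANY_EXTRAS = {
    b"\xf0\x40": "\ue000",
    b"\xf0\x41": "\ue001",
}

# Extending the codec is a plain dictionary merge.
EXTENDED_TABLE = {**BASE_TABLE, **COMPANY_EXTRAS}


def is_lead_byte(byte):
    """Return True if byte starts a two-byte sequence in this toy scheme."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def decode(data, table):
    """Decode bytes using ASCII pass-through plus a two-byte lookup table.

    Single-byte half-width katakana are deliberately not handled here.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            out.append(chr(byte))
            i += 1
        elif is_lead_byte(byte):
            pair = bytes(data[i:i + 2])
            if pair not in table:
                raise ValueError(f"unmapped sequence {pair!r} at offset {i}")
            out.append(table[pair])
            i += 2
        else:
            raise ValueError(f"unsupported byte {byte:#x} at offset {i}")
    return "".join(out)
```

With the base table, `decode(b"\xf0\x40", BASE_TABLE)` raises
`ValueError`; with `EXTENDED_TABLE` the same bytes decode to the
private-use character. Doing the equivalent in a C codec means editing
and recompiling the generated tables.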

> I would also suggest that it might be more worthwhile to expose
> platform codecs, which would give us all CJK codecs on a number of
> major platforms, with a minimum increase in the size of the Python
> distribution, and with very good performance.

+1

We already have this on Windows (via the mbcs codec). If you
could contribute your iconv codecs under the PSF license we'd
go a long way in that direction on Unix as well.
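
As a rough illustration of what using a platform codec looks like from
the Python side, the sketch below prefers the Windows mbcs codec when it
is registered and otherwise falls back to the locale's preferred
encoding. The function name is made up for this example, and no iconv
codec is assumed to exist.

```python
import codecs
import locale


def platform_codec_name():
    """Return the name of a codec backed by the platform's own tables.

    "mbcs" is only registered on Windows; elsewhere this falls back to
    whatever encoding the current locale reports.
    """
    try:
        codecs.lookup("mbcs")
        return "mbcs"
    except LookupError:
        return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print(platform_codec_name())
```

Note that the result depends on the machine's configuration, which is
exactly why "mbcs" cannot stand in for one fixed encoding.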

> *If* Suzuki's code is incorporated, I'd like to get independent
> confirmation that it is actually correct.

Since he built the codecs on the mappings in Java, this
looks like enough third-party confirmation already.

> I know Tamito has taken many
> iterations until it was correct, where "correct" is a somewhat fuzzy
> term, since there are some really tricky issues for which there is no
> single one correct solution (like whether \x5c is a backslash or a Yen
> sign, in these encodings). I notice (with surprise) that the actual
> mapping tables are extracted from Java, through Jython.

Indeed. I think that this kind of approach is a good one in
the light of the "correctness" problems you mention above.
It also helps with the compatibility side.

> I also dislike absence of the cp932 encoding in Suzuki's codecs. The
> suggestion to equate this to "mbcs" on Windows is not convincing, as
> a) "mbcs" does not mean cp932 on all Windows installations, and b)
> cp932 needs to be processed on other systems, too. I *think* cp932
> could be implemented as a delta to shift-jis, as shown in
>
> http://hp.vector.co.jp/authors/VA003720/lpproj/test/cp932sj.htm
>
> (although I wonder why they don't list the backslash issue as a
> difference between shift-jis and cp932)

As always: contributions are welcome :-)
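
For anyone who wants to explore the "cp932 as a delta to shift-jis"
idea, here is a small sketch that lists byte sequences on which two
codecs disagree. It assumes codecs named "shift_jis" and "cp932" are
available, as they are in current Python releases; these are not the
packages discussed in this thread, and the helper names are made up
for the example.

```python
def decode_or_none(data, encoding):
    """Decode data with the given codec, or return None if it fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def codec_delta(samples, base="shift_jis", other="cp932"):
    """Map each sample on which the codecs differ to (base, other) results."""
    delta = {}
    for data in samples:
        base_result = decode_or_none(data, base)
        other_result = decode_or_none(data, other)
        if base_result != other_result:
            delta[data] = (base_result, other_result)
    return delta


# A few probe sequences: a hiragana letter, the wave dash position,
# and a position in the vendor extension area.
SAMPLES = [b"\x82\xa0", b"\x81\x60", b"\x87\x40"]

if __name__ == "__main__":
    for data, (base_result, other_result) in codec_delta(SAMPLES).items():
        print(data, repr(base_result), repr(other_result))
```

Adding b"\x5c" to the samples shows how a given pair of implementations
treats the backslash / Yen question on decoding.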

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/