
"M.-A. Lemburg" <mal@lemburg.com> writes:
I was suggesting to make Suzuki's codecs the default. That doesn't prevent Tamito's codecs from working, since these are inside a package.
I wonder who will be helped by adding these codecs, if anybody who needs to process Japanese data on a regular basis will have to install that other package, anyway.
If someone wants the C codecs, we should provide them as separate download right alongside of the standard distro (as discussed several times before).
I still fail to see the rationale for that (or, rather, the rationale seems to vanish more and more). AFAIR, "size" was brought up as an argument against the code. However, the code base already contains huge amounts of code that not everybody needs, and the size increase on a binary distribution is rather minimal.
Note that the C codecs are not as easy to modify to special needs as the Python ones. While this may seem unnecessary I've heard from a few people that especially companies tend to extend the mappings with their own set of company specific code points.
The Python codecs are not easy to modify, either: there is a large generated table, and you actually have to understand the generation algorithm, augment it, and run it through Jython. After that, you get a new mapping table, which you need to carry around *instead* of the one shipped with Python. So any user who wants to extend the mapping needs the generator more than the generated output.

If you want to augment the codec as-is, i.e. by wrapping it, you had best install a PEP 293 error handler. This works nicely both with C codecs and pure Python codecs (out of the box, it probably works with neither of the candidate packages, but that would have to be fixed). Or, if you don't go the PEP 293 route, you can still use a plain wrapper around both codecs.
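To make the PEP 293 suggestion concrete, here is a minimal sketch of an error handler that adds a few extra code points on top of an existing codec. The mapping table, the byte values, and the handler name `company_extension` are hypothetical, and the built-in `ascii` codec is only a stand-in so that the sketch does not depend on either candidate package.

```python
import codecs

# Hypothetical company-specific additions: private-use code points mapped
# to byte sequences that the underlying codec knows nothing about.
EXTRA_ENCODE = {
    "\ue000": b"\xf0\x40",
    "\ue001": b"\xf0\x41",
}


def company_extension(exc):
    """PEP 293 error callback: encode extra code points, else re-raise."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    # The codec reports the unencodable span as exc.object[exc.start:exc.end].
    for ch in exc.object[exc.start:exc.end]:
        if ch not in EXTRA_ENCODE:
            raise exc  # still unencodable: keep the original error
        out += EXTRA_ENCODE[ch]
    # Return the replacement bytes and the position at which to resume.
    return bytes(out), exc.end


codecs.register_error("company_extension", company_extension)

# "ascii" is a stand-in for whichever codec is being extended.
encoded = "A\ue000B".encode("ascii", errors="company_extension")
```

The handler is registered once and selected by name at each call, so the shipped mapping table stays untouched. The sketch covers encoding only; decoding the extra byte sequences would need a matching handler for `UnicodeDecodeError`.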
We already have this on Windows (via the mbcs codec).
That is insufficient, though, since it gives access to a single platform codec only. I have some code sitting around that exposes the codecs from inet.dll (or some such); this is the codec library that IE6 uses.
If you could contribute your iconv codecs under the PSF license we'd go a long way in that direction on Unix as well.
Ok, will do. There are still some issues with the code itself that need to be fixed, then I'll contribute it.
*If* Suzuki's code is incorporated, I'd like to get independent confirmation that it is actually correct.
Since he built the codecs on the mappings in Java, this looks like enough third party confirmation already.
Not really. I *think* Sun has, when confronted with a popularity-or-correctness issue, taken the popularity side, leaving correctness alone.

Furthermore, the code doesn't use the Java tables throughout, but short-cuts them. E.g. in shift_jis.py, we find

    if i < 0x80:                # C0, ASCII
        buf.append(chr(i))

where i is a Unicode codepoint. I believe this is incorrect: In shift-jis, 0x5c is YEN SIGN, and indeed, the codec goes on with

    elif i == 0xA5:             # Yen
        buf.append('\\')

So it maps both REVERSE SOLIDUS and YEN SIGN to 0x5c; this is an error (if it was a CP932 codec, it might (*) have been correct). See

http://rf.net/~james/Japanese_Encodings.txt

Regards,
Martin

(*) I'm not sure here, it also might be that Microsoft maps YEN SIGN to the full-width yen sign, in CP 932.
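The collision described above can be reproduced with a small sketch. It is a Python 3 reconstruction of only the two quoted branches, not the actual contents of shift_jis.py; the function name `encode_low_range` and the `ValueError` fallback are assumptions made for illustration.

```python
def encode_low_range(text):
    """Mimic the two quoted branches: ASCII pass-through plus the yen case."""
    buf = bytearray()
    for ch in text:
        i = ord(ch)
        if i < 0x80:        # C0, ASCII: copied through unchanged
            buf.append(i)
        elif i == 0xA5:     # YEN SIGN: mapped to byte 0x5c
            buf.append(0x5C)
        else:
            # Everything else is outside the scope of this sketch.
            raise ValueError("code point not covered: U+%04X" % i)
    return bytes(buf)


REVERSE_SOLIDUS = "\u005c"
YEN_SIGN = "\u00a5"

# Both characters produce the same byte, so a decoder cannot restore both.
collision = encode_low_range(REVERSE_SOLIDUS) == encode_low_range(YEN_SIGN)
```

Because two distinct code points encode to the single byte 0x5c, at most one of them can survive a round trip through the codec.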