[I18n-sig] Re: [XML-SIG] Character encodings and expat
Lars Marius Garshol
31 Oct 2000 12:01:24 +0100
* Lars Marius Garshol
| That's only Shift-JIS and EUC-JP, though. Is there any concerted
| effort afoot to make a more complete set? At the very least, ISO
| 2022-JP, Big5, VISCII, GB-2312 and EUC-KR should be implemented.
* Andy Robinson
| That was the intention, but I admit we have run out of steam
| somewhat. Tamito Kajiyama is the only person to have made a really
| big contribution. [...] Volunteers welcome!
Then I may have a go at it if I can find the time. I've written
codecs for all of these in C++ over the past few weeks, so it should
be a simple job to redo them for Python. (That was for a closed-source
project, so unfortunately the code cannot be reused directly.)
| However, no sane person retypes mapping tables; if we built
| something Pythonic we'd hopefully do it by extracting data from two
| different sources, building our own tables and checking they got
| identical results.
www.unicode.org provides mapping tables that are really easy to parse
with a Python script, so the codec tables can be generated from them.
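As a rough sketch of what such a script could look like: this assumes
the simple two-column layout (hex byte code, hex Unicode code point,
then a '#' comment) and skips entries that have no Unicode column.
The name parse_mapping and the SAMPLE lines are made up for
illustration; tables with more columns would need a small adjustment.

```python
def parse_mapping(lines):
    """Build a {byte code: Unicode code point} dict from mapping-table lines."""
    table = {}
    for line in lines:
        # Drop the trailing '#' comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # no Unicode column: treated here as undefined
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Illustrative lines in the style of a mapping table (not a real file).
SAMPLE = """\
# byte    Unicode  name
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x80\t0x20AC\t# EURO SIGN
0x81\t\t# UNDEFINED
"""

table = parse_mapping(SAMPLE.splitlines())
```

For a real table you would pass the opened file instead of
SAMPLE.splitlines(), since the function only needs an iterable of
lines.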
| With compression into a Zip file and careful use of diff-like
| techniques (all the obscure Asian codecs go like 'take this base
| encoding and add these extra code points'), I believe a good codec
| database could be quite small.
My binary collection of conversion tables for ISO 8859-1 through
8859-15, Windows-12xx, KOI8-R, VISCII, Shift-JIS, EUC-JP, ISO 2022-JP,
Big5, EUC-KR and GB-2312 is about 90 KB.
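The 'take this base encoding and add these extra code points' idea
quoted above could be sketched roughly as follows. This is only an
illustration, not how my tables are stored: make_delta and
apply_delta are hypothetical helpers, and latin9 here is a toy
variant of Latin-1 that changes just one position (0xA4, which ISO
8859-15 maps to the euro sign), not the full ISO 8859-15 table.

```python
def make_delta(base, full):
    """Return the entries of `full` that are absent from or differ in `base`."""
    return {code: uni for code, uni in full.items() if base.get(code) != uni}


def apply_delta(base, delta):
    """Rebuild a full table from a base table plus its stored differences."""
    merged = dict(base)
    merged.update(delta)
    return merged


# ISO 8859-1 maps every byte to the code point with the same value.
latin1 = {b: b for b in range(0x100)}

# Toy variant: identical to Latin-1 except at 0xA4 (illustrative only).
latin9 = dict(latin1)
latin9[0xA4] = 0x20AC  # EURO SIGN instead of CURRENCY SIGN

delta = make_delta(latin1, latin9)
rebuilt = apply_delta(latin1, delta)
```

Only the delta needs to be stored for the derived encoding. Note that
this simple form cannot express an entry that exists in the base but
is removed in the derived table; that would need an explicit marker.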