[I18n-sig] Re: [XML-SIG] Character encodings and expat

Lars Marius Garshol larsga@garshol.priv.no
31 Oct 2000 12:01:24 +0100


* Lars Marius Garshol
|
| That's only Shift-JIS and EUC-JP, though.  Is there any concerted
| effort afoot to make a more complete set?  At the very least, ISO
| 2022-JP, Big5, VISCII, GB-2312 and EUC-KR should be implemented.

* Andy Robinson
|
| That was the intention, but I admit we have run out of steam
| somewhat.  Tamito Kajiyama is the only person to have made a really
| big contribution. [...] Volunteers welcome!

Then I may have a go at it if I can find the time.  I've written
codecs for all these in C++ over the past few weeks, so it should be a
simple job to redo them for Python.  (The C++ code was written for a
closed-source project, so unfortunately it cannot be reused directly.)
 
| However, no sane person retypes mapping tables; if we built
| something Pythonic we'd hopefully do it by extracting data from two
| different sources, building our own tables and checking they got
| identical results. 

www.unicode.org provides plain-text mapping files that are really easy
to parse with a Python script in order to build the codec tables.
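
A minimal sketch of such a script, assuming the simple two-column
layout (encoded value, Unicode code point, then a '#' comment) used by
many of those files.  The function name parse_mapping and the SAMPLE
data are made up for illustration, and files with more columns would
need a small adjustment:

```python
def parse_mapping(lines):
    """Build {encoded_value: unicode_code_point} from mapping-file lines."""
    table = {}
    for line in lines:
        # Strip the trailing '#' comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # encoded value listed without a mapping (undefined)
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical excerpt, written in the style of the published files.
SAMPLE = """\
# Name: illustrative sample only
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xA4\t0x20AC\t# EURO SIGN
0x81\t      \t#UNDEFINED
"""

table = parse_mapping(SAMPLE.splitlines())
```

Running the same parser over a second, independent source for the same
encoding and comparing the two dictionaries would give the
cross-check described above.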

| With compression into a Zip file and careful use of diff-like
| techniques (all the obscure Asian codecs go like 'take this base
| encoding and add these extra code points'), I believe a good codec
| database could be quite small.

My binary collection of conversion tables for ISO 8859-1 through
8859-15, Windows-12xx, KOI8-R, VISCII, Shift-JIS, EUC-JP, ISO 2022-JP,
Big5, EUC-KR and GB-2312 is about 90 KB in total.
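
The "base encoding plus extra code points" idea quoted above could be
sketched roughly as follows.  This is only an illustration, not how my
tables are stored: make_delta and apply_delta are hypothetical helper
names, and the two tables are taken from Python's built-in latin-1 and
iso8859-15 codecs simply to have real data to compare.

```python
def make_delta(base, derived):
    """Return (changed, removed) describing derived relative to base."""
    changed = {k: v for k, v in derived.items() if base.get(k) != v}
    removed = [k for k in base if k not in derived]
    return changed, removed


def apply_delta(base, changed, removed):
    """Rebuild the derived table from the base table and its delta."""
    table = dict(base)
    for key in removed:
        del table[key]
    table.update(changed)
    return table


def table_from_codec(name):
    """Build {byte: code_point} for a single-byte codec Python already has."""
    return {b: ord(bytes([b]).decode(name)) for b in range(256)}


latin1 = table_from_codec("latin-1")
latin9 = table_from_codec("iso8859-15")

# Only the positions where ISO 8859-15 differs from ISO 8859-1 are kept.
changed, removed = make_delta(latin1, latin9)
rebuilt = apply_delta(latin1, changed, removed)
```

Storing only the delta for each derived encoding, and then compressing
the result, is what should keep such a database small.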

--Lars M.