[I18n-sig] Re: [XML-SIG] Character encodings and expat

Andy Robinson andy@reportlab.com
Sun, 29 Oct 2000 07:14:01 -0000


> -----Original Message-----
> From: i18n-sig-admin@python.org [mailto:i18n-sig-admin@python.org]On
> Behalf Of Martin v. Loewis
> Sent: 27 October 2000 22:48
> To: larsga@garshol.priv.no
> Cc: i18n-sig@python.org; xml-sig@python.org
> Subject: [I18n-sig] Re: [XML-SIG] Character encodings and expat
>
>
> > That's only Shift-JIS and EUC-JP, though.  Is there any concerted
> > effort afoot to make a more complete set?  At the very least,
> > ISO 2022-JP, Big5, VISCII, GB-2312 and EUC-KR should be
> implemented.
>
That was the intention, but I admit we have run out of steam
somewhat.  Tamito Kajiyama is the only person to have made a
really big contribution.  I was hoping to contribute as well,
but that hope rested on a large customer project which needed
this stuff and then got cancelled; and running a startup takes
so much time that I won't manage much until ReportLab gets a
customer who needs to re-encode data.  When that happens, we'll
have to do it, and fast.

As an aside, we are currently doing the work to let ReportLab
use Adobe's Asian Font Packs, and those use the native
encodings.  So once that comes out, we'll be under a lot of
pressure to do the codecs as well.  I am hopeful of the first
half of next year, if nobody else has done the work by then.

In the meantime, frankly, not enough people need
it badly enough and nobody but Tamito has had a go.
Volunteers welcome!


> I'm always concerned by the fact that every package seems to come
> with its own set of conversion tables, instead of relying on other
> people to do a good job (and report bugs if they don't).  Tcl has
> such tables, Java does, X11 has some, ICU has more - I really
> can't see the reason to reimplement them all again in Python.

I don't use Tcl, Java or X11, and I don't know what ICU is,
but I do use Python on several platforms and would want to
know that the encodings library worked identically on all of
them - i.e. if there are bugs in the codecs, they are at least
consistent and can be fixed consistently.  I think this issue
was pretty much settled in MAL's original i18n proposal.
However, no sane person retypes mapping tables; if we built
something Pythonic, we would hopefully do it by extracting the
data from two different sources, building our own tables, and
checking that the two gave identical results.  With compression
into a zip file and careful use of diff-like techniques (all
the obscure Asian codecs amount to "take this base encoding and
add these extra code points"), I believe a good codec database
could be quite small.
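
As a rough sketch of the sort of thing I have in mind (the
mapping-file format, file names and helper functions below are
invented for illustration, not any existing API), each derived
encoding would be stored as a small delta against a base table,
and a table would only be accepted once versions built from two
independent sources agreed:

    # Hypothetical sketch: store each obscure codec as a base
    # mapping plus a small delta, and cross-check tables built
    # from two independent sources before shipping them.

    def load_table(path):
        """Read lines like '0x8140 0x3000' into a dict of
        {encoded value: Unicode code point}."""
        table = {}
        for line in open(path):
            line = line.split('#', 1)[0].strip()
            if not line:
                continue
            raw, uni = line.split()
            table[int(raw, 16)] = int(uni, 16)
        return table

    def build_delta(base, derived):
        """Keep only the entries where the derived encoding
        differs from the base."""
        delta = {}
        for key, value in derived.items():
            if base.get(key) != value:
                delta[key] = value
        return delta

    def apply_delta(base, delta):
        """Reconstruct the full table from the base plus delta."""
        table = dict(base)
        table.update(delta)
        return table

    def cross_check(table_a, table_b):
        """Insist that tables built from two sources agree."""
        keys = set(table_a) | set(table_b)
        bad = [k for k in keys if table_a.get(k) != table_b.get(k)]
        if bad:
            raise ValueError('sources disagree on %d code points'
                             % len(bad))
        return table_a

Only the (usually small) deltas would then need to go into the
zipped database, which is why I think the whole thing could
stay compact.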

- Andy Robinson