[I18n-sig] Intro + Encoding names issue

Matt Gushee matt.gushee@fourthought.com
Fri, 26 Oct 2001 14:10:55 +0000


Greetings, i14ers and l6ers--

I'm a developer at Fourthought, Inc., and I've been charged with
ensuring that our XML software will play nice with all languages in
the known universe.

Okay, maybe that's an exaggeration. But at least we would like to
support all languages/encodings that are commonly used on the
Internet. So you'll probably be hearing from me periodically. I will
probably be the one asking dumb questions -- while I have some
linguistics background, am bilingual+ (fluent in Japanese, a bit of Chinese
& Spanish), and know a bit about character encodings, this will be my
first serious attempt at software i18n. So have mercy ...

On to the namespace issue. Let me preface this by saying, if this has
already been thoroughly discussed, please feel free to point me to the
relevant thread. I did dig through several months' worth of the list
archives, and didn't find any discussion of this problem.

The other day I was putting together some Japanese-language test cases
for 4Suite server, and I found out that, although they work fine when
the source and output are in UTF-8, Shift_JIS and EUC-JP don't work
because the Japanese codecs need to be referenced with a 'japanese'
prefix:

	'japanese.euc_jp'
	'japanese.shift_jis'

I would like to be able to reference these encodings by their
conventional names, e.g. just plain 'euc-jp'; I was able to make it
work* by tweaking encodings/aliases.py, but that isn't really
satisfactory as a permanent solution.

    * Not completely true ... there is some code in 4Suite/PyXML that
      wrongly returns 'not-well-formed' errors on EUC-JP and Shift_JIS
      documents ... but that's a separate issue from accessing the
      codecs. Anyway, the alias hack allowed me to do simple I/O
      operations using the standard encoding names.
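
As a rough sketch of an alternative to editing encodings/aliases.py, a
search function registered with codecs.register() can resolve extra
names at run time. The alias names and the ALIASES table below are
hypothetical, and the targets are the euc_jp and shift_jis codecs
bundled with current Python releases, not the 'japanese.'-prefixed
modules discussed here; with those modules installed, the table values
would presumably be the prefixed names instead.

```python
import codecs

# Hypothetical alias table: keys are names the application wants to accept,
# values are names the codec registry already understands.
ALIASES = {
    "x_demo_euc_jp": "euc_jp",
    "x_demo_sjis": "shift_jis",
}


def _normalize(name):
    # Fold case and treat hyphens/spaces like underscores.
    return name.lower().replace("-", "_").replace(" ", "_")


def alias_search(name):
    """Search function for codecs.register(): resolve a known alias or return None."""
    target = ALIASES.get(_normalize(name))
    if target is None:
        return None  # let other registered search functions try
    return codecs.lookup(target)


codecs.register(alias_search)

# The alias can now be used wherever an encoding name is accepted.
encoded = "\u65e5\u672c\u8a9e".encode("x-demo-euc-jp")
```

Because the table lives in application code, nothing inside the
installed encodings package has to be modified.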

I understand that this modularization makes sense from a
code-maintenance standpoint, but the need for a language-specific
prefix is a real stumbling block for developing applications intended
to handle arbitrary encodings. Sure, I could tweak the 4Suite code to
alias the Japanese encoding names in an appropriate fashion, but then
what happens when 'koi8-r' becomes 'russian.koi8-r'? ... and so on.

I would suggest that codecs development place a high priority on these
principles:

* Developers should be able to use the codecs API without anticipating
  every encoding that might be used.
* End users should be able to install and use internationalized Python
  programs without knowing how the codecs work.
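
To illustrate the first principle, here is a minimal sketch of
application code that never mentions a specific encoding and relies
only on codecs.lookup() with whatever name the document or user
supplies. The transcode() helper is a hypothetical name, not part of
4Suite or the standard library.

```python
import codecs


def transcode(data, source, target):
    """Re-encode bytes from one named encoding to another.

    Hypothetical helper: the encoding names are passed through to the
    codec registry unchanged, so no per-language knowledge lives here.
    """
    try:
        src = codecs.lookup(source)
        dst = codecs.lookup(target)
    except LookupError as exc:
        # Report the failing name without guessing an alternative.
        raise ValueError("unsupported encoding: %s" % exc) from exc
    text, _ = src.decode(data)    # bytes -> str
    output, _ = dst.encode(text)  # str -> bytes
    return output
```

Code like this only works for every encoding if each one is reachable
under its conventional name, which is the point of the principle.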

So the ideal would be a solution that allowed codecs developers to
maintain separate packages, but have their component modules "plugged
in" to the encodings namespace on installation.

Apparently that kind of plug-in installation is impossible, or at least
very difficult, with Distutils. Maybe a workable compromise would be
some sort of codecs installation utility that would let end users, in
one simple step, insert a set of codecs into the main encodings
namespace.
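
One possible shape for such a compromise, sketched under assumptions:
instead of copying files into the encodings package, a search function
could look for codec modules in a configurable list of packages. The
CODEC_PACKAGES list and package_search() are hypothetical names, and
"encodings" appears in the list only as a stand-in so the sketch runs on
a stock Python; a separately maintained package such as 'japanese'
would be listed there instead. The sketch also assumes each codec
module exposes getregentry(), as the standard encodings modules do.

```python
import codecs
import importlib

# Assumption: packages whose modules expose getregentry().
# "encodings" is a stand-in; a real setup might list e.g. "japanese".
CODEC_PACKAGES = ["encodings"]


def package_search(name):
    """Search function: try <package>.<name> for each listed package."""
    modname = name.lower().replace("-", "_").replace(" ", "_")
    if not modname.isidentifier():
        return None  # not something that could be a module name
    for package in CODEC_PACKAGES:
        try:
            module = importlib.import_module(package + "." + modname)
        except ImportError:
            continue  # this package has no such codec module
        getregentry = getattr(module, "getregentry", None)
        if getregentry is not None:
            return getregentry()
    return None


codecs.register(package_search)
```

An installer would then only need to append a package name to the list,
leaving each codec package separately maintained.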

So, what are your thoughts on this? Again, if I am rehashing previous
threads, I'll be happy to review them if you can let me know where
they are.

-- 
Matt Gushee                               Consultant
matt.gushee@fourthought.com               +1 303 583 9900 x108
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management