[Python-3000] Character Set Independence

Paul Prescod paul at prescod.net
Fri Sep 1 16:11:35 CEST 2006


I thought that others might find this reference interesting. It is Matz (the
inventor of Ruby) talking about why he thinks that Unicode is good for what
it does but not sufficient in general, along with some hints of what he
plans for multinationalization in Ruby. The translation is rough and is
lifted from this email:

http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html

I think that the gist of it is that Unicode will be "just one character set"
supported by Ruby. This idea has been kicked around for Python before, but
you quickly run into questions about how you compare character strings from
multiple character sets (sketched below), to say nothing of the complexity
of a character-encoding- and character-set-agnostic regular expression
engine.
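
As a concrete illustration of the comparison problem, here is a minimal
Python sketch (mine, not from the slides; it assumes Python 3-style
strings): the same character has different byte values in different
encodings, so byte-level comparison across character sets gives the
wrong answer.

    # HIRAGANA LETTER A (U+3042) in two Japanese encodings.
    sjis = "あ".encode("shift_jis")   # b'\x82\xa0'
    eucjp = "あ".encode("euc_jp")     # b'\xa4\xa2'

    print(sjis == eucjp)   # False: same character, different bytes

    # A CSI system must either decode both sides to a common
    # representation or carry per-pair conversion rules.
    print(sjis.decode("shift_jis") == eucjp.decode("euc_jp"))  # True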

I guess Matz is the right guy to experiment with that stuff. Maybe it could
be copied in Python 4K.

What are your complaints about Unicode?
* it's widely used already, isn't it?
* resentment over Han unification?
* an inferiority complex of Japanese people?
--
What are your complaints about Unicode?
* no, no, I have no complaints about Unicode
* within the domains where Unicode is adequate
--
Then, why CSI?

In most applications, UCS is enough thanks to Unicode.
However, there are also applications for which this is not the case.
--
Fields for which Unicode is not enough
Big character sets
* Konjaku-Mojikyo (a Japanese character set with many more characters than Unicode)
* TRON code
* GB18030
--
Fields for which Unicode is not well suited
Legacy encodings
* conversion to UCS is wasted work
* big conversion tables
* the round-trip problem (sketched below)
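
One concrete instance of the round-trip problem, as it shows up in
CPython's codecs (my example, not Matz's): the byte sequence 0x81 0x60
is WAVE DASH under the 'shift_jis' codec but FULLWIDTH TILDE under
Microsoft's 'cp932' variant, so text converted through Unicode may not
survive the trip back.

    raw = b"\x81\x60"

    print(raw.decode("shift_jis"))   # '\u301c' WAVE DASH
    print(raw.decode("cp932"))       # '\uff5e' FULLWIDTH TILDE

    # Going through Unicode and back can fail outright:
    try:
        raw.decode("shift_jis").encode("cp932")
    except UnicodeEncodeError:
        print("U+301C has no mapping back into cp932")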
--
If a language chooses the UCS system
* you cannot write non-UCS applications
* you can't handle text that can't be expressed with Unicode
--
If a language chooses the CSI system
* CSI is a superset of UCS
* Unicode is simply handled as one character set within CSI (see the toy
sketch below)
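
To make the idea concrete, here is a toy Python sketch of a CSI-style
string (the class and its behavior are my illustration, not Ruby's
actual design): each string carries its own encoding tag, compares
bytes directly when the tags match, and only converts when they differ.

    class CSIString:
        def __init__(self, data: bytes, encoding: str):
            self.data = data          # raw bytes, never converted eagerly
            self.encoding = encoding  # each string knows its own charset

        def __eq__(self, other):
            if self.encoding == other.encoding:
                return self.data == other.data  # fast path: no conversion
            # Cross-encoding comparison falls back to Unicode here; a
            # full CSI system could use direct conversion tables instead.
            return (self.data.decode(self.encoding)
                    == other.data.decode(other.encoding))

    a = CSIString("あ".encode("shift_jis"), "shift_jis")
    b = CSIString("あ".encode("euc_jp"), "euc_jp")
    print(a == b)   # True, via the cross-encoding path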
--
... is what we can say, but
* CSI is difficult
* can it really be implemented?
--
That's where Japan's traditional arts come in

Adapting applications for the Japanese language
* modifying English-language applications so that they can process Japanese
--
Adapting applications for the Japanese language

* something engineers of an earlier generation certainly went through
  - Emacs (NEmacs)
  - Perl (JPerl)
  - Bash
--
Accumulation of know-how

In Japan, know-how in Japanese-language adaptation
(multi-byte text processing)
has accumulated.
--
Accumulation of know-how

after all, just for local use,
text circulates in three different encodings
(four if you count UTF-8)
--
Based on this know-how, the following is already done:
* multibyte text encodings
* switching between encodings at the string level
* processing them at practical speed
--
Available encodings

euc_tw   euc_jp   iso8859_*  utf-8     utf-32le
ascii    euc_kr   koi8       utf-16le  utf-32be
big5     gb2312   sjis       utf-16be

...and many others
Any stateless encoding can, in principle, be supported (see the sketch below).
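
To show what "stateless" rules out, here is a small Python illustration
(mine, not from the slides): in ISO-2022-JP the meaning of a byte
depends on the shift state set by earlier escape sequences, so a
substring cannot be interpreted without scanning from the start, while
a stateless encoding like EUC-JP has no such hidden state.

    stateful = "あ".encode("iso2022_jp")
    print(stateful)   # b'\x1b$B$"\x1b(B' -- escapes switch character sets

    stateless = "あ".encode("euc_jp")
    print(stateless)  # b'\xa4\xa2' -- the byte pair stands on its own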
--
It means
applications using only one encoding need no code conversion
--
Moreover
applications wanting to handle multiple encodings can choose an
internal encoding (generally Unicode) that includes all the others
(sketched below)
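
This is essentially the decode-at-the-boundary pattern (the sketch and
function names below are mine, for illustration): convert to the chosen
internal encoding at the edges of the application and work in one
representation inside.

    def read_text(path: str, encoding: str) -> str:
        with open(path, "rb") as f:
            return f.read().decode(encoding)   # convert on the way in

    def write_text(path: str, text: str, encoding: str) -> None:
        with open(path, "wb") as f:
            f.write(text.encode(encoding))     # convert on the way out

    # Inside the application every string is Unicode, so an EUC-JP file
    # and a Shift_JIS file compare and concatenate freely -- at the
    # cost of the conversions themselves.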
--
If you want to
* you can also handle multiple encodings without conversion, leaving the
characters as they are
* but this is difficult, so I do not recommend it
--
However,
only the basic part is done;
it's far from ready for practical use
* code conversion
* encoding guessing (sketched below)
* etc.
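
For the encoding-guessing piece, the crudest approach is trial decoding
(a naive sketch of mine; real detectors use statistics over byte
frequencies): try candidate codecs in order and accept the first that
decodes cleanly. This can misidentify input, since many byte sequences
are valid in several encodings.

    def guess_decode(data: bytes,
                     candidates=("utf-8", "euc_jp", "shift_jis")):
        for enc in candidates:
            try:
                return data.decode(enc), enc
            except UnicodeDecodeError:
                continue
        raise ValueError("no candidate encoding fits")

    text, enc = guess_decode("日本語".encode("euc_jp"))
    print(enc, text)   # 'euc_jp' wins: the bytes are not valid UTF-8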
--
For the time being, what I want
to tell everyone today:
* UCS is practical
* but not all-purpose
* CSI is not impossible
--
The reason I'm saying this:
they may add CSI to Perl 6, just as they took
* methods called with "."
* continuations
from Ruby.
Basically, they hate losing.
--
Thank you