[Python-3000] Character Set Independence

Guido van Rossum guido at python.org
Fri Sep 1 16:59:47 CEST 2006


I think in a sense Python *will* continue to support multiple
character sets -- as byte streams. IMO that's the only reasonable
approach. Unlike Matz, apparently, I've never heard complaints that
Python 2 doesn't have enough support for character sets larger than
Unicode, and that is effectively what it supports: encoded strings and
Unicode strings.
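
A minimal sketch of that two-type model, written in Python 3 terms
(bytes for encoded strings, str for Unicode strings). The sample text
and variable names are assumptions chosen for illustration, not part of
the original discussion:

```python
# Illustrative sketch only; the sample text and names are hypothetical.
text = "日本語"  # a Unicode string

# The same text carried as byte streams in three different encodings.
as_utf8 = text.encode("utf-8")
as_eucjp = text.encode("euc_jp")
as_sjis = text.encode("shift_jis")

# The byte streams all differ, so comparing them directly says nothing
# about whether the underlying text is the same.
distinct_streams = len({as_utf8, as_eucjp, as_sjis})

# Decoding each stream with its own codec recovers the same Unicode string.
decoded = [
    as_utf8.decode("utf-8"),
    as_eucjp.decode("euc_jp"),
    as_sjis.decode("shift_jis"),
]
all_equal = all(d == text for d in decoded)

print(distinct_streams, all_equal)
```

A program that only ever handles one encoding can pass the bytes
through untouched; decoding is needed only when text from different
encodings has to be compared or combined.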

--Guido

On 9/1/06, Paul Prescod <paul at prescod.net> wrote:
> I thought that others might find this reference interesting. It is Matz (the
> inventor of Ruby) talking about why he thinks that Unicode is good for what
> it does but not sufficient in general, along with some hints of what he
> plans for multinationalization in Ruby. The translation is rough and is
> lifted from this email:
>
> http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html
>
> I think that the gist of it is that Unicode will be "just one character set"
> supported by Ruby. This idea has been kicked around for Python before but
> you quickly run into questions about how you compare character strings from
> multiple character sets, to say nothing of the complexity of a regular
> expression engine that is agnostic to character encoding and character set.
>
> I guess Matz is the right guy to experiment with that stuff. Maybe it could
> be copied in Python 4K.
>
> What are your complaints towards Unicode?
> * it's thoroughly used, isn't it.
> * resentment towards Han unification?
>
> * inferiority complex of Japanese people?
> --
> What are your complaints towards Unicode?
> * no, no I do not have any complaints about Unicode
> * in the domains where Unicode is adequate
> --
> Then, why CSI?
>
>
> In most applications, UCS is enough thanks to Unicode.
> However, there are also applications for which this is not the case.
> --
> Fields for which Unicode is not enough
> Big character sets
> * Konjaku-Mojikyo (Japanese encoding which includes many more characters than Unicode)
>
> * TRON code
> * GB18030
> --
> Fields for which Unicode is not fitted
> Legacy encodings
> * conversion to UCS is useless
> * big conversion tables
> * round-trip problem
> --
> If a language chooses the UCS system
>
> * you cannot write non-UCS applications
> * you can't handle text that can't be expressed with Unicode
> --
> If a language chooses the CSI system
> * CSI is a superset of UCS
> * Unicode just has to be handled in CSI
>
> --
> ... is what we can say but
> * CSI is difficult
> * can it really be implemented?
> --
> That's where Japan's traditional arts come in
>
> Adaptation for the Japanese language of applications
> * Modification of English language applications to be able to process
> Japanese
>
> --
> Adaptation for the Japanese language of applications
>
> * What engineers of long ago experienced for sure
>  - Emacs (NEmacs)
>  - Perl (JPerl)
>  - Bash
> --
> Accumulation of know-how
>
> In Japan, the know-how of adaptation for the Japanese language
> (multi-byte text processing)
> has been accumulated.
> --
> Accumulation of know-how
>
> In the first place, just for local use,
> text in 3 encodings circulates
> (4 if including UTF-8)
> --
> Based on this know-how
>
> * multibyte text encodings
> * switching between encodings at the string level
> * processing them at practical speed
> is finished
> --
> Available encodings
>
> euc_tw euc_jp iso8859_* utf-8 utf-32le
>
> ascii euc_kr koi8 utf-16le utf-32be
> big5 gb2312 sjis utf-16be
>
> ...and many others
> If an encoding is stateless, in principle it can be made available.
> --
> It means
> For applications using only one encoding, code conversion is not needed
>
> --
> Moreover
> Applications wanting to handle multiple encodings can choose an
> internal encoding (generally Unicode) that includes all others
> --
> If you want to
> * you can also handle multiple encodings without conversion, leaving
> characters as they are
> * but this is difficult so I do not recommend it
> --
> However,
> only the basic part is done,
> it's far from being ready for practical use
> * code conversion
> * guessing encoding
>
> * etc.
> --
> For the time being, today
> I want to tell everyone:
> * UCS is practical
> * but not all-purpose
> * CSI is not impossible
> --
> The reason I'm saying that
> They may add CSI in Perl6 as they had added
>
> * Methods called by "."
> * Continuations
> from Ruby.
> Basically, they hate losing.
> --
> Thank you
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

