[Python-3000] Character Set Independence
"Martin v. Löwis"
martin at v.loewis.de
Wed Sep 13 06:10:38 CEST 2006
Paul Prescod wrote:
> I think that the gist of it is that Unicode will be "just one character
> set" supported by Ruby. This idea has been kicked around for Python
> before but you quickly run into questions about how you compare
> character strings from multiple character sets, to say nothing of the
complexity of a character encoding and character set agnostic
regular expression engine.
As Guido says, the arguments for "CSI (character set independence)"
are hardly convincing. Yes, there are cases where Unicode doesn't
"round-trip", but they are so obscure that (IMO) they can be ignored.
There are two problems in this respect with Unicode:
- in some cases, a character set may contain characters that are
not included in Unicode. This was a serious problem for Chinese
for quite some time, but I believe it is now fixed with the
plane-2 additions. If just round-tripping is the goal, it is
always possible for a codec to map such characters to the
private-use areas of Unicode. This is not optimal, since a
different codec may give a different meaning to the same PUA
characters, but there should rarely be a need to use them
in the first place.
- in some cases, the input encoding has multiple representations
for what becomes the same character in Unicode. For example,
in ISO-2022-JP, there are three ways to encode the Latin
letters (either in ASCII, or in the romaji part of either
JIS X 0208-1978 or JIS X 0208-1983). You can switch between
these in a single string; if you go back and forth through
Unicode, you get whatever normalized version .encode gives
you. While I have seen people bring this up now and then,
I don't recall anybody claiming that it is a real, practical
problem.
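Both points can be illustrated in Python. For the first, a sketch of the
PUA round-trip trick using a custom decode error handler (the handler
name 'pua-roundtrip' and the U+E000 base are assumptions made for
illustration, not an established scheme); for the second, the standard
iso2022_jp codec shows the normalization directly:

```python
import codecs

# --- Problem 1: round-tripping unmapped bytes through the private-use area ---
# Hypothetical scheme: map each undecodable byte 0xNN to U+E0NN in the PUA.
PUA_BASE = 0xE000

def pua_roundtrip(exc):
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(PUA_BASE + byte), exc.start + 1)
    raise exc

codecs.register_error('pua-roundtrip', pua_roundtrip)

data = b'hello \xff world'                 # 0xFF is not valid ASCII
text = data.decode('ascii', errors='pua-roundtrip')
restored = bytes(
    ord(c) - PUA_BASE if PUA_BASE <= ord(c) <= PUA_BASE + 0xFF else ord(c)
    for c in text
)
assert restored == data                    # the byte sequence round-trips

# --- Problem 2: ISO-2022-JP normalization through Unicode ---
# The kanji U+4E9C can be introduced with the JIS X 0208-1978 escape
# (ESC $ @) or the 1983 escape (ESC $ B); both decode to the same string.
s1978 = b'\x1b$@0!\x1b(B'.decode('iso2022_jp')
s1983 = b'\x1b$B0!\x1b(B'.decode('iso2022_jp')
assert s1978 == s1983 == '\u4e9c'

# Re-encoding picks one representation: the 1978 escape does not come back.
encoded = s1978.encode('iso2022_jp')
assert b'\x1b$B' in encoded and b'\x1b$@' not in encoded
```

The distinct escape sequences are gone after the Unicode round trip; only
whatever representation the encoder prefers survives.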
There is a third problem that people often associate with
Unicode: due to the Han unification, you don't know whether
a certain Han character originates from Chinese, Japanese,
or Korean. This is a problem when rendering Unicode: you
don't know what glyphs to use (as you should use different
glyphs depending on the natural language). With CSI, you
can use a "language-aware encoding": you use a Japanese
encoding for Japanese text, and so on, then use the encoding
to determine what the language is.
For Unicode, there are several ways to deal with it:
- you could carry language information along with the
original text. This is what is commonly done on the
web: you put language information into the HTML,
and then use that to render the text correctly.
- you could embed language information into the Unicode
string, using the plane-14 tag characters. This
should work fairly nicely, since you only need
a single piece of information, but has some drawbacks:
* you need four-byte Unicode, or surrogates
* if you slice such a string, the slices won't
carry the language tag
* applications today typically don't know how to
deal with tag characters
- you could guess the language from the content, based
on the frequency of characters (e.g. presence
of katakana/hiragana would indicate that it is
Japanese). As with all guessing, there are
cases where it fails. I believe that web browsers
commonly apply that approach, anyway.
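The last two options can be sketched in Python. The helper names
tag_language and guess_language are invented for illustration, and the
character-frequency heuristic here is a toy, not what any real browser
implements:

```python
# Plane-14 language tags: U+E0001 starts a tag, and U+E0020..U+E007E carry
# the ASCII tag text ("ja" here).  These are astral characters, hence the
# need for four-byte Unicode (or surrogates).
def tag_language(text, lang):            # hypothetical helper
    tag = '\U000E0001' + ''.join(chr(0xE0000 + ord(c)) for c in lang)
    return tag + text

tagged = tag_language('\u6771\u4eac', 'ja')   # "Tokyo" in kanji, tagged Japanese
assert len(tagged) == 5                       # 3 tag characters + 2 kanji
assert tagged[3:] == '\u6771\u4eac'           # a slice loses the language tag

# Guessing from content: presence of hiragana/katakana suggests Japanese,
# hangul syllables suggest Korean; Han-only text stays ambiguous.
def guess_language(text):                # hypothetical helper, toy heuristic
    if any('\u3040' <= c <= '\u30ff' for c in text):   # hiragana + katakana
        return 'ja'
    if any('\uac00' <= c <= '\ud7a3' for c in text):   # hangul syllables
        return 'ko'
    return 'unknown'

assert guess_language('\u6771\u4eac\u306b\u884c\u304f') == 'ja'  # has hiragana
assert guess_language('\u6771\u4eac') == 'unknown'               # Han only
```

The slicing assertion shows the drawback mentioned above: the language
tag lives at the front of the string, so any substring taken past it has
no language information at all.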