[Python-ideas] Processing surrogates in
Steven D'Aprano
steve at pearwood.info
Thu May 7 17:31:24 CEST 2015
On Thu, May 07, 2015 at 02:41:34AM +0100, Rob Cliffe wrote:
> This is no doubt *not* the best platform to raise these thoughts (which
> are nothing to do with Python - apologies), but I'm not sure where else
> to go.
> I watch discussions like this ...
> I watch posts like this one [Nick's] ...
> ... And I despair. I really despair.
>
> I am a very experienced but old (some would say "dinosaur") programmer.
> I appreciate the need for Unicode. I really do.
> I don't understand Unicode and all its complications AT ALL.
> And I can't help wondering:
> Why, oh why, do things have to be SO FU*****G COMPLICATED? This
> thread, for example, is way over my head. And it is typical of many
> discussions I have stared at, uncomprehendingly.
> Surely 65536 (2-byte) encodings are enough to express all characters in
> all the languages in the world, plus all the special characters we need.
Not even close.
Unicode currently encodes over 74,000 CJK (Chinese/Japanese/Korean)
ideographs, which is comfortably larger than 2**16, so no 16-bit
encoding can handle the complete range of CJK characters.
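(A quick illustration in Python 3 -- mine, not anything official: code points
above U+FFFF are exactly the ones a single 16-bit unit can't reach, and
assigned CJK ideographs already live up there.)

    import unicodedata
    ch = "\U00020000"              # first code point of CJK Extension B
    print(hex(ord(ch)))            # 0x20000, well beyond 0xffff
    print(unicodedata.name(ch))    # CJK UNIFIED IDEOGRAPH-20000
    print(ord(ch) > 0xFFFF)        # True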
It will probably take many more years before the entire CJK character
set is added to Unicode, simply because the characters left to add are
obscure and rare. Some may never be added at all, e.g. in 2007 Taiwan
withdrew a submission to add 6,545 characters used as personal names as
they were deemed to no longer be in use.
That's just *one* writing system. Then we add Latin, Cyrillic (Russian),
Greek/Coptic, Arabic, Hebrew, Korea's other writing system Hangul, Thai,
and dozens of others. (Fortunately, unlike Chinese characters, the other
writing systems typically need only a few dozen or a few hundred
characters, not tens of thousands.) Plus dozens of punctuation marks,
symbols from
mathematics, linguistics, and much more. And the Unicode Consortium
projects that at least another five thousand characters will be added in
version 8, and probably more beyond that.
So no, two bytes is not enough.
Unicode actually fits into 21 bits, which is a bit less than three
bytes, but for machine efficiency four bytes will often be used.
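If you want to check those numbers for yourself, a quick Python 3 session
shows them (my sketch, not something from this thread):

    import sys
    print(hex(sys.maxunicode))          # 0x10ffff, the largest code point
    print(sys.maxunicode.bit_length())  # 21 bits are enough to hold it
    # UTF-8 spends between one and four bytes per character:
    print(len("A".encode("utf-8")))           # 1 byte  (ASCII range)
    print(len("€".encode("utf-8")))           # 3 bytes (inside the BMP)
    print(len("\U00010000".encode("utf-8")))  # 4 bytes (outside the BMP)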
> Why can't there be just *ONE* universal encoding? (Decided upon, no
> doubt, by some international standards committee. There would surely be
> enough spare codes for any special characters etc. that might come up in
> the foreseeable future.)
The problem isn't so much with the Unicode encodings (of which there are
only a handful, and most of the time you only use one, UTF-8) but with
the dozens and dozens of legacy encodings invented during the dark ages
before Unicode.
> *Is it just historical accident* (partly due to an awkward move from
> 1-byte ASCII to 2-byte Unicode, implemented in many different places, in
> many different ways) *that we now have a patchwork of encodings that we
> strive to fit into some over-complicated scheme*?
Yes, it is a historical accident. In the 1960s, 70s and 80s, national
governments and companies created a plethora of one-byte (and
occasionally two-byte) encodings to support their own languages and
symbols. For example, in
the 1980s, Apple used their own idiosyncratic set of 256 characters,
which didn't match the 256 characters used on DOS, which was different
again from those on Amstrad...
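To see how incompatible those legacy encodings are, here's a tiny Python 3
sketch (the choice of encodings is mine, purely for illustration): the very
same byte decodes to three different characters depending on which legacy
encoding you assume.

    raw = bytes([0xE9])
    for enc in ("latin-1", "cp437", "mac_roman"):
        print(enc, "->", raw.decode(enc))
    # latin-1   -> é   (ISO/Western European)
    # cp437     -> Θ   (original IBM PC / DOS)
    # mac_roman -> È   (classic Mac OS)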
Unicode was started in the 1990s to bring order to that chaos. If you
think things are complicated with Unicode, they would be much worse
without it.
> Or is there *really* some *fundamental reason* why things *can't* be
> simpler? (Like, REALLY, _*REALLY*_ simple?)
90% of the complexity is due to the history of text encodings on various
computer platforms. If people had predicted cheap memory and the
Internet back in the early 1960s, perhaps we wouldn't have ended up with
ASCII and the dozens of incompatible "Extended ASCII" encodings as we
know them today.
But the other 90% of the complexity is inherent to human languages. For
example, you know what the lower case of "I" is, don't you? It's "i".
But not in Turkey, which has both a dotted and dotless version:
    I → ı   (dotless capital and its lowercase)
    İ → i   (dotted capital and its lowercase)
(Strangely, as far as I know, nobody has a dotted J or dotless j.)
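In Python 3 terms (my example, not part of the original discussion), those
are four distinct code points, and the default case mapping deliberately
ignores the Turkish rule:

    import unicodedata
    for ch in "Iıİi":
        print("U+%04X  %s  %s" % (ord(ch), ch, unicodedata.name(ch)))
    # U+0049  I  LATIN CAPITAL LETTER I
    # U+0131  ı  LATIN SMALL LETTER DOTLESS I
    # U+0130  İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
    # U+0069  i  LATIN SMALL LETTER I
    print("I".lower())   # "i" -- the locale-independent default;
                         # Turkish-aware software must special-case I -> ı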
Consequently, Unicode has a bunch of complexity related to left-to-right
and right-to-left writing systems, accents, joiners, variant forms, and
other issues. But, unless you're actually writing in a language which
needs that, or writing a word-processor application, you can usually
ignore all of that and just treat them as "characters".
> Imagine if we were starting to design the 21st century from scratch,
> throwing away all the history? How would we go about it?
Well, for starters I would insist on re-introducing thorn þ and eth ð
back into English :-)
--
Steve