[Python-ideas] Processing surrogates in

Steven D'Aprano steve at pearwood.info
Thu May 7 17:31:24 CEST 2015


On Thu, May 07, 2015 at 02:41:34AM +0100, Rob Cliffe wrote:
> This is no doubt *not* the best platform to raise these thoughts (which 
> are nothing to do with Python - apologies), but I'm not sure where else 
> to go.
> I watch discussions like this ...
> I watch posts like this one [Nick's] ...
> ...  And I despair.  I really despair.
> 
> I am a very experienced but old (some would say "dinosaur") programmer.
> I appreciate the need for Unicode.  I really do.
> I don't understand Unicode and all its complications AT ALL.
> And I can't help wondering:
>     Why, oh why, do things have to be SO FU*****G COMPLICATED?  This 
> thread, for example, is way over my head.  And it is typical of many 
> discussions I have stared at, uncomprehendingly.
> Surely 65536 (2-byte) encodings are enough to express all characters in 
> all the languages in the world, plus all the special characters we need.

Not even close.

Unicode currently encodes over 74,000 CJK (Chinese/Japanese/Korean) 
ideographs, which is comfortably more than 2**16 = 65536, so no 
fixed-width 16-bit encoding can cover even the CJK characters alone.
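To make that concrete, here is a quick illustration at the interactive 
prompt (the particular ideograph is just one I picked from CJK 
Extension B):

    >>> ch = '\U0002070E'     # a CJK ideograph well outside the 16-bit range
    >>> hex(ord(ch))
    '0x2070e'
    >>> len(ch.encode('utf-16-le'))   # even UTF-16 needs two 16-bit code units here
    4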

It will probably take many more years before the entire CJK character 
set is added to Unicode, simply because the characters left to add are 
obscure and rare. Some may never be added at all: e.g. in 2007 Taiwan 
withdrew a submission to add 6,545 characters used as personal names, 
because they were deemed to be no longer in use.

That's just *one* writing system. Then we add Latin, Cyrillic (Russian), 
Greek/Coptic, Arabic, Hebrew, Korea's other writing system, Hangul, Thai, 
and dozens of others. (Fortunately, unlike Chinese characters, the other 
writing systems typically need only a few dozen or hundred characters, 
not tens of thousands.) Plus dozens of punctuation marks, symbols from 
mathematics, linguistics, and much more. And the Unicode Consortium 
projects that at least another five thousand characters will be added in 
version 8, and probably more beyond that.

So no, two bytes is not enough.

Unicode actually fits into 21 bits, which is a bit less than three 
bytes, but for machine efficiency four bytes will often be used.
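If you want to see those numbers for yourself, here is a small sketch 
(the code points are just ones I picked to show the one- to four-byte 
UTF-8 forms):

    >>> (0x10FFFF).bit_length()    # the highest code point Unicode allows
    21
    >>> len(chr(0x10FFFF).encode('utf-32-le'))    # fixed four bytes per character
    4
    >>> [len(chr(cp).encode('utf-8')) for cp in (0x41, 0x3B1, 0x4E2D, 0x10FFFF)]
    [1, 2, 3, 4]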


> Why can't there be just *ONE* universal encoding?  (Decided upon, no 
> doubt, by some international standards committee. There would surely be 
> enough spare codes for any special characters etc. that might come up in 
> the foreseeable future.)

The problem isn't so much with the Unicode encodings (of which there are 
only a handful, and most of the time you only use one, UTF-8) but with 
the dozens and dozens of legacy encodings invented during the dark ages 
before Unicode.
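The trouble with those legacy encodings is that a byte on its own 
doesn't tell you which table it came from. A quick demonstration (the 
byte and the code pages here are just ones I picked for illustration):

    >>> data = b'\xe9'          # one byte, with no label saying what it means
    >>> data.decode('latin-1')  # Western European
    'é'
    >>> data.decode('cp437')    # the old IBM PC / DOS code page
    'Θ'
    >>> data.decode('cp1251')   # Windows Cyrillic
    'й'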


> *Is it just historical accident* (partly due to an awkward move from 
> 1-byte ASCII to 2-byte Unicode, implemented in many different places, in 
> many different ways) *that we now have a patchwork of encodings that we 
> strive to fit into some over-complicated scheme*?

Yes, it is a historical accident. In the 1960s, 70s and 80s, national 
governments and companies created a plethora of one-byte (and the 
occasional two-byte) encodings to support their own languages and symbols. E.g. in 
the 1980s, Apple used their own idiosyncratic set of 256 characters, 
which didn't match the 256 characters used on DOS, which was different 
again from those on Amstrad...

Unicode was started in the 1990s to bring order to that chaos. If you 
think things are complicated with Unicode, they would be much worse 
without it.


> Or is there *really* some *fundamental reason* why things *can't* be 
> simpler?  (Like, REALLY, _*REALLY*_ simple?)

90% of the complexity is due to the history of text encodings on various 
computer platforms. If people had predicted cheap memory and the 
Internet back in the early 1960s, perhaps we wouldn't have ended up with 
ASCII and the dozens of incompatible "Extended ASCII" encodings as we 
know them today.

But the other 90% of the complexity is inherent to human languages. For 
example, you know what the lower case of "I" is, don't you? It's "i". 
But not in Turkey, which has both a dotted and dotless version:

    I ı     (the dotless pair)
    İ i     (the dotted pair)

(Strangely, as far as I know, nobody has a dotted J or dotless j.)
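Python's str methods use the default, language-neutral Unicode case 
mappings, so you can see a little of this at the interactive prompt 
(just an illustration; a Turkish-correct mapping of "I" to dotless "ı" 
needs a locale-aware library, which the str type doesn't provide):

    >>> 'I'.lower()        # the default mapping, wrong for Turkish
    'i'
    >>> '\u0131'.upper()   # dotless ı upper-cases to plain I
    'I'
    >>> [hex(ord(c)) for c in '\u0130'.lower()]   # dotted İ becomes i + combining dot
    ['0x69', '0x307']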

Consequently, Unicode has a bunch of complexity related to left-to-right 
and right-to-left writing systems, accents, joiners, variant forms, and 
other issues. But, unless you're actually writing in a language which 
needs that, or writing a word-processor application, you can usually 
ignore all of that and just treat them as "characters".
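(For the curious, here is one small taste of the "variant forms" part: 
the same accented letter can be spelled as a single code point or as a 
base letter plus a combining accent, and the unicodedata module will 
normalise one spelling into the other. Just a sketch:)

    >>> import unicodedata
    >>> 'caf\u00e9' == 'cafe\u0301'    # precomposed é versus e + combining acute
    False
    >>> unicodedata.normalize('NFC', 'cafe\u0301') == 'caf\u00e9'
    True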


> Imagine if we were starting to design the 21st century from scratch, 
> throwing away all the history?  How would we go about it?

Well, for starters I would insist on re-introducing thorn þ and eth ð 
back into English :-)

-- 
Steve
