[Python-ideas] Processing surrogates in

MRAB python at mrabarnett.plus.com
Thu May 7 04:15:20 CEST 2015


On 2015-05-07 02:41, Rob Cliffe wrote:
> This is no doubt *not* the best platform to raise these thoughts (which
> are nothing to do with Python - apologies), but I'm not sure where else
> to go.
> I watch discussions like this ...
> I watch posts like this one [Nick's] ...
> ...  And I despair.  I really despair.
>
> I am a very experienced but old (some would say "dinosaur") programmer.
> I appreciate the need for Unicode.  I really do.
> I don't understand Unicode and all its complications AT ALL.
> And I can't help wondering:
>      Why, oh why, do things have to be SO FU*****G COMPLICATED?  This
> thread, for example, is way over my head.  And it is typical of many
> discussions I have stared at, uncomprehendingly.
> Surely 65536 (2-byte) encodings are enough to express all characters in
> all the languages in the world, plus all the special characters we need.
> Why can't there be just *ONE* universal encoding?  (Decided upon, no
> doubt, by some international standards committee. There would surely be
> enough spare codes for any special characters etc. that might come up in
> the foreseeable future.)
>
> *Is it just historical accident* (partly due to an awkward move from
> 1-byte ASCII to 2-byte Unicode, implemented in many different places, in
> many different ways) *that we now have a patchwork of encodings that we
> strive to fit into some over-complicated scheme*?
> Or is there *really* some *fundamental reason* why things *can't* be
> simpler?  (Like, REALLY, _*REALLY*_ simple?)
> Imagine if we were starting to design the 21st century from scratch,
> throwing away all the history?  How would we go about it?
> (Maybe I'm just naive, but sometimes ... Out of the mouths of babes and
> sucklings.)
> Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive me.
> :-) )
> I would be grateful for any enlightenment - thanks in advance.
> Rob Cliffe
>
When Unicode first came out, they thought that 65536 would be enough.
When Java was released, for example, it used 16 bits per codepoint.
Simple.

But it turned out that it wasn't enough. People have been too inventive
over thousands of years!
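In Python 3 you can see the consequence directly: a codepoint beyond the
16-bit range is still one character, but its UTF-16 encoding needs two
16-bit code units, a "surrogate pair". A minimal sketch:

```python
# U+1F600 lies outside the Basic Multilingual Plane (the first 65536
# codepoints), so UTF-16 has to split it into a surrogate pair.
ch = "\U0001F600"            # GRINNING FACE
assert len(ch) == 1          # one codepoint in Python 3

encoded = ch.encode("utf-16-be")
assert len(encoded) == 4     # two 16-bit units = 4 bytes

# The first unit is a high surrogate (D800-DBFF), the second a low
# surrogate (DC00-DFFF); together they encode the full codepoint.
high = int.from_bytes(encoded[:2], "big")
low = int.from_bytes(encoded[2:], "big")
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF
```

This is exactly the trick Unicode added when 65536 codes turned out not
to be enough: reserved 16-bit values that only have meaning in pairs.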

There's the matter of accents and other diacritics. Some languages want
to add marks to the letters to indicate a different pronunciation,
stress, tone, whatever (a character might need more than one!). Having
a separate code for each combination would lead to a _lot_ of codes,
so a better solution is to add codes that can combine with the base
character when displayed.
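Python's standard library shows both approaches side by side: "é" exists
as a single precomposed codepoint and as a base letter plus a combining
accent, and `unicodedata.normalize` converts between the two forms. A
minimal sketch:

```python
import unicodedata

# "é" as one precomposed codepoint, and as 'e' plus a combining accent.
precomposed = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"            # 'e' + COMBINING ACUTE ACCENT

assert precomposed != combined  # different codepoint sequences...
assert len(precomposed) == 1 and len(combined) == 2

# ...but normalization maps between the composed (NFC) and
# decomposed (NFD) forms.
assert unicodedata.normalize("NFC", combined) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combined
```

Combining marks also stack, which is why one base character can carry
several of them without Unicode needing a code for every combination.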

And then there's the matter of writing direction. Some languages go 
left-to-right, others right-to-left.
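Unicode handles this by giving every codepoint a bidirectional category,
which renderers use to lay out mixed-direction text. You can inspect it
from Python (a small sketch using `unicodedata.bidirectional`):

```python
import unicodedata

# Each codepoint carries a bidirectional category.
assert unicodedata.bidirectional("A") == "L"        # Latin: left-to-right
assert unicodedata.bidirectional("\u05d0") == "R"   # Hebrew alef: right-to-left
assert unicodedata.bidirectional("\u0627") == "AL"  # Arabic alef: Arabic letter
```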

So, you think it's complicated? Don't blame Unicode, it's just trying
to cope with a very messy problem.
