[Python-ideas] Processing surrogates in

Nick Coghlan ncoghlan at gmail.com
Thu May 7 07:27:14 CEST 2015


On 7 May 2015 at 11:41, Rob Cliffe <rob.cliffe at btinternet.com> wrote:
> Or is there really some fundamental reason why things can't be simpler?
> (Like, REALLY, REALLY simple?)

Yep, there are around 7 billion fundamental reasons currently alive,
and I have no idea how many more have gone before us: humans :)

Unicode is currently messy and complicated because human written
communication is messy and complicated, and that inherent complexity
didn't go anywhere once we started networking our computers together
and digitising our historical records.

Early versions of Unicode attempted to simplify things by only
considering the characters needed for dictionary words in major
living languages (which kept the total under 65k characters), but
folks in Asia and elsewhere were understandably upset when the
designers attempted to explain why it was OK for a "universal"
encoding to be unable to correctly represent the names of people and
places, while archivists and historical researchers were similarly
unimpressed when the designers tried to explain why their "universal"
encoding didn't adequately cover texts that were more than a few
decades old. Breaking down the
walls between historically silo'ed communications networks then made
things even more complicated, as historical proprietary encodings from
different telco networks needed to be mapped to the global standard
(this last process is a large part of where the assortment of emoji
characters in Unicode comes from).

However, most of the messiness and complexity in the digital realm
actually arises at the boundary between Unicode and *other encodings*.
That's why the fact that POSIX still uses ASCII as the default
encoding is such a pain, and why Apple instead unilaterally declared
that "everything shall be UTF-8" for Mac OS X, while Microsoft and
Java eventually settled on new UTF-16 APIs. We can't even assume ASCII
compatibility in general, as Shift-JIS, ISO-2022 and various other
East Asian codecs date from an era when international network
connectivity simply wasn't a problem encoding designers needed to
worry about, so solving *local* computing problems was a much larger
concern than compatibility with DARPA's then-nascent internet
protocols.
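
As a quick illustration of that last point (just a minimal sketch
using the standard library codecs, nothing specific to this thread),
the same four katakana characters come out very differently under
three common codecs, and only the UTF-8 bytes can safely be scanned
with ASCII-based assumptions:

    # Minimal sketch: the same text in three codecs. In Shift-JIS the
    # trail bytes reuse printable ASCII, and ISO-2022-JP switches
    # character sets via escape sequences, so naive byte-level ASCII
    # assumptions break down for both of them.
    text = "カタカナ"
    for codec in ("utf-8", "shift_jis", "iso2022_jp"):
        print(codec, text.encode(codec))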

I wrote an article attempting to summarise some of that history last
year: http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

I also gave a presentation about it at Australia's OSDC 2014 that
connected some of the dots even further back in history:
https://www.youtube.com/watch?v=xOadSc69Hrw (I also just noticed my
notes for the latter aren't currently online, which is an oversight
I'll aim to fix before too long).

As things stand, one suggestion I make to folks truly trying to
understand why we need Unicode (with all its complexity) is to
attempt to learn a foreign language that *doesn't use a Latin-based
script*. My own Japanese is atrociously bad, but it's good enough that
I can appreciate just how Anglo-centric most programming languages
(including Python) are. I'm also fully cognizant of the fact that as
bad as my written and spoken Japanese are, my ability to enter
Japanese text into a computer is entirely non-existent.

> Imagine if we were starting to design the 21st century from scratch,
> throwing away all the history?  How would we go about it?

We'd invite Japanese, Chinese, Indian, African, etc. developers to get
involved in the design process much earlier than we did. Ideally back
when the Western Union telegraph was first being designed, as the
consequences of some of those original binary encoding design choices
are still felt today :)

http://utf8everywhere.org/ makes the case that the closest we have to
that today is UTF-8 + streaming compression, and it's a fairly
compelling story. However, it's premised on a world where string
processing algorithms are all written to be UTF-8 aware, when a lot of
them, including those used in the Python standard library, were in
fact written assuming fixed-width encodings. Hence the Python 3.3
flexible string representation model, where a string's internal
storage is sized according to its largest code point, and you need to
use StringIO if you want to avoid having a single higher-plane code
point significantly increase the memory consumption of your string.
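
To make that memory effect concrete, here's a minimal sketch (nothing
more than sys.getsizeof on CPython) showing how a single code point
outside the Basic Multilingual Plane widens the storage used for the
whole string:

    import sys

    # With the flexible string representation, the per-character
    # storage width is chosen by the largest code point in the string.
    ascii_only = "a" * 1000
    with_astral = "a" * 999 + "\U0001F600"  # one emoji outside the BMP

    print(sys.getsizeof(ascii_only))   # roughly 1 KB: 1 byte per character
    print(sys.getsizeof(with_astral))  # roughly 4 KB: 4 bytes per character

(The exact figures vary between CPython versions, but the roughly 4x
jump from a single supplementary plane character is the relevant part.)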

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

