[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Fri Sep 12 19:24:15 CEST 2014

Jeff Allen writes:

 > Simply having a block "for private use" seems to create an unmanaged 
 > space for conflict,

No.  The uncharted range of human language (including recently-
invented nonsense like "emoticons" and the annual "design a character"
contest run by a newspaper in Taipei, with the grand prize being your
character gets added to the national standard IIRC, but maybe it's
just that newspaper's collection of private space characters) already
contains those conflicts.  Believe me, "private use space, manage it
yourself" was the best they could do.

I've been working with the bureaucratic insanity of the Japanese
national standard -- it took almost 3 decades before every Japanese
citizen could store their names in a computer using government-
approved codes -- and the chaos of the Taiwanese national standard --
which contains hordes of characters with one known use and no known
meaning, many of them duplicates -- for twenty years now.  Neither
approach works as well as Unicode's, despite its design-by-committee
flaws overlaid with national animosities that can flare into
linguicidal vetoes and code-space-stuffing logrolling.

 > reminiscent of the "other 128 characters" in bilingual
 > programming. I wondered if the way to respect use by applications
 > might be to make it private to a particular sub-class of str, idly
 > however.

If I understand your suggestion, that's precisely the intent of PEP
383, to make undecodable bytes in a coded character stream private.
But they need to be in the stream one way or another.  So PEP 383
chose to use a non-Unicode encoding (based on the "lone surrogate"
device invented by Markus Kuhn for UTF-8b) to deal with that, and that
does effectively make those elements private to Python (but of course
not in the Unicode sense, as they're not even characters in Unicode).
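
To make that concrete, here is a minimal sketch of the PEP 383 behavior
as exposed by CPython 3's "surrogateescape" error handler.  The sample
bytes and variable names are only illustrative.

```python
# 0xFF can never occur in well-formed UTF-8, so this byte string is undecodable.
raw = b"caf\xff.txt"

# With errors="surrogateescape", each undecodable byte 0xNN is mapped to the
# lone surrogate U+DC00 + 0xNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# The stray byte now sits in the str as a lone low surrogate, which is a code
# point but not a Unicode character.
escaped = text[3]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses the lone surrogate, so it cannot leak out of
# Python as if it were valid UTF-8.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, and the strict encode failing is what keeps
those elements private to Python.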

But I gather the "native" Unicode type in Java doesn't allow you to
use that dodge because it checks for malformed Unicode internally (i.e.,
at a level not controllable by Jython).  So you have to embed such
stream elements in the space of Unicode characters.  You have the
option of the private space or unallocated (reserved) space.  The
latter seems like asking for trouble, and the only way to avoid it
would be to be prepared to move that data around in case of collision.
But that's precisely what I'm suggesting doing in private space.  Same
issue, either way.  Private space with a local registry seems saner.
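
A rough sketch of what "private space with a local registry" might look
like, written in Python for illustration only.  PrivateRegistry and its
methods are hypothetical names, not anything in Jython or CPython, and
the allocation policy (first free code point in the BMP private use
area) is an assumption.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP private use area


class PrivateRegistry:
    """Hypothetical local registry: undecodable bytes <-> PUA code points."""

    def __init__(self):
        self.byte_to_cp = {}
        self.cp_to_byte = {}
        self.reserved = set()  # PUA code points claimed by genuine text
        self._next = PUA_START

    def _allocate(self):
        # Skip code points already claimed by text or by another escaped byte.
        while self._next <= PUA_END and (
            self._next in self.reserved or self._next in self.cp_to_byte
        ):
            self._next += 1
        if self._next > PUA_END:
            raise RuntimeError("private use area exhausted")
        cp = self._next
        self._next += 1
        return cp

    def escape(self, byte):
        """Return the private character currently standing in for `byte`."""
        if byte not in self.byte_to_cp:
            cp = self._allocate()
            self.byte_to_cp[byte] = cp
            self.cp_to_byte[cp] = byte
        return chr(self.byte_to_cp[byte])

    def reserve(self, cp):
        """Claim `cp` for genuine text, relocating any escaped byte stored there.

        Returns (old_char, new_char) if data had to move, else None.
        """
        self.reserved.add(cp)
        byte = self.cp_to_byte.pop(cp, None)
        if byte is None:
            return None
        new_cp = self._allocate()
        self.byte_to_cp[byte] = new_cp
        self.cp_to_byte[new_cp] = byte
        return chr(cp), chr(new_cp)


registry = PrivateRegistry()
marker = registry.escape(0xFF)       # first free slot stands in for byte 0xFF
stored = "caf" + marker

# Genuine text turns out to use the same private code point: a collision.
moved = registry.reserve(ord(marker))
if moved is not None:
    stored = stored.replace(*moved)  # move existing data to the new slot
relocated = registry.escape(0xFF)
```

The point of the sketch is only that moving data on collision is a small
bookkeeping step once a registry exists; the same step would be needed
in unallocated space, with less warning.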