[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 08:32:00 CEST 2014

Steven D'Aprano writes:

[long example]

 > Am I right so far?
 > 
 > So the email package uses the surrogate-escape error handler and ends up 
 > with this Unicode string:
 > 
 > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
 > 
 > which can be encoded back to the bytes we started with.

Yes.
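
A minimal sketch of that round trip follows.  The byte string is an
assumption made up for illustration (the UTF-8 encoding of U+201C
followed by plain ASCII), not the actual message under discussion:

```python
# Hypothetical header bytes: UTF-8 for U+201C, then plain ASCII.
raw = b'Subject: \xe2\x80\x9cNOBODY expects'

# Decoding as ASCII with surrogateescape maps each undecodable byte
# 0xNN to the lone surrogate U+DCNN instead of raising.
text = raw.decode('ascii', errors='surrogateescape')

# Encoding with the same handler turns the surrogates back into the
# original bytes, so nothing is lost.
roundtrip = text.encode('ascii', errors='surrogateescape')
```

Each smuggled byte occupies exactly one code point in the str, so
slicing and concatenation keep working on the decoded header.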

 > Note that technically those three \u... code points are NOT classified 
 > as "noncharacters".

Very unpythonic terminology, easily confusing the nonspecialist.  Or
the specialist -- I used to know that Unicode gave "noncharacter" a
technical definition but it seems I forgot.  But then, Unicode isn't a
PSF product, so I guess it's OK to be unpythonic.<wink/>

 > They are actually surrogate code points:
 > 
 > http://www.unicode.org/faq/private_use.html#nonchar4
 > http://www.unicode.org/glossary/#surrogate_code_point
 > 
 > and they're supposed to be reserved for UTF-16. I'm not sure of the 
 > implication of that.

It means that any Python program that invokes the surrogateescape
handler is not a "conforming Unicode process", at least not on the
naive interpretation of that definition.  A conforming process would
treat lone surrogates as corrupt text and raise an error as soon as
it detected them.
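
Python's own strict codecs behave roughly that way at the encoding
boundary.  A small sketch, assuming CPython 3 and a made-up one-byte
input:

```python
# Smuggle one undecodable byte into a str as a lone surrogate.
text = b'\x9c'.decode('ascii', errors='surrogateescape')

# The UTF-8 codec with the default errors='strict' refuses to encode
# a lone surrogate.
try:
    text.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The difference is only in timing: the error shows up when the text
leaves the program, not when the surrogate is first created.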

A more sophisticated interpretation might argue that Python is
multiple processes (in the sense of "process" used by Unicode), and
that the Unicode standard only applies to characters.  This is
especially true of Pythons implementing PEP 393, since no surrogates
should ever appear in text[1] at all.  Then the smuggled bytes can be
treated as noncharacters in practice, although technically it's a
violation of the Unicode standard to do so.

Footnotes: 
[1]  Meaning, no fair using chr() to inject them into str!
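
To see why that would be unfair, here is a sketch (the variable
names are mine, chosen for illustration) showing that a surrogate
injected with chr() cannot be told apart from a smuggled byte:

```python
import unicodedata

# Inject a lone surrogate directly, bypassing any codec.
injected = chr(0xDC80)

# Unicode classifies it as a surrogate (general category 'Cs').
category = unicodedata.category(injected)

# The same code point is what surrogateescape produces for byte 0x80.
smuggled = b'\x80'.decode('ascii', errors='surrogateescape')
indistinguishable = (injected == smuggled)

# So encoding the injected surrogate "recovers" a byte that was never
# in any input.
as_bytes = injected.encode('ascii', errors='surrogateescape')
```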