[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Fri Sep 12 17:37:59 CEST 2014

On September 11, 2014, Jeff Allen wrote:

> ... the area of code point
> space used for the smuggling of bytes under PEP-383 is not a 
> "Unicode Private Use Area", but a portion of the trailing surrogate 
> range. This is a code violation, which I imagine is why 
> "surrogateescape" is an error handler, not a codec.

True, but I believe that is a CPython implementation detail.

Other implementations (including Jython) should implement the
surrogateescape API, but I don't think it is important to use the
same internal representation for the invalid bytes.

(Well, unless you want to interoperate with external tools (GUIs?)
that work directly with that particular internal encoding,
effectively treating the data as bytes rather than strings, when
communicating with Python.)
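
To make the distinction concrete, here is a minimal sketch of what I
mean by "the surrogateescape API": the round trip is the part other
implementations would need to match, while the code point printed at
the end is just what CPython happens to show. The file name is made up
for illustration.

```python
# Bytes that are not valid UTF-8: 0xE9 is "e-acute" in Latin-1.
raw = b"caf\xe9.txt"

# Decoding with the surrogateescape handler smuggles the bad byte
# into the resulting str instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# How the smuggled byte is represented inside the str. On CPython this
# is a code point in the trailing surrogate range (U+DC80..U+DCFF).
smuggled = ord(text[3])
print(hex(smuggled))
```

Code that only relies on `round_tripped == raw` should not care how
the byte is stored in between.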

> lone surrogates preclude a naive use of the platform string library

Invalid input often causes problems.  Are you saying that there are
situations where the platform string library could easily handle
invalid characters in general, but has a problem with the specific
case of lone surrogates?
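
For comparison, the one difference I can show from the Python side is
below. It is only a sketch of how CPython's own strict UTF-8 codec
behaves, and says nothing about Java's string library; the helper name
is invented for the example.

```python
# A Private Use Area code point and a lone trailing surrogate.
private_use = "\ue000"
lone_surrogate = "\udce9"


def strict_utf8_ok(s):
    """Return True if s can be encoded as UTF-8 with the default
    (strict) error handler, False if the codec rejects it."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The private-use character encodes; the lone surrogate is rejected.
print(strict_utf8_ok(private_use), strict_utf8_ok(lone_surrogate))
```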

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ