[Python-ideas] Processing surrogates in

Fri May 15 14:21:29 CEST 2015

On 15 May 2015 at 02:02, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> (3) Problem: Code you can't or won't fix buggily passes you Unicode
>     that might have surrogates in it.
>     Solution: text-to-text codecs (but I don't see why they can't be
>     written as encode-decode chains).
>
> As I've written before, I think text-to-text codecs are an attractive
> nuisance.  The temptation to use them in most cases should be refused,
> because it's a better solution to deal with the problem at the
> incoming boundary or the outgoing boundary (using str<->bytes codecs).

One case I'd found a need for text->text handling (although not
related to surrogates) was taking arbitrary Unicode and applying an
error handler to it before writing it to a stream with "strict"
encoding. (So something like "arbitrary text".encode('latin1',
'errors='backslashescape').decode('latin1')).

The encode/decode pair seemed ugly, although it was the only way I
could find. I could easily imagine using a "rehandle" type of function
for this (although I wouldn't use the actual proposed functions here,
as the use of "surrogate" and "astral" in the names would lead me to
assume they were inappropriate).

Whether that's an argument for or against the idea that they are an
attractive nuisance, I'm not sure :-)

Paul