[Python-ideas] Processing surrogates in

Sat May 16 05:56:07 CEST 2015

Paul Moore writes:

 > One case I'd found a need for text->text handling (although not
 > related to surrogates) was taking arbitrary Unicode and applying an
 > error handler to it before writing it to a stream with "strict"
 > encoding. (So something like "arbitrary text".encode('latin1',
 > errors='backslashreplace').decode('latin1')).

That's not the use case envisioned for these functions, though.  You
want to change the textual content of the stream (by restricting the
repertoire), not change the representation of non-textual content.
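
A minimal sketch of that distinction, using only the standard codecs. The sample bytes and strings are illustrative assumptions, not taken from the proposal. The first half shows non-textual content (undecodable bytes) carried through a str as a lone surrogate; the second half shows real text being rewritten to fit a smaller repertoire.

```python
# Non-textual content: a byte that is not valid UTF-8 is carried through
# str as a lone surrogate by the "surrogateescape" error handler.
raw = b"caf\xe9"  # Latin-1 bytes, invalid as UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9"  # 0xE9 is represented as U+DCE9
# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Textual content: restricting the repertoire rewrites real characters.
s = "price: 10\u20ac"  # EURO SIGN is valid text, just not in Latin-1
restricted = s.encode("latin1", errors="backslashreplace").decode("latin1")
assert restricted == "price: 10\\u20ac"
```

In the first case the str is only a container for bytes that could not be decoded; in the second the text itself is changed.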

 > The encode/decode pair seemed ugly, although it was the only way I
 > could find.

I find the fact that there's an output stream with an inappropriate
error handler far uglier!

Note that the encode/decode pair is quite efficient, although the
"rehandle" function could be about twice as fast.  Still, if you're
output-bound by the speed of a disk or the like, encode/decode will
have no trouble keeping up.
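
For reference, the encode/decode pair can be wrapped up roughly as follows. This is only a sketch: the name rehandle_via_codec and its defaults are hypothetical, chosen for illustration, and are not the functions proposed in this thread.

```python
import io


def rehandle_via_codec(text, encoding="latin1", errors="backslashreplace"):
    """Hypothetical helper: apply an error handler to text by round-tripping
    it through a codec, so the result is safe for a strict stream."""
    return text.encode(encoding, errors=errors).decode(encoding)


# A text stream with a strict error handler, standing in for the output
# stream described in the quoted message.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="latin1", errors="strict")

# U+00EF fits in Latin-1 and passes through; U+2603 does not and is escaped.
stream.write(rehandle_via_codec("na\u00efve \u2603"))
stream.flush()
written = buffer.getvalue()
```

Without the round trip, the write of U+2603 to this stream would raise UnicodeEncodeError.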

 > I could easily imagine using a "rehandle" type of function for this
 > (although I wouldn't use the actual proposed functions here, as the
 > use of "surrogate" and "astral" in the names would lead me to
 > assume they were inappropriate).

AFAICT, you'd be right -- they don't (as proposed) handle your use
case of restricting to a Unicode subset.  Your kind of use case is why
I think general repertoire filtering functions in unicodedata (or a
new unicodetools package) would be a much better home for this
functionality.

 > Whether that's an argument for or against the idea that they are an
 > attractive nuisance, I'm not sure :-)

I think your use case is quite independent of that issue.