[Python-ideas] Processing surrogates in
Stephen J. Turnbull
stephen at xemacs.org
Sat May 16 05:56:07 CEST 2015
Paul Moore writes:
> One case I'd found a need for text->text handling (although not
> related to surrogates) was taking arbitrary Unicode and applying an
> error handler to it before writing it to a stream with "strict"
> encoding. (So something like "arbitrary text".encode('latin1',
> errors='backslashreplace').decode('latin1')).
That's not the use case envisioned for these functions, though. You
want to change the textual content of the stream (by restricting the
repertoire), not change the representation of non-textual content.
> The encode/decode pair seemed ugly, although it was the only way I
> could find.
I find the fact that there's an output stream with an inappropriate
error handler far uglier!
Note that the encode/decode pair is quite efficient, although the
"rehandle" function could be about twice as fast. Still, if you're
output-bound by the speed of a disk or the like, encode/decode will
have no trouble keeping up.
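
For concreteness, the encode/decode pair under discussion can be sketched as below. This is only an illustration: the helper name rehandle_via_codec is made up for this example, and it assumes the standard "backslashreplace" error handler is the one meant.

```python
def rehandle_via_codec(text, encoding="latin1", errors="backslashreplace"):
    """Restrict text to the repertoire of `encoding` (hypothetical helper).

    Characters the codec cannot encode are rewritten by the error
    handler; decoding with the same codec turns the bytes back into str.
    """
    return text.encode(encoding, errors=errors).decode(encoding)


# U+00E9 is in Latin-1 and survives; U+20AC and U+1F600 are not,
# so they come back as literal backslash escapes.
sample = "caf\u00e9 \u20ac \U0001f600"
restricted = rehandle_via_codec(sample)
print(restricted)

# The result can now be written to a stream with a strict handler.
restricted.encode("latin1", errors="strict")
```

The round trip makes two passes over the data, which is presumably where the "about twice as fast" estimate for a single-pass function comes from.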
> I could easily imagine using a "rehandle" type of function for this
> (although I wouldn't use the actual proposed functions here, as the
> use of "surrogate" and "astral" in the names would lead me to
> assume they were inappropriate).
AFAICT, you'd be right -- they don't (as proposed) handle your use
case of restricting to a Unicode subset. Your kind of use case is why
I think general repertoire filtering functions in unicodedata (or a
new unicodetools package) would be a much better home for this
functionality.
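
As a rough sketch of what such a repertoire filter might look like: nothing like this exists in unicodedata today, and the names restrict_repertoire and in_latin1 as well as the signature are purely hypothetical.

```python
def restrict_repertoire(text, is_allowed, replacement="\ufffd"):
    """Replace each character rejected by is_allowed (hypothetical API).

    is_allowed is any predicate on a single character, so the caller
    decides what the permitted repertoire is.
    """
    return "".join(ch if is_allowed(ch) else replacement for ch in text)


def in_latin1(ch):
    """True if the character is in the Latin-1 range (U+0000..U+00FF)."""
    return ord(ch) < 0x100


print(restrict_repertoire("caf\u00e9 \u20ac", in_latin1, "?"))
```

Unlike the codec round trip, this works on text directly and is not tied to the repertoire of any particular encoding.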
> Whether that's an argument for or against the idea that they are an
> attractive nuisance, I'm not sure :-)
I think your use case is quite independent of that issue.