[Python-ideas] Processing surrogates in

Wed May 13 18:22:41 CEST 2015

(Note: I've posted to the issue suggesting we defer further
consideration to 3.6, as well as suggesting a new "string.internals"
submodule as a possible home for them, but I'm following up here to
capture my current thinking on the topic)

On 7 May 2015 4:47 pm, "Nick Coghlan" <ncoghlan at gmail.com> wrote:
> Regardless of which specific approach you take, handling surrogates
> explicitly when a string is passed to you from an API that uses
> permissive decoding lets you avoid both unexpected UnicodeEncodeError
> exceptions (if the surrogates end up being encoded with an error
> handler other than surrogatepass or surrogateescape) or propagating
> mojibake (if the surrogates are encoded with a suitable error handler,
> but an encoding that differs from the original).
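
To make the two failure modes in that paragraph concrete, here is a
minimal sketch using only existing codec behaviour. The sample bytes
and variable names are assumptions chosen for illustration; they don't
come from the issue.

```python
# Assumed sample data: Latin-1 bytes for "cafe" with an accented e,
# which are not valid UTF-8.
raw = b'caf\xe9'

# Permissive decoding smuggles the undecodable byte through as U+DCE9.
text = raw.decode('utf-8', errors='surrogateescape')

# Failure mode 1: a later strict encode raises UnicodeEncodeError.
try:
    text.encode('utf-8')
except UnicodeEncodeError as exc:
    strict_failure = exc
else:
    strict_failure = None

# Failure mode 2: encoding with surrogateescape succeeds, but the output
# is the original Latin-1 byte sequence sitting in a supposedly UTF-8
# stream (mojibake for any consumer that trusts the declared encoding).
smuggled = text.encode('utf-8', errors='surrogateescape')
```

Scrubbing the string right after the permissive decode is what lets a
program pick one of these outcomes deliberately rather than hitting it
later by accident.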

Considering this rationale further, the key purpose of the proposed
new surrogate handling functions is to take an input string that may
contain surrogate code points, and produce one that is guaranteed
*not* to contain such surrogates (either because they've been removed
or replaced, or because an exception will be thrown if there are any
present in the input). They're designed to let a developer either make
a program eagerly detect improperly decoded data, or else convert the
surrogates to an encodable form (potentially losing data in the
process).

Three expected sources of surrogates have been identified:

* escaped surrogates smuggling arbitrary bytes passed through decoding
by the "surrogateescape" error handler
* surrogates passed through the decoding process by the
"surrogatepass" error handler
* decomposed surrogate pairs for astral characters
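
For reference, each of those three sources can be reproduced with
current Python. This is only an illustrative sketch: the sample inputs
and the has_surrogates() helper are assumptions, not part of the
proposal.

```python
def has_surrogates(s):
    """Hypothetical helper: True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

# 1. An arbitrary byte smuggled through decoding by "surrogateescape":
#    byte 0xFF becomes the lone low surrogate U+DCFF.
escaped = b'\xff'.decode('ascii', errors='surrogateescape')

# 2. A surrogate passed through decoding by "surrogatepass":
#    ED A0 80 is the UTF-8-style byte sequence for U+D800.
passed = b'\xed\xa0\x80'.decode('utf-8', errors='surrogatepass')

# 3. A decomposed surrogate pair for an astral character (U+1F600),
#    written here as two separate code points.
decomposed = '\ud83d\ude00'
```

All three strings contain surrogates, but only the third one carries a
valid pair that maps back to a single astral code point.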

The various reasonable "data scrubbing" techniques that have been proposed are:

1. compose surrogate pairs to the corresponding astral code point
2. throw an error for any surrogates found
3. delete any surrogates found
4. replace any surrogates found with the Unicode replacement character
5. replace any surrogates found with their corresponding backslash
escaped sequence
6. as with the preceding, but only for surrogate escaped data, not
arbitrary surrogates

The first of those is handled by the suggested
"compose_surrogate_pairs()", which will convert valid pairs to their
corresponding astral code points.
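
A rough pure-Python sketch of what such a function could do is shown
below. The name comes from the suggestion above, but the signature and
the regex-based approach are assumptions made for illustration, not the
implementation from the issue.

```python
import re

# A high surrogate (U+D800..U+DBFF) immediately followed by a low
# surrogate (U+DC00..U+DFFF) forms a valid pair.
_SURROGATE_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def compose_surrogate_pairs(s):
    """Sketch: replace each valid surrogate pair with its astral code point.

    Lone surrogates are left untouched.
    """
    def _combine(match):
        high = ord(match.group(1)) - 0xD800
        low = ord(match.group(2)) - 0xDC00
        return chr(0x10000 + (high << 10) + low)
    return _SURROGATE_PAIR.sub(_combine, s)
```

Because lone surrogates are preserved, the output of this step alone is
not yet guaranteed to be surrogate free; it would typically be followed
by one of the other scrubbing techniques.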

Techniques 2-5 are handled by rehandle_surrogatepass(), with the
corresponding decoding error handler (strict, ignore, replace,
backslashreplace).

Technique 6 is handled by rehandle_surrogateescape(), again with the
corresponding error handlers.
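
As a sketch of how those two functions might behave, assuming a
signature of (string, errors) and a plain UnicodeError for the strict
case (both of which are guesses, since no signature is given here):

```python
import re

_ANY_SURROGATE = re.compile('[\ud800-\udfff]')
# surrogateescape maps undecodable bytes 0x80..0xFF to U+DC80..U+DCFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')

def _rehandle(pattern, s, errors):
    # Shared helper (hypothetical): apply the named handler to every match.
    if errors == 'strict':
        match = pattern.search(s)
        if match is not None:
            raise UnicodeError('surrogate %r at position %d'
                               % (match.group(), match.start()))
        return s
    if errors == 'ignore':
        return pattern.sub('', s)
    if errors == 'replace':
        return pattern.sub('\ufffd', s)
    if errors == 'backslashreplace':
        return pattern.sub(lambda m: '\\u%04x' % ord(m.group()), s)
    raise LookupError('unsupported error handler: %r' % (errors,))

def rehandle_surrogatepass(s, errors='strict'):
    """Sketch: apply the handler to any surrogate code point."""
    return _rehandle(_ANY_SURROGATE, s, errors)

def rehandle_surrogateescape(s, errors='strict'):
    """Sketch: apply the handler only to surrogate escaped bytes."""
    return _rehandle(_ESCAPED_BYTE, s, errors)
```

With this sketch, rehandle_surrogateescape() leaves surrogates outside
U+DC80..U+DCFF in place, so only rehandle_surrogatepass() gives the
"no surrogates in the output" guarantee for arbitrary input.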

A potential downside of this approach of exposing the error handlers
directly as part of the data scrubbing API is that passing in
"surrogateescape" or "surrogatepass" as the error handler may break
the assurance that the output doesn't contain any surrogates (this
could be avoided if those two error handlers don't support str->str
conversions).
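
One way to picture that restriction is a small guard in front of the
scrubbing functions. The helper name and the choice of ValueError are
assumptions for illustration only.

```python
# Handler names that would let surrogates survive a str->str conversion.
_SURROGATE_PRESERVING = frozenset({'surrogateescape', 'surrogatepass'})

def check_scrubbing_handler(errors):
    """Hypothetical guard: return errors unchanged, or reject handlers
    that could leave surrogates in the scrubbed output."""
    if errors in _SURROGATE_PRESERVING:
        raise ValueError(
            '%r cannot be used for scrubbing: it may preserve surrogates'
            % (errors,))
    return errors
```

A scrubbing function could call this on its errors argument before
doing any work, so the guarantee about the output holds for every
handler it accepts.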

Anyway, I think we can readily put this question aside for now, and
revisit it again for 3.6 after folks have a chance to get more
experience with some of the other bytes/text handling changes in 3.5.
I created a tracking issue (http://bugs.python.org/issue22555) for
those a while back, and just did a pass through them the other day to
see if there were any I particularly wanted to see make it into 3.5
(all the still open ones ended up in the "wait for other developments
before pursuing further" category).

Cheers,
Nick.