[Python-ideas] Processing surrogates in

Tue May 5 19:28:46 CEST 2015

On Mon, May 04, 2015 at 11:15:47AM +0300, Serhiy Storchaka wrote:
> Surrogate characters (U+D800-U+DFFF) are not allowed in Unicode, but 
> Python allows them in Unicode strings for different purposes.
> 
> 1) To represent UTF-8, UTF-16 or UTF-32 encoded strings that contain 
> surrogate characters. This data can came from other programs, including 
> Python 2.

Can you give a simple example of a Python 2 program that provides output 
that Python 3 will read as surrogates?

> 2) To represent undecodable bytes in ASCII-compatible encoding with the 
> "surrogateescape" error handlers.
> 
> So surrogate characters can be obtained from "surrogateescape" or 
> "surrogatepass" error handlers or created manually with chr() or %c. 
>
> Some encodings (UTF-7, unicode-escape) also allows surrogate characters.

Also UTF-16, and possibly others. 

I'm not entirely sure, but I think that this is a mistake, if not a 
bug. I think that *no* UTF encoding should allow lone surrogates to 
escape through encoding. But I'm not entirely sure, so I won't argue that 
now -- besides, it's irrelevant to the proposal.

> But on output the surrogate characters can cause fail.

What do you mean by "on output"? Do you mean when printing?
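If you mean the encoding step, here is a minimal sketch of how I 
understand the failure. The function names and the sample bytes are 
made up for illustration; only the "surrogateescape" handler and the 
strict UTF-8 codec come from the stdlib.

```python
def decode_with_escape(raw):
    # Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def strict_encode_fails(text):
    # Strict UTF-8 encoding refuses lone surrogates.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


raw = b"caf\xff"                  # hypothetical input with one invalid byte
text = decode_with_escape(raw)    # 'caf\udcff'
failed = strict_encode_fails(text)
# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

So the string is fine while it stays inside Python, and only a strict 
encode (which is what printing usually ends up doing) trips over it.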

> In issue18814 proposed several functions to work with surrogate and 
> astral characters. All these functions takes a string and returns a string.

I like the idea of having better surrogate and astral character 
handling, but I don't think I like your suggested API of using functions 
for this. I think this is better handled as str-to-str codecs.

Unfortunately, there is still no consensus on the much-debated return of 
str-to-str and bytes-to-bytes codecs via the str.encode and bytes.decode 
methods. At one point people were talking about adding a separate method 
(transform?) to handle them, but that seems to have been forgotten. 
Fortunately the codecs module handles them just fine:

py> import codecs
py> codecs.encode("Hello world", "rot-13")
'Uryyb jbeyq'

I propose, instead of your function/method rehandle_surrogatepass(), we 
add a pair of str-to-str codecs:

codecs.encode(mystring, 'remove_surrogates', errors='strict')
codecs.encode(mystring, 'remove_astrals', errors='strict')

For the first one, if the string has no surrogates, it returns the 
string unchanged. If it contains any surrogates, the error handler runs 
in the usual fashion.

The second is exactly the same, except it checks for astral characters.

For the avoidance of doubt:

* surrogates are code points in the range U+D800 to U+DFFF inclusive;

* astrals are characters from the supplementary planes (everything 
  outside the Basic Multilingual Plane), that is, code points U+10000 
  and above.
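To make the idea concrete, here is a rough sketch of how such codecs 
could be prototyped today in pure Python. Nothing here exists in the 
stdlib: the codec names 'remove_surrogates' and 'remove_astrals' are 
the ones proposed above, and the helpers make_encoder() and 
_search() are hypothetical. It only supports error handlers that 
return a str replacement (strict, ignore, replace, backslashreplace 
and friends).

```python
import codecs


def is_surrogate(ch):
    return 0xD800 <= ord(ch) <= 0xDFFF


def is_astral(ch):
    return ord(ch) >= 0x10000


def make_encoder(name, is_bad):
    """Build a str-to-str encoder that hands 'bad' runs to the error handler."""
    def encode(text, errors="strict"):
        handler = codecs.lookup_error(errors)
        out = []
        pos = 0
        n = len(text)
        while pos < n:
            if not is_bad(text[pos]):
                out.append(text[pos])
                pos += 1
                continue
            # Find the end of the run of offending characters.
            end = pos + 1
            while end < n and is_bad(text[end]):
                end += 1
            exc = UnicodeEncodeError(name, text, pos, end,
                                     "character not allowed")
            # 'strict' raises here; other handlers return a replacement.
            replacement, _ = handler(exc)
            if not isinstance(replacement, str):
                raise TypeError("this sketch needs a str replacement")
            out.append(replacement)
            # Simplification: always resume right after the bad run,
            # ignoring the resume position returned by the handler.
            pos = end
        return "".join(out), n
    return encode


_ENCODERS = {
    "remove_surrogates": make_encoder("remove_surrogates", is_surrogate),
    "remove_astrals": make_encoder("remove_astrals", is_astral),
}


def _search(name):
    # Codec search function: return None for names we do not handle.
    encode = _ENCODERS.get(name)
    if encode is None:
        return None
    return codecs.CodecInfo(encode=encode, decode=None, name=name)


codecs.register(_search)
```

With that registered, codecs.encode(mystring, 'remove_surrogates', 
errors='replace') turns each surrogate into '?', errors='ignore' drops 
them, and the default errors='strict' raises UnicodeEncodeError.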

Advantages of using codecs:

- there are no arguments about where to put it (is it a str method? a 
  function? in the string module? some other module? where?)

- we can use the usual codec machinery, rather than duplicate it;

- people already understand that codecs and error handlers go together.

Disadvantage:

- have to use codecs.encode instead of str.encode.

It is slightly sad that there is still no entirely obvious way to call 
str-to-str codecs from the encode method, but since this is a fairly 
advanced and unusual use-case, I don't think it is a problem that we 
have to use the codecs module.
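For anyone who hasn't run into this, a small sketch of the difference, 
assuming Python 3.4 or later. The helper try_str_encode() is made up 
for the example; rot-13 is the str-to-str codec already in the stdlib.

```python
import codecs


def try_str_encode(text, codec):
    # str.encode only accepts text encodings (str -> bytes); asking it
    # for a str-to-str codec such as rot-13 raises LookupError.
    try:
        return text.encode(codec)
    except LookupError as exc:
        return exc


via_method = try_str_encode("Hello world", "rot-13")
via_module = codecs.encode("Hello world", "rot-13")
```

Here via_method ends up holding the LookupError, while via_module is 
the rot-13 string.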

> * decompose_astrals(string)
> * compose_surrogate_pairs(string)

I'm not sure about those. I have to think about them.

-- 
Steve