[Python-ideas] Processing surrogates in

Fri May 8 13:18:07 CEST 2015

On 05.05.15 20:28, Steven D'Aprano wrote:
> On Mon, May 04, 2015 at 11:15:47AM +0300, Serhiy Storchaka wrote:
>> Surrogate characters (U+D800-U+DFFF) are not allowed in Unicode, but
>> Python allows them in Unicode strings for different purposes.
>>
>> 1) To represent UTF-8, UTF-16 or UTF-32 encoded strings that contain
>> surrogate characters. This data can come from other programs, including
>> Python 2.
>
> Can you give a simple example of a Python 2 program that provides output
> that Python 3 will read as surrogates?

# Python 2, narrow build: u'𝄞'[:1] is a lone high surrogate
f.write(u'𝄞'[:1].encode('utf-8'))
json.dump(u'𝄞'[:1], f)
pickle.dump(u'𝄞'[:1], f)
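
For illustration, a minimal Python 3 sketch of reading such data back.
It assumes the Python 2 json.dump call wrote its default ASCII-escaped
form; the literal below stands in for the file contents, and the
variable names are only illustrative.

```python
import json

# What Python 2's json.dump is assumed to have written for the lone
# high surrogate: a JSON string containing the escape \ud834.
py2_json_output = '"\\ud834"'

# Python 3's json module accepts the unpaired escape and returns a
# str that contains the surrogate code point itself.
value = json.loads(py2_json_output)

is_lone_surrogate = len(value) == 1 and '\ud800' <= value <= '\udfff'
```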

>> 2) To represent undecodable bytes in an ASCII-compatible encoding with
>> the "surrogateescape" error handler.
>>
>> So surrogate characters can be obtained from "surrogateescape" or
>> "surrogatepass" error handlers or created manually with chr() or %c.
>>
>> Some encodings (UTF-7, unicode-escape) also allow surrogate characters.
>
> Also UTF-16, and possibly others.
>
> I'm not entirely sure, but I think that this is a mistake, if not a
> bug. I think that *no* UTF encoding should allow lone surrogates to
> escape through encoding. But I'm not entirely sure, so I won't argue that
> now -- besides, it's irrelevant to the proposal.

UTF-7 is specified by RFC 2152 and should encode any UCS-2 character. 
unicode-escape and raw-unicode-escape should encode any Python string. 
This can't be changed.
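
A small sketch of that behaviour in Python 3, using only the built-in
codecs named above; the variable names are illustrative. All three
codecs encode a lone surrogate without an error handler and decode it
back.

```python
lone = '\ud834'  # a lone high surrogate

# The escape codecs spell the code point out as an ASCII escape.
escaped = lone.encode('unicode-escape')
raw_escaped = lone.encode('raw-unicode-escape')

# UTF-7 encodes the UTF-16 code unit in its modified base64 form.
utf7 = lone.encode('utf-7')

# Each encoded form decodes back to the original string.
restored = [
    escaped.decode('unicode-escape'),
    raw_escaped.decode('raw-unicode-escape'),
    utf7.decode('utf-7'),
]
```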

UTF-8, UTF-16, and UTF-32 don't encode surrogates by default in current 
Python 3, but encode surrogates in Python 2. The "surrogatepass" error 
handler was added for compatibility with Python 2.
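
As a rough illustration in Python 3 (variable names are illustrative),
the default strict handler rejects a lone surrogate, while
"surrogatepass" produces the bytes Python 2 would have written and reads
them back:

```python
lone = '\ud834'  # a lone high surrogate

# Default (strict) error handling refuses to encode it.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes the surrogate as an ordinary three-byte
# sequence, and accepts that sequence again when decoding.
raw = lone.encode('utf-8', 'surrogatepass')
round_trip = raw.decode('utf-8', 'surrogatepass')
```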

>> But on output the surrogate characters can cause failures.
>
> What do you mean by "on output"? Do you mean when printing?

Printing, writing to a text file, passing to a C extension that encodes
internally, etc.
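
A minimal sketch of the text-file case in Python 3, using an in-memory
io.BytesIO in place of a real file. The choice of "backslashreplace" for
the second stream is only one possible workaround, picked here for
illustration.

```python
import io

text = 'clef: \ud834'  # contains a lone surrogate

# A text stream with the default strict error handler fails on output.
strict_buf = io.BytesIO()
strict_out = io.TextIOWrapper(strict_buf, encoding='utf-8')
try:
    strict_out.write(text)
    strict_out.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# With an explicit error handler the same write succeeds; here the
# surrogate is written as an ASCII escape sequence.
lenient_buf = io.BytesIO()
lenient_out = io.TextIOWrapper(lenient_buf, encoding='utf-8',
                               errors='backslashreplace')
lenient_out.write(text)
lenient_out.flush()
written = lenient_buf.getvalue()
```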

>> Issue18814 proposes several functions to work with surrogate and
>> astral characters. All these functions take a string and return a string.
>
> I like the idea of having better surrogate and astral character
> handling, but I don't think I like your suggested API of using functions
> for this. I think this is better handled as str-to-str codecs.
>
> Unfortunately, there is still no consensus on the much-debated return of
> str-to-str and bytes-to-bytes codecs via the str.encode and bytes.decode
> methods. At one point people were talking about adding a separate method
> (transform?) to handle them, but that seems to have been forgotten.
> Fortunately the codecs module handles them just fine:
>
> py> codecs.encode("Hello world", "rot-13")
> 'Uryyb jbeyq'
>
>
> I propose, instead of your function/method rehandle_surrogatepass(), we
> add a pair of str-to-str codecs:
>
> codecs.encode(mystring, 'remove_surrogates', errors='strict')
> codecs.encode(mystring, 'remove_astrals', errors='strict')
>
> For the first one, if the string has no surrogates, it returns the
> string unchanged. If it contains any surrogates, the error handler runs
> in the usual fashion.
>
> The second is exactly the same, except it checks for astral characters.
>
> For the avoidance of doubt:
>
> * surrogates are code points in the range U+D800 to U+DFFF inclusive;
>
> * astrals are characters from the supplementary planes,
>    that is code points U+10000 and above.
>
>
> Advantage of using codecs:
>
> - there are no arguments about where to put it (is it a str method? a
>    function? in the string module? some other module? where?)
>
> - we can use the usual codec machinery, rather than duplicate it;
>
> - people already understand that codecs and error handlers go together;
>
> Disadvantage:
>
> - have to use codecs.encode instead of str.encode.
>
>
> It is slightly sad that there is still no entirely obvious way to call
> str-to-str codecs from the encode method, but since this is a fairly
> advanced and unusual use-case, I don't think it is a problem that we
> have to use the codecs module.

The disadvantage of using codecs is that the "decoding" operation doesn't
make sense. If we use one global registry for named transformations, it
should be a separate registry with a separate method (str.transform) for
one-way str-to-str transformations. In addition to the surrogate
transformations above, it could contain transformations such as "upper",
"lower", and "title". But this is a separate issue.
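
A purely hypothetical sketch of what such a separate registry might look
like. Nothing here exists in Python: register_transformation, transform,
and the "replace_surrogates" name are invented for illustration, and a
plain function stands in for the str.transform method.

```python
# Hypothetical registry of one-way str-to-str transformations,
# kept separate from the codecs registry.
_transformations = {}

def register_transformation(name, func):
    """Register func (str -> str) under the given name."""
    _transformations[name] = func

def transform(text, name):
    """Stand-in for a hypothetical str.transform(name) method."""
    try:
        func = _transformations[name]
    except KeyError:
        raise LookupError('unknown transformation: %s' % name) from None
    return func(text)

def _replace_surrogates(text):
    # Replace every surrogate code point with U+FFFD.
    return ''.join('\ufffd' if '\ud800' <= ch <= '\udfff' else ch
                   for ch in text)

register_transformation('upper', str.upper)
register_transformation('lower', str.lower)
register_transformation('title', str.title)
register_transformation('replace_surrogates', _replace_surrogates)

shouted = transform('hello world', 'upper')
cleaned = transform('a\ud834b', 'replace_surrogates')
```

Because the registry only maps names to one-way callables, there is no
"decode" direction to define.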