[Python-ideas] Processing surrogates in

Mon May 4 20:18:32 CEST 2015

On 2015-05-04 09:15, Serhiy Storchaka wrote:
> Surrogate characters (U+D800-U+DFFF) are not allowed in well-formed
> Unicode text, but Python allows them in Unicode strings for several
> purposes.
>
> 1) To represent UTF-8, UTF-16 or UTF-32 encoded strings that contain
> surrogate characters. This data can come from other programs, including
> Python 2.
>
> 2) To represent undecodable bytes in an ASCII-compatible encoding with
> the "surrogateescape" error handler.
>
> So surrogate characters can be obtained from "surrogateescape" or
> "surrogatepass" error handlers or created manually with chr() or %c.
> Some encodings (UTF-7, unicode-escape) also allow surrogate characters.
>
> But on output, surrogate characters can cause failures.
>
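For anyone following along, here is a small illustration of the three
sources mentioned above and of the output failure. It uses only stock
codec behaviour; the variable names are mine.

```python
# Three ways a lone surrogate can end up in a str.
escaped = b"caf\xe9".decode("utf-8", "surrogateescape")    # undecodable byte 0xE9
passed = b"\xed\xa0\x80".decode("utf-8", "surrogatepass")  # UTF-8 form of U+D800
manual = chr(0xDCBA)                                       # created by hand

# The escaped byte round-trips as long as the same handler is used.
round_trip = escaped.encode("utf-8", "surrogateescape")

# A strict encoder, which is what most output paths use, rejects it.
try:
    escaped.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```
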
> Issue18814 proposes several functions for working with surrogate and
> astral characters. All of these functions take a string and return a
> string.
>
> * rehandle_surrogatepass(string, errors)
>
> Handles surrogate characters (U+D800-U+DFFF) with the specified error
> handler. E.g.
>
>     rehandle_surrogatepass('ä\udcba', 'strict') -> error
>     rehandle_surrogatepass('ä\udcba', 'ignore') -> 'ä'
>     rehandle_surrogatepass('ä\udcba', 'replace') -> 'ä\ufffd'
>     rehandle_surrogatepass('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
>
> * rehandle_surrogateescape(string, errors)
>
> Handles non-ASCII bytes encoded as surrogate characters in the range
> U+DC80-U+DCFF with the specified error handler. Surrogate characters
> outside the range U+DC80-U+DCFF cause an error. E.g.
>
>     rehandle_surrogateescape('ä\udcba', 'strict') -> error
>     rehandle_surrogateescape('ä\udcba', 'ignore') -> 'ä'
>     rehandle_surrogateescape('ä\udcba', 'replace') -> 'ä\ufffd'
>     rehandle_surrogateescape('ä\udcba', 'backslashreplace') -> 'ä\\xba'
>
It looks like the first 3 are the same as rehandle_surrogatepass, so
couldn't they be merged somehow?

     handle_surrogates('ä\udcba', 'strict') -> error
     handle_surrogates('ä\udcba', 'ignore') -> 'ä'
     handle_surrogates('ä\udcba', 'replace') -> 'ä\ufffd'
     handle_surrogates('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
     handle_surrogates('ä\udcba', 'surrogatereplace') -> 'ä\\xba'
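
To make the merged idea concrete, here is a rough pure-Python sketch.
It is not the patch from the issue. "handle_surrogates" and the
"surrogatereplace" handler name are just the hypothetical names used
above, and what "surrogatereplace" should do with a surrogate outside
U+DC80-U+DCFF is not settled, so falling back to \uXXXX there is purely
my assumption.

```python
import re

# Matches any single surrogate code point, paired or not.
_SURROGATE = re.compile("[\ud800-\udfff]")

def handle_surrogates(string, errors="strict"):
    """Hypothetical merged function; handler names are illustrative."""
    def repl(match):
        code = ord(match.group())
        if errors == "strict":
            raise UnicodeTranslateError(
                string, match.start(), match.end(), "surrogates not allowed")
        if errors == "ignore":
            return ""
        if errors == "replace":
            return "\ufffd"
        if errors == "backslashreplace":
            return "\\u%04x" % code
        if errors == "surrogatereplace":
            if 0xDC80 <= code <= 0xDCFF:
                # Show the original byte that surrogateescape smuggled in.
                return "\\x%02x" % (code - 0xDC00)
            # Assumption: other surrogates fall back to the \uXXXX form.
            return "\\u%04x" % code
        raise LookupError("unknown error handler name %r" % errors)
    return _SURROGATE.sub(repl, string)
```

With this, the only difference between the two original functions is
which handler name the caller passes.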

> * handle_astrals(string, errors)
>
> Handles non-BMP characters (U+10000-U+10FFFF) with the specified error
> handler. E.g.
>
>     handle_astrals('ä\U00012345', 'strict') -> error
>     handle_astrals('ä\U00012345', 'ignore') -> 'ä'
>     handle_astrals('ä\U00012345', 'replace') -> 'ä\ufffd'
>     handle_astrals('ä\U00012345', 'backslashreplace') -> 'ä\\U00012345'
>
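One way to get the four results above without writing the handlers by
hand is to build a UnicodeTranslateError for each astral character and
pass it to the registered error handler. This is only a sketch of that
approach, not necessarily how the patch does it. It relies on the
built-in handlers accepting UnicodeTranslateError, which for
"backslashreplace" needs Python 3.5 or later, and it assumes the handler
resumes right after the character.

```python
import codecs
import re

# Matches any single code point outside the Basic Multilingual Plane.
_ASTRAL = re.compile("[\U00010000-\U0010ffff]")

def handle_astrals(string, errors="strict"):
    handler = codecs.lookup_error(errors)

    def repl(match):
        exc = UnicodeTranslateError(
            string, match.start(), match.end(), "non-BMP character")
        # A handler returns (replacement, resume_position); the resume
        # position is ignored here (see the assumption in the lead-in).
        replacement, _resume = handler(exc)
        return replacement

    return _ASTRAL.sub(repl, string)
```
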
> * decompose_astrals(string)
>
> Converts non-BMP characters (U+10000-U+10FFFF) to surrogate pairs. E.g.
>
>     decompose_astrals('ä\U00012345') -> 'ä\ud808\udf45'
>
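The conversion is the usual UTF-16 arithmetic. A minimal sketch, which
assumes BMP characters and any existing surrogates are passed through
unchanged:

```python
def decompose_astrals(string):
    out = []
    for ch in string:
        code = ord(ch)
        if code >= 0x10000:
            code -= 0x10000
            out.append(chr(0xD800 + (code >> 10)))    # high (lead) surrogate
            out.append(chr(0xDC00 + (code & 0x3FF)))  # low (trail) surrogate
        else:
            out.append(ch)
    return "".join(out)
```
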
> * compose_surrogate_pairs(string)
>
> Converts surrogate pairs to non-BMP characters. E.g.
>
>     compose_surrogate_pairs('ä\ud808\udf45') -> 'ä\U00012345'
>
Perhaps this should be called "compose_astrals".
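
Whatever the name, the inverse operation might look like the sketch
below. Leaving lone (unpaired) surrogates untouched is my assumption;
the proposal doesn't say what should happen to them.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")

def compose_surrogate_pairs(string):
    def repl(match):
        high, low = match.group()
        return chr(0x10000
                   + ((ord(high) - 0xD800) << 10)
                   + (ord(low) - 0xDC00))
    # Assumption: unpaired surrogates do not match and are left as-is.
    return _PAIR.sub(repl, string)
```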

> Function names are preliminary and open to discussion! The location
> (currently the codecs module) and the interface are open to discussion
> too.
>
> These functions revive UnicodeTranslateError, which is currently unused
> (although several error handlers already handle it).
>
> The proposed patch provides a Python implementation in the codecs
> module, but after discussion I'll provide a much more efficient (O(1)
> in the best case) C implementation.
>