[Python-ideas] Processing surrogates in

Serhiy Storchaka storchaka at gmail.com
Mon May 4 21:12:34 CEST 2015


On 04.05.15 21:18, MRAB wrote:
> On 2015-05-04 09:15, Serhiy Storchaka wrote:
>> * rehandle_surrogatepass(string, errors)
>>
>> Handles surrogate characters (U+D800-U+DFFF) with specified error
>> handler. E.g.
>>
>>     rehandle_surrogatepass('ä\udcba', 'strict') -> error
>>     rehandle_surrogatepass('ä\udcba', 'ignore') -> 'ä'
>>     rehandle_surrogatepass('ä\udcba', 'replace') -> 'ä\ufffd'
>>     rehandle_surrogatepass('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
>>
>> * rehandle_surrogateescape(string, errors)
>>
>> Handles non-ASCII bytes encoded with surrogate characters in range
>> U+DC80-U+DCFF with specified error handler. Surrogate characters outside
>> of range U+DC80-U+DCFF cause error. E.g.
>>
>>     rehandle_surrogateescape('ä\udcba', 'strict') -> error
>>     rehandle_surrogateescape('ä\udcba', 'ignore') -> 'ä'
>>     rehandle_surrogateescape('ä\udcba', 'replace') -> 'ä\ufffd'
>>     rehandle_surrogateescape('ä\udcba', 'backslashreplace') -> 'ä\\xba'
>>
> It looks like the first 3 are the same as rehandle_surrogatepass, so
> couldn't they be merged somehow?
>
>      handle_surrogates('ä\udcba', 'strict') -> error
>      handle_surrogates('ä\udcba', 'ignore') -> 'ä'
>      handle_surrogates('ä\udcba', 'replace') -> 'ä\ufffd'
>      handle_surrogates('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
>      handle_surrogates('ä\udcba', 'surrogatereplace') -> 'ä\\xba'

These functions work with arbitrary error handlers that support
UnicodeTranslateError (for rehandle_surrogatepass) or UnicodeDecodeError
(for rehandle_surrogateescape). The two functions behave differently for
surrogate characters outside the range U+DC80-U+DCFF.
handle_surrogates() would need a new error handler, "surrogatereplace".
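
To make the difference concrete, here is a rough pure-Python sketch of how
the two functions could be built on codecs.lookup_error(). It is an
illustration only, not a proposed implementation. The assumptions are: each
surrogate is handled one character at a time, the resume position returned
by the handler is ignored, 'ascii' is used as a placeholder encoding name in
the UnicodeDecodeError, and a UnicodeTranslateError is raised for surrogates
outside U+DC80-U+DCFF (the proposal only says this causes an error).

```python
import codecs


def rehandle_surrogatepass(string, errors='strict'):
    """Pass every surrogate (U+D800-U+DFFF) to the given error handler."""
    handler = codecs.lookup_error(errors)
    out = []
    for i, ch in enumerate(string):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            # The handler sees the surrogate as an untranslatable character.
            exc = UnicodeTranslateError(string, i, i + 1,
                                        'surrogates not allowed')
            replacement, _ = handler(exc)  # resume position ignored (sketch)
            out.append(replacement)
        else:
            out.append(ch)
    return ''.join(out)


def rehandle_surrogateescape(string, errors='strict'):
    """Pass every escaped byte (U+DC80-U+DCFF) to the given error handler."""
    handler = codecs.lookup_error(errors)
    out = []
    for i, ch in enumerate(string):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte and report it as undecodable.
            # 'ascii' is a placeholder; the real source encoding is unknown.
            exc = UnicodeDecodeError('ascii', bytes([cp - 0xDC00]), 0, 1,
                                     'ordinal not in range(128)')
            replacement, _ = handler(exc)  # resume position ignored (sketch)
            out.append(replacement)
        elif 0xD800 <= cp <= 0xDFFF:
            # Assumption: any other surrogate is an unconditional error.
            raise UnicodeTranslateError(string, i, i + 1,
                                        'surrogate outside U+DC80-U+DCFF')
        else:
            out.append(ch)
    return ''.join(out)
```

With 'backslashreplace' this needs Python 3.5 or later, where that handler
also accepts decoding and translating errors. The handler receives a
different exception type in each function, which is why the same surrogate
is rendered as \udcba in one case and as \xba in the other.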

>> * compose_surrogate_pairs(string)
>>
>> Converts surrogate pairs to non-BMP characters. E.g.
>>
>>     compose_surrogate_pairs('ä\ud808\udf45') -> 'ä\U00012345'
>>
> Perhaps this should be called "compose_astrals".

Maybe. Or "compose_non_bmp". I have no preference and opened this
topic mainly to bikeshed the names.
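
Whatever the name ends up being, the behaviour can be approximated today
with a UTF-16 round trip. The following is only a sketch under that
assumption, not necessarily how the function would be implemented: encoding
with 'surrogatepass' writes each surrogate as a plain 16-bit code unit, and
decoding then joins adjacent high/low units into one non-BMP character.

```python
def compose_surrogate_pairs(string):
    """Join surrogate pairs into non-BMP characters via a UTF-16 round trip."""
    # 'surrogatepass' lets surrogates through on encoding and lets unpaired
    # surrogates through on decoding; properly paired code units are decoded
    # as a single character. 'utf-16-le' is used so no BOM is involved.
    data = string.encode('utf-16-le', 'surrogatepass')
    return data.decode('utf-16-le', 'surrogatepass')
```

For the example above, the pair \ud808\udf45 becomes the single character
\U00012345, and a string without surrogates comes back unchanged.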



