[Python-ideas] Processing surrogates in
Serhiy Storchaka
storchaka at gmail.com
Mon May 4 21:12:34 CEST 2015
On 04.05.15 21:18, MRAB wrote:
> On 2015-05-04 09:15, Serhiy Storchaka wrote:
>> * rehandle_surrogatepass(string, errors)
>>
>> Handles surrogate characters (U+D800-U+DFFF) with specified error
>> handler. E.g.
>>
>> rehandle_surrogatepass('ä\udcba', 'strict') -> error
>> rehandle_surrogatepass('ä\udcba', 'ignore') -> 'ä'
>> rehandle_surrogatepass('ä\udcba', 'replace') -> 'ä\ufffd'
>> rehandle_surrogatepass('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
>>
>> * rehandle_surrogateescape(string, errors)
>>
>> Handles non-ASCII bytes encoded with surrogate characters in range
>> U+DC80-U+DCFF with specified error handler. Surrogate characters outside
>> of range U+DC80-U+DCFF cause error. E.g.
>>
>> rehandle_surrogateescape('ä\udcba', 'strict') -> error
>> rehandle_surrogateescape('ä\udcba', 'ignore') -> 'ä'
>> rehandle_surrogateescape('ä\udcba', 'replace') -> 'ä\ufffd'
>> rehandle_surrogateescape('ä\udcba', 'backslashreplace') -> 'ä\\xba'
>>
> It looks like the first 3 are the same as rehandle_surrogatepass, so
> couldn't they be merged somehow?
>
> handle_surrogates('ä\udcba', 'strict') -> error
> handle_surrogates('ä\udcba', 'ignore') -> 'ä'
> handle_surrogates('ä\udcba', 'replace') -> 'ä\ufffd'
> handle_surrogates('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
> handle_surrogates('ä\udcba', 'surrogatereplace') -> 'ä\\xba'

These functions work with arbitrary error handlers that support
UnicodeTranslateError (for rehandle_surrogatepass) or UnicodeDecodeError
(for rehandle_surrogateescape). They behave differently for surrogate
characters outside of the range U+DC80-U+DCFF.

handle_surrogates() would need a new error handler, "surrogatereplace".

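
For illustration only, here is a minimal sketch of how the two proposed
functions might dispatch to registered error handlers. It is not an actual
implementation. It assumes Python 3.5+ (where 'backslashreplace' also accepts
decode and translate errors), ignores the resume position returned by the
handler, and uses 'utf-8' as a placeholder encoding name in the
UnicodeDecodeError. The choice of UnicodeTranslateError for surrogates outside
U+DC80-U+DCFF is likewise an assumption; the proposal only says they cause an
error.

```python
import codecs
import re

_ANY_SURROGATE = re.compile('[\ud800-\udfff]+')
_ESCAPED_BYTES = re.compile('[\udc80-\udcff]+')


def rehandle_surrogatepass(string, errors):
    """Hand each run of surrogates to the handler as a UnicodeTranslateError."""
    handler = codecs.lookup_error(errors)

    def repl(match):
        exc = UnicodeTranslateError(string, match.start(), match.end(),
                                    'surrogates not allowed')
        # Sketch: the resume position returned by the handler is ignored.
        replacement, _newpos = handler(exc)
        return replacement

    return _ANY_SURROGATE.sub(repl, string)


def rehandle_surrogateescape(string, errors):
    """Hand each run of escaped bytes to the handler as a UnicodeDecodeError."""
    handler = codecs.lookup_error(errors)

    def repl(match):
        run = match.group()
        if not _ESCAPED_BYTES.fullmatch(run):
            # Assumption: the exception type here is chosen for the sketch.
            raise UnicodeTranslateError(string, match.start(), match.end(),
                                        'surrogate outside U+DC80-U+DCFF')
        # U+DC80..U+DCFF stand for the original bytes 0x80..0xFF.
        raw = bytes(ord(ch) - 0xDC00 for ch in run)
        # 'utf-8' is a placeholder; the original encoding is not known here.
        exc = UnicodeDecodeError('utf-8', raw, 0, len(raw), 'undecodable byte')
        replacement, _newpos = handler(exc)
        return replacement

    return _ANY_SURROGATE.sub(repl, string)


print(ascii(rehandle_surrogatepass('ä\udcba', 'backslashreplace')))
print(ascii(rehandle_surrogateescape('ä\udcba', 'backslashreplace')))
```

The two functions differ only in which exception type the handler sees, which
is why 'backslashreplace' produces '\\udcba' in one case and '\\xba' in the
other.
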
>> * compose_surrogate_pairs(string)
>>
>> Converts surrogate pairs to non-BMP characters. E.g.
>>
>> compose_surrogate_pairs('ä\ud808\udf45') -> 'ä\U00012345'
>>
> Perhaps this should be called "compose_astrals".
Maybe. Or "compose_non_bmp". I have no preference and opened this
topic mainly for bikeshedding names.
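
Whatever name is chosen, the behaviour in the quoted example can be sketched
roughly as follows. This is only an illustration under the assumption that
lone (unpaired) surrogates are left untouched, which the proposal does not
spell out; the regex-based approach is one possible way to do it, not
necessarily the intended one.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')


def compose_surrogate_pairs(string):
    """Replace each surrogate pair with the non-BMP character it encodes."""

    def repl(match):
        high, low = match.group()
        # Standard UTF-16 arithmetic: 10 bits from each half, offset 0x10000.
        code = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
        return chr(code)

    return _SURROGATE_PAIR.sub(repl, string)


print(ascii(compose_surrogate_pairs('ä\ud808\udf45')))
```
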