[Python-ideas] Processing surrogates in
Serhiy Storchaka
storchaka at gmail.com
Mon May 4 10:15:47 CEST 2015
Surrogate code points (U+D800-U+DFFF) are not allowed in well-formed
Unicode text, but Python allows them in Unicode strings for two purposes.
1) To represent UTF-8, UTF-16 or UTF-32 encoded strings that contain
surrogate characters. This data can come from other programs, including
Python 2.
2) To represent undecodable bytes in an ASCII-compatible encoding with the
"surrogateescape" error handler.
So surrogate characters can be obtained from the "surrogateescape" or
"surrogatepass" error handlers, or created manually with chr() or %c.
Some codecs (UTF-7, unicode-escape) also allow surrogate characters.
But on output, surrogate characters can cause failures.
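To make the problem concrete, here is a small sketch using only existing str and bytes methods. The sample byte string and variable names are purely illustrative.

```python
# An undecodable byte smuggled into a str via "surrogateescape".
escaped = b'caf\xe9'.decode('utf-8', 'surrogateescape')
assert escaped == 'caf\udce9'

# A lone surrogate created manually with chr().
manual = 'x' + chr(0xD800)

# On output, strict encoding rejects both strings.
failures = []
for text in (escaped, manual):
    try:
        text.encode('utf-8')
    except UnicodeEncodeError as exc:
        failures.append(exc.reason)

# Only the matching error handler can encode them.
assert escaped.encode('utf-8', 'surrogateescape') == b'caf\xe9'
assert manual.encode('utf-8', 'surrogatepass') == b'x\xed\xa0\x80'
```

The point is that code receiving such a string has to know which handler produced the surrogates before it can write the string out safely.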
Issue18814 proposes several functions for working with surrogate and
astral characters. All these functions take a string and return a string.
* rehandle_surrogatepass(string, errors)
Handles surrogate characters (U+D800-U+DFFF) with the specified error
handler. E.g.
rehandle_surrogatepass('ä\udcba', 'strict') -> error
rehandle_surrogatepass('ä\udcba', 'ignore') -> 'ä'
rehandle_surrogatepass('ä\udcba', 'replace') -> 'ä\ufffd'
rehandle_surrogatepass('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
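The behaviour listed above could be approximated in pure Python roughly as follows. This is only a sketch, not the code from the patch: the name rehandle_surrogatepass_sketch, the regular expression and the error message are my assumptions, and only the four handlers shown above are covered.

```python
import re

# Matches any single surrogate code point.
_SURROGATES = re.compile('[\ud800-\udfff]')

def rehandle_surrogatepass_sketch(string, errors='strict'):
    """Hypothetical stand-in for the proposed rehandle_surrogatepass()."""
    def handle(match):
        if errors == 'strict':
            raise UnicodeTranslateError(string, match.start(), match.end(),
                                        'surrogates not allowed')
        if errors == 'ignore':
            return ''
        if errors == 'replace':
            return '\ufffd'
        if errors == 'backslashreplace':
            return '\\u%04x' % ord(match.group())
        raise LookupError('unsupported error handler: %r' % errors)
    return _SURROGATES.sub(handle, string)
```

A real implementation would presumably look the handler up with codecs.lookup_error() instead of hard-coding the names.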
* rehandle_surrogateescape(string, errors)
Handles non-ASCII bytes encoded as surrogate characters in the range
U+DC80-U+DCFF with the specified error handler. Surrogate characters
outside the range U+DC80-U+DCFF cause an error. E.g.
rehandle_surrogateescape('ä\udcba', 'strict') -> error
rehandle_surrogateescape('ä\udcba', 'ignore') -> 'ä'
rehandle_surrogateescape('ä\udcba', 'replace') -> 'ä\ufffd'
rehandle_surrogateescape('ä\udcba', 'backslashreplace') -> 'ä\\xba'
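For comparison, similar results can be had today by round-tripping through bytes. This sketch is not the proposed function: the name is made up, 'strict' raises UnicodeDecodeError rather than UnicodeTranslateError, and adjacent escaped bytes that happen to form a valid UTF-8 sequence would be decoded as a character instead of being handled one by one. The 'backslashreplace' handler for decoding needs Python 3.5 or later.

```python
def rehandle_surrogateescape_sketch(string, errors='strict'):
    """Hypothetical approximation via an encode/decode round trip."""
    # Turn U+DC80-U+DCFF back into the raw bytes they stand for.
    # Surrogates outside that range raise UnicodeEncodeError here.
    raw = string.encode('utf-8', 'surrogateescape')
    # Decode again, letting the requested handler deal with the bad bytes.
    return raw.decode('utf-8', errors)
```

This also shows why 'backslashreplace' yields \xba here and not \udcba: the handler sees the original byte, not the surrogate.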
* handle_astrals(string, errors)
Handles non-BMP characters (U+10000-U+10FFFF) with the specified error
handler. E.g.
handle_astrals('ä\U00012345', 'strict') -> error
handle_astrals('ä\U00012345', 'ignore') -> 'ä'
handle_astrals('ä\U00012345', 'replace') -> 'ä\ufffd'
handle_astrals('ä\U00012345', 'backslashreplace') -> 'ä\\U00012345'
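A rough pure-Python equivalent of the examples above, again only as a sketch. The name handle_astrals_sketch, the regular expression and the error message are assumptions of mine and are not taken from the patch.

```python
import re

# Matches any single character outside the Basic Multilingual Plane.
_ASTRALS = re.compile('[\U00010000-\U0010ffff]')

def handle_astrals_sketch(string, errors='strict'):
    """Hypothetical stand-in for the proposed handle_astrals()."""
    def handle(match):
        if errors == 'strict':
            raise UnicodeTranslateError(string, match.start(), match.end(),
                                        'non-BMP character not allowed')
        if errors == 'ignore':
            return ''
        if errors == 'replace':
            return '\ufffd'
        if errors == 'backslashreplace':
            return '\\U%08x' % ord(match.group())
        raise LookupError('unsupported error handler: %r' % errors)
    return _ASTRALS.sub(handle, string)
```

This kind of filter is useful before handing text to a consumer that only supports the BMP.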
* decompose_astrals(string)
Converts non-BMP characters (U+10000-U+10FFFF) to surrogate pairs. E.g.
decompose_astrals('ä\U00012345') -> 'ä\ud808\udf45'
* compose_surrogate_pairs(string)
Converts surrogate pairs to non-BMP characters. E.g.
compose_surrogate_pairs('ä\ud808\udf45') -> 'ä\U00012345'
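The last two functions are plain UTF-16 arithmetic. A minimal sketch, with hypothetical names and regular expressions that are not from the patch:

```python
import re

_ASTRALS = re.compile('[\U00010000-\U0010ffff]')
_PAIRS = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def decompose_astrals_sketch(string):
    """Replace each non-BMP character with its UTF-16 surrogate pair."""
    def split(match):
        offset = ord(match.group()) - 0x10000
        high = 0xD800 + (offset >> 10)    # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
        return chr(high) + chr(low)
    return _ASTRALS.sub(split, string)

def compose_surrogate_pairs_sketch(string):
    """Replace each high+low surrogate pair with one non-BMP character."""
    def join(match):
        high = ord(match.group(1)) - 0xD800
        low = ord(match.group(2)) - 0xDC00
        return chr(0x10000 + (high << 10) + low)
    return _PAIRS.sub(join, string)
```

Lone surrogates that are not part of a high+low pair are left untouched by the second function in this sketch.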
The function names are preliminary and open to discussion! The location
(currently the codecs module) and the interface are open to discussion too.
These functions revive UnicodeTranslateError, which is currently unused
(although several error handlers already handle it).
The proposed patch provides a Python implementation in the codecs module,
but after discussion I'll provide a much more efficient (O(1) in the best
case) C implementation.