[Python-ideas] Processing surrogates in

Serhiy Storchaka storchaka at gmail.com
Mon May 4 10:15:47 CEST 2015


Surrogate code points (U+D800-U+DFFF) are not valid in well-formed 
Unicode text, but Python allows them in Unicode strings for several 
purposes.

1) To represent UTF-8, UTF-16 or UTF-32 encoded strings that contain 
surrogate characters. This data can come from other programs, including 
Python 2.

2) To represent undecodable bytes in an ASCII-compatible encoding with 
the "surrogateescape" error handler.

So surrogate characters can be obtained from the "surrogateescape" or 
"surrogatepass" error handlers, or created manually with chr() or %c. 
Some encodings (UTF-7, unicode-escape) also allow surrogate characters.
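
For illustration, a small sketch of the ways listed above in which a 
lone surrogate can end up in a str. The byte values and variable names 
are just examples picked for this sketch:

```python
# Undecodable byte 0xBA is smuggled through as U+DCBA by "surrogateescape".
escaped = b'\xc3\xa4\xba'.decode('utf-8', 'surrogateescape')

# A UTF-8-style encoding of a lone surrogate is accepted by "surrogatepass".
passed = b'\xed\xb2\xba'.decode('utf-8', 'surrogatepass')

# Created manually.
from_chr = chr(0xDCBA)
from_format = '%c' % 0xDCBA

# unicode-escape accepts an escape that names a surrogate.
from_escape = b'\\udcba'.decode('unicode-escape')
```

All five values contain the same code point, U+DCBA.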

But on output, surrogate characters can cause failures, e.g. encoding 
to UTF-8 with the default "strict" handler raises UnicodeEncodeError.
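
A minimal sketch of such a failure, using the same sample string as the 
examples in this message (the variable names are only illustrative):

```python
s = 'ä\udcba'

# The default "strict" handler refuses to encode a lone surrogate.
try:
    s.encode('utf-8')
except UnicodeEncodeError as exc:
    failure = exc.reason
else:
    failure = None

# The same handler that produced the surrogate can undo it on output.
restored = s.encode('utf-8', 'surrogateescape')
```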

Issue18814 proposes several functions for working with surrogate and 
astral characters. Each of these functions takes a string and returns a 
string.

* rehandle_surrogatepass(string, errors)

Handles surrogate characters (U+D800-U+DFFF) with the specified error 
handler. E.g.

   rehandle_surrogatepass('ä\udcba', 'strict') -> error
   rehandle_surrogatepass('ä\udcba', 'ignore') -> 'ä'
   rehandle_surrogatepass('ä\udcba', 'replace') -> 'ä\ufffd'
   rehandle_surrogatepass('ä\udcba', 'backslashreplace') -> 'ä\\udcba'
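
A rough sketch of how this could behave, not the code from the patch. It 
assumes each run of surrogates is reported to the error handler as a 
UnicodeTranslateError, that the handler resumes at or after the end of 
the run (true for the built-in handlers), and a Python version in which 
"backslashreplace" accepts translate errors (3.5+):

```python
import codecs

def _is_surrogate(ch):
    return '\ud800' <= ch <= '\udfff'

def rehandle_surrogatepass(string, errors):
    """Sketch only: pass every run of surrogates to the named handler."""
    handler = codecs.lookup_error(errors)
    out = []
    i, n = 0, len(string)
    while i < n:
        if _is_surrogate(string[i]):
            j = i + 1
            while j < n and _is_surrogate(string[j]):
                j += 1
            exc = UnicodeTranslateError(string, i, j,
                                        'surrogates not allowed')
            # Handlers return (replacement, position to resume at).
            replacement, i = handler(exc)
            out.append(replacement)
        else:
            out.append(string[i])
            i += 1
    return ''.join(out)
```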

* rehandle_surrogateescape(string, errors)

Handles non-ASCII bytes encoded as surrogate characters in the range 
U+DC80-U+DCFF with the specified error handler. Surrogate characters 
outside the range U+DC80-U+DCFF cause an error. E.g.

   rehandle_surrogateescape('ä\udcba', 'strict') -> error
   rehandle_surrogateescape('ä\udcba', 'ignore') -> 'ä'
   rehandle_surrogateescape('ä\udcba', 'replace') -> 'ä\ufffd'
   rehandle_surrogateescape('ä\udcba', 'backslashreplace') -> 'ä\\xba'
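
Again only a sketch under assumptions. Here each escaped byte is 
reported to the handler as a one-byte UnicodeDecodeError, which is one 
way to get the '\\xba' result above. The 'ascii' codec name, the reason 
strings and the exception raised for other surrogates are my guesses, 
not taken from the patch:

```python
import codecs

def rehandle_surrogateescape(string, errors):
    """Sketch only: re-handle bytes escaped as U+DC80-U+DCFF."""
    handler = codecs.lookup_error(errors)
    out = []
    for pos, ch in enumerate(string):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte and report it as undecodable.
            raw = bytes([cp - 0xDC00])
            exc = UnicodeDecodeError('ascii', raw, 0, 1,
                                     'ordinal not in range(128)')
            out.append(handler(exc)[0])
        elif 0xD800 <= cp <= 0xDFFF:
            # Any other surrogate is an error regardless of the handler.
            raise UnicodeTranslateError(string, pos, pos + 1,
                                        'surrogate is not an escaped byte')
        else:
            out.append(ch)
    return ''.join(out)
```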

* handle_astrals(string, errors)

Handles non-BMP characters (U+10000-U+10FFFF) with the specified error 
handler. E.g.

   handle_astrals('ä\U00012345', 'strict') -> error
   handle_astrals('ä\U00012345', 'ignore') -> 'ä'
   handle_astrals('ä\U00012345', 'replace') -> 'ä\ufffd'
   handle_astrals('ä\U00012345', 'backslashreplace') -> 'ä\\U00012345'
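
A possible sketch, assuming every astral character is reported 
separately as a UnicodeTranslateError and the handler's resume position 
is ignored. The regular expression and the reason string are choices 
made for this sketch only:

```python
import codecs
import re

_ASTRAL = re.compile('[\U00010000-\U0010ffff]')

def handle_astrals(string, errors):
    """Sketch only: pass every non-BMP character to the named handler."""
    handler = codecs.lookup_error(errors)

    def replace(match):
        exc = UnicodeTranslateError(string, match.start(), match.end(),
                                    'non-BMP character')
        # A function result is inserted literally, so backslashes in
        # the handler's replacement are not interpreted by re.
        return handler(exc)[0]

    return _ASTRAL.sub(replace, string)
```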

* decompose_astrals(string)

Converts non-BMP characters (U+10000-U+10FFFF) to surrogate pairs. E.g.

   decompose_astrals('ä\U00012345') -> 'ä\ud808\udf45'
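
The conversion is the usual UTF-16 arithmetic. A sketch of it in pure 
Python (not the implementation from the patch):

```python
def decompose_astrals(string):
    """Sketch only: replace each non-BMP character with a surrogate pair."""
    out = []
    for ch in string:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            out.append(chr(0xD800 + (cp >> 10)))    # high surrogate
            out.append(chr(0xDC00 + (cp & 0x3FF)))  # low surrogate
        else:
            out.append(ch)
    return ''.join(out)
```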

* compose_surrogate_pairs(string)

Converts surrogate pairs to non-BMP characters. E.g.

   compose_surrogate_pairs('ä\ud808\udf45') -> 'ä\U00012345'
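
The reverse direction, again only as a sketch. It assumes that lone or 
wrongly ordered surrogates are left untouched, which the description 
above does not specify:

```python
import re

# A high surrogate immediately followed by a low surrogate.
_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def compose_surrogate_pairs(string):
    """Sketch only: join each surrogate pair into one non-BMP character."""
    def join(match):
        high = ord(match.group(1)) - 0xD800
        low = ord(match.group(2)) - 0xDC00
        return chr(0x10000 + (high << 10) + low)

    return _PAIR.sub(join, string)
```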

Function names are preliminary and open to discussion! The location 
(currently the codecs module) and the interface are open to discussion 
too.

These functions revive UnicodeTranslateError, which is currently unused 
(although several error handlers already handle it).
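
To show what "already handle it" means, a short sketch that passes a 
hand-made UnicodeTranslateError to the built-in handler functions of the 
codecs module. The sample string, positions and reason are made up for 
the example:

```python
import codecs

# UnicodeTranslateError(object, start, end, reason)
exc = UnicodeTranslateError('ä\udcba', 1, 2, 'surrogates not allowed')

# Each handler returns (replacement, position to resume at).
ignored = codecs.ignore_errors(exc)
replaced = codecs.replace_errors(exc)

# The strict handler raises the exception it is given.
try:
    codecs.strict_errors(exc)
except UnicodeTranslateError as raised:
    strict_result = raised
```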

The proposed patch provides a Python implementation in the codecs 
module, but after discussion I'll provide a much more efficient (O(1) in 
the best case) C implementation.


