[Python-ideas] Processing surrogates in

Andrew Barnert abarnert at yahoo.com
Thu May 14 10:48:42 CEST 2015


On Wednesday, May 13, 2015 10:31 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:

> On 13.05.15 19:22, Nick Coghlan wrote:
>>  Three potential expected sources of surrogates have been identified:
>> 
>>  * escaped surrogates smuggling arbitrary bytes passed through decoding
>>  by the "surrogateescape" error handler
>>  * surrogates passed through the decoding process by the
>>  "surrogatepass" error handler
>>  * decomposed surrogate pairs for astral characters
> 
> * json
> * pickle
> * email
> * nntplib
> * SimpleHTTPRequestHandler
> * wsgiref
> * cgi
> * tarfile
> * filesystem names (os.decode) and other os calls
> * platform and sysconfig
> * other serializers

As far as I can tell, all of your extra cases are just examples of the surrogateescape error handler, which Nick already mentioned.
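
For anyone who hasn't watched the surrogateescape handler at work, here is a minimal sketch of the round trip. The helper names smuggle and unsmuggle and the sample file name are made up for illustration; only the "surrogateescape" handler itself comes from the stdlib.

```python
def smuggle(raw: bytes) -> str:
    # Hypothetical helper: decode as UTF-8, mapping each undecodable
    # byte 0xNN to the lone surrogate U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")


def unsmuggle(text: str) -> bytes:
    # Hypothetical helper: the inverse, turning U+DCNN back into byte 0xNN.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"           # Latin-1 encoded name, not valid UTF-8
name = smuggle(raw)

assert name == "caf\udce9.txt"  # 0xE9 became the lone surrogate U+DCE9
assert unsmuggle(name) == raw   # the original bytes survive the round trip

# A strict encode refuses the smuggled surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok
```

This is the same mechanism os.fsdecode and os.fsencode rely on for file names on POSIX systems, so a str holding such a surrogate is what I would expect to come out of most of the modules in that list.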


Beyond that, some of these modules may need to understand surrogates internally, but I can't see how those surrogates could get anywhere near the module boundaries. For example, to build and parse JSON's 12-character escape sequences, like "\uD834\uDD1E" for U+1D11E, you obviously need to be able to decompose and compose astral characters internally, but that shouldn't even generate Unicode strings with surrogate pairs in 3.3+, much less expose them to user code.
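
To make the JSON case concrete, here is a rough sketch of the arithmetic involved, checked against what the json module does with a well-formed pair. The decompose and compose helpers are my own illustration, not the json module's internals.

```python
import json


def decompose(cp: int) -> tuple[int, int]:
    # Split an astral code point (U+10000..U+10FFFF) into a UTF-16
    # high/low surrogate pair. Illustrative helper, not a stdlib function.
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def compose(high: int, low: int) -> int:
    # Recombine a high/low surrogate pair into the astral code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


assert decompose(0x1D11E) == (0xD834, 0xDD1E)
assert compose(0xD834, 0xDD1E) == 0x1D11E

# Parsing the 12-character escape yields one code point, not two surrogates.
clef = json.loads('"\\uD834\\uDD1E"')
assert len(clef) == 1
assert ord(clef) == 0x1D11E

# Serializing with the default ensure_ascii=True goes the other way.
encoded = json.dumps("\U0001D11E")
assert encoded == '"\\ud834\\udd1e"'
```

So for a matched pair, the surrogates exist only in the serialized text; the str on either side of the json call holds the single astral character.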

