[Python-ideas] Processing surrogates in

Serhiy Storchaka storchaka at gmail.com
Thu May 14 12:15:18 CEST 2015


On 14.05.15 11:48, Andrew Barnert via Python-ideas wrote:
> On Wednesday, May 13, 2015 10:31 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>> On 13.05.15 19:22, Nick Coghlan wrote:
>>>   Three potential expected sources of surrogates have been identified:
>>>
>>>   * escaped surrogates smuggling arbitrary bytes passed through decoding
>>>   by the "surrogateescape" error handler
>>>   * surrogates passed through the decoding process by the
>>>   "surrogatepass" error handler
>>>   * decomposed surrogate pairs for astral characters
>>
>> * json
>> * pickle
>> * email
>> * nntplib
>> * SimpleHTTPRequestHandler
>> * wsgiref
>> * cgi
>> * tarfile
>> * filesystem names (os.fsdecode) and other os calls
>> * platform and sysconfig
>> * other serializers
>
> As far as I can tell, all of your extra cases are just examples of the surrogateescape error handler, which Nick already mentioned.

Not all. JSON allows surrogates to be injected as \uXXXX escapes. Pickle
with protocol 0 uses the raw-unicode-escape encoding, which allows
surrogates.
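
A minimal sketch of these two cases, assuming CPython 3 and only the
standard json and pickle modules. The variable names are just for
illustration. No error handler is involved in either case, yet both
calls hand back a lone surrogate:

```python
import json
import pickle

# A JSON document that contains an escaped lone surrogate.
from_json = json.loads('"\\ud800"')

# Round-trip a lone surrogate through pickle protocol 0.
data = pickle.dumps('\ud800', protocol=0)
from_pickle = pickle.loads(data)

# Both results are one-character strings holding U+D800.
print(ascii(from_json), ascii(from_pickle))

# Such a string cannot be encoded to UTF-8 with the default handler.
try:
    from_json.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```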

The UTF-7 encoding also allows surrogates. And yet another source of
surrogates is Python source code: eval(), etc.
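
Both can be sketched as follows, assuming CPython 3. The byte string is
just an illustrative UTF-7 input whose shifted section encodes the
single code unit U+D801:

```python
# UTF-7 carries UTF-16 code units in base64, so an unpaired surrogate
# can be decoded without any special error handler.
from_utf7 = b'a+2AE-b'.decode('utf-7')

# A string literal in Python source may contain a surrogate escape.
from_eval = eval(r"'\ud800'")

print(ascii(from_utf7), ascii(from_eval))
```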

Tkinter can produce surrogates. The XML parser unfortunately can't
(unfortunately, because this makes it impossible to handle in Python
some files generated by third-party programs). I'm not sure about
sqlite3. Any extension module or wrapper around a third-party library
could potentially produce surrogates.
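
The XML case can be illustrated with a rough sketch, assuming the
expat-based xml.etree.ElementTree parser from the standard library.
The helper xml_accepts() is hypothetical and exists only for this
example. A character reference to a surrogate is rejected, while an
ordinary reference is accepted:

```python
import xml.etree.ElementTree as ET

def xml_accepts(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

# A reference to an ordinary character (the euro sign).
ordinary = xml_accepts("<a>&#x20AC;</a>")

# A reference to a lone surrogate, as some third-party tools emit.
surrogate = xml_accepts("<a>&#xD800;</a>")

print(ordinary, surrogate)
```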



