[Python-ideas] Processing surrogates in

Thu May 14 14:21:10 CEST 2015

On May 14, 2015, at 03:15, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 
>> On 14.05.15 11:48, Andrew Barnert via Python-ideas wrote:
>>> On Wednesday, May 13, 2015 10:31 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>>> On 13.05.15 19:22, Nick Coghlan wrote:
>>>>  Three potential expected sources of surrogates have been identified:
>>>> 
>>>>  * escaped surrogates smuggling arbitrary bytes passed through decoding
>>>>  by the "surrogateescape" error handler
>>>>  * surrogates passed through the decoding process by the
>>>>  "surrogatepass" error handler
>>>>  * decomposed surrogate pairs for astral characters
>>> 
>>> * json
>>> * pickle
>>> * email
>>> * nntplib
>>> * SimpleHTTPRequestHandler
>>> * wsgiref
>>> * cgi
>>> * tarfile
>>> * filesystem names (os.fsdecode) and other os calls
>>> * platform and sysconfig
>>> * other serializers
>> 
>> As far as I can tell, all of your extra cases are just examples of the surrogateescape error handler, which Nick already mentioned.
> 
> Not all. JSON allows to inject surrogates as \uXXXX.

JSON specifically requires treating \uXXXX\uYYYY as a "12-character escape sequence" for a single character if XXXX and YYYY are a surrogate pair. If Python is handling that wrong, then it needs to be fixed (but I don't think it is; I'll test tomorrow).
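A minimal sketch of how that test might look, assuming only the standard-library json module. The variable names are illustrative, and the sample character (U+1D11E) is the one used later in this message:

```python
import json

# A JSON document whose string is one astral character written as an
# escaped surrogate pair (the "12-character escape sequence").
document = '"\\ud834\\udd1e"'

decoded = json.loads(document)

# Report how many code points came back and whether any are surrogates.
surrogates = [ch for ch in decoded if 0xD800 <= ord(ch) <= 0xDFFF]
print(len(decoded), len(surrogates))

# Going the other way, the default ensure_ascii=True escapes the astral
# character as a surrogate pair again.
reencoded = json.dumps("\U0001D11E")
print(reencoded)
```

This only covers a well-formed pair; an unpaired \uXXXX escape is a separate case that the sketch does not examine.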

> Pickle with protocol 0 uses the raw-unicode-escape encoding that allows surrogates.

Sure, if you pickle a unicode object in a narrow 2.x build, it gets pickled as surrogates. But when you unpickle it in 3.4, surely those surrogates are converted to astrals? If not, then every time you, e.g., pickle a Windows filename containing astrals in 2.x for use with win32api, then unpickle it in 3.4 and try to use it with win32api, it won't work. Unless we actually are breaking those filenames, and win32api (and everything else) is working around the problem? Even if that's true, the obvious answer would be to fix the problem rather than to provide workaround tools to libraries that must already have those workarounds anyway.
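A rough sketch of the Python 3 half of that check, assuming only the standard-library pickle module. It round-trips an astral character through protocol 0 inside a single interpreter, so it says nothing about what a narrow 2.x build actually writes; that part would need a pickle produced by such a build:

```python
import pickle

astral = "\U0001D11E"

# Protocol 0 is the text protocol that relies on raw-unicode-escape.
payload = pickle.dumps(astral, protocol=0)
print(payload)

restored = pickle.loads(payload)

# Count surrogate code points in the result; for a clean round trip
# there should be none.
surrogate_count = sum(0xD800 <= ord(ch) <= 0xDFFF for ch in restored)
print(len(restored), surrogate_count)
```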

> There is also the UTF-7 encoding that allows surrogates.

Encoding to UTF-7 requires first encoding to UTF-16 and then doing the modified-base-64 thing. And decoding from UTF-7 requires reversing both those steps. There's no way surrogates can escape into Unicode from that. I suppose you could, instead of decoding from UTF-7, just do the base 64 decode and then skip the UTF-16 decode and instead just widen the code units, but that's not a valid thing to do, and I can't see why anyone would do it.
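To make the distinction concrete, here is a small sketch, assuming CPython's built-in utf-7 and utf-16-be codecs. It contrasts a proper round trip with the "widen the code units" shortcut described above; the names are illustrative only:

```python
import struct

astral = "\U0001D11E"

# Proper path: encode to UTF-7 and decode back again.
round_tripped = astral.encode("utf-7").decode("utf-7")

# Shortcut path: take the UTF-16 code units and widen each one to a
# code point without decoding them as UTF-16. This is the invalid step.
utf16 = astral.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
widened = "".join(chr(unit) for unit in units)

print(len(round_tripped), len(widened))

# Undoing the damage means pushing the surrogates back through a real
# UTF-16 decode ("surrogatepass" lets the encoder accept them).
repaired = widened.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
```

Only the shortcut path leaves two surrogate code points in the string; the codec round trip returns the original single character.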

> And yet one source of surrogates -- Python sources. eval(), etc.

If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going to get an illegal Unicode string made of 2 surrogate code points instead of either an error or the single-character string '\U0001D11E'?

If so, again, I think that's a bug that needs to be fixed, not worked around. There's no legitimate reason for any source code to expect that to be an illegal length-2 string.
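One way to answer the question empirically is sketched below. The helper has_surrogates is hypothetical, not a standard-library function, and the result of the eval depends on the interpreter version it runs under, so no particular output is claimed here:

```python
def has_surrogates(text):
    """Return True if any code point in text lies in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Evaluate the escaped pair exactly as it would appear in source code.
from_source = eval(r"'\uD834\uDD1E'")

# The same character written as a single 8-digit escape, for comparison.
astral = "\U0001D11E"

print(len(from_source), has_surrogates(from_source))
print(len(astral), has_surrogates(astral))
```

If the first line reports a length of 2 with surrogates present, the literal was not combined into a single character.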

> Tkinter can produce surrogates. XML parser unfortunately can't (unfortunately - because it makes impossible to handle with Python some files generated by third-party programs). I'm not sure about sqlite3. Any extension module, any wrapper around third-party library could potentially produce surrogates.

What C API function are they calling to make a PyUnicode out of a UTF-16 char* or wchar_t* or whatever without decoding it as UTF-16? And why do we have such a function?
