[Python-ideas] Processing surrogates in

Thu May 14 21:38:19 CEST 2015

On May 14, 2015, at 07:38, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> 
> Andrew Barnert via Python-ideas writes:
> 
>>> And yet one source of surrogates -- Python sources. eval(), etc.
> 
> Yep:
> 
> $ python3.4
> Python 3.4.3 (default, Mar 10 2015, 14:53:35) 
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> chr((16*13+8)*256)
> '\ud800'
>>>> '\ud800'
> '\ud800'
>>>> '\ud834\udd1e'
> '\ud834\udd1e'
> 
>> If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going
>> to get an illegal Unicode string made of 2 surrogate code points
>> instead of either an error or the single-character string
>> '\U0001D11E'?
> 
> Yes.  How else do you propose to test the surrogateescape error
> handler?  Now, are you sitting down?  If not, you should before
> looking at the next example. ;-)
> 
>>>> '\U0000d834\U0000dd1e'
> '\ud834\udd1e'
> 
> Isn't that disgusting?  

No; if the former gave you surrogates, the latter pretty much has to. Otherwise, that would essentially mean you can create illegal strings by accident, but it's hard to create them in the obvious explicitly intentional way. (The other way around might be reasonable, however.)

At any rate, I can see that allowing people to go out of their way to create invalid strings is potentially useful (for testing invalid string handling, if nothing else) and possibly a "consenting adults" issue even if it weren't. So maybe that's the one case from the list that isn't just an example of Nick's three general cases.

But meanwhile: if you're intentionally writing literals for invalid strings to test for invalid string handling, is that an argument for this proposal? For example, I might want to test that some fast JSON library does the same thing as the stdlib one in all cases; if there's a text-to-text codec in front of it, that makes the test a lot harder to write.

So I think it still comes down to what Nick said: if you've got surrogates in your unicode, either you have a bug at your boundaries, you're dealing with surrogate escapes, or I forget the third... Or you're doing it intentionally and don't want to fix it. (And, although you didn't re-raise it, Serhiy mentioned eval, so let me just say that something like "I called eval on some arbitrary string that happened to be JSON not Python" sounds like a bug at the boundaries case, not a separate problem.)