[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)

Jeff Allen ja.py at farowl.co.uk
Wed Sep 17 09:29:20 CEST 2014


This feels like a jython-dev discussion. But anyway ...

On 17/09/2014 00:57, Stephen J. Turnbull wrote:
> The CPython representation uses trailing surrogates only[1], so it's
> never possible to interpret them as anything but non-characters -- as
> soon as you encounter them you know that it's a lone surrogate.
> Surely you can do the same.
>
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation.  Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters.  But the same is true for CPython, AFAIK.
They don't check. I agree that since only the trailing surrogate code 
points are allowed, you can tell that you have one, even in the UTF-16 
form. The problem is that, if strings containing lone trailing 
surrogates are allowed, then:

u'\udc83' in u'abc\U00010083xyz'
u'abc\U00010083xyz'.endswith(u'\udc83xyz')

are both True if implemented in the obvious way on the UTF-16 
representation, because U+10083 is stored there as the surrogate pair 
D800 DC83. This should not be so in Jython, which claims to be a wide 
build. (I can't actually type the second one, but I can get the same 
effect in Jython 2.7b3 via a java.lang.StringBuilder.) I believe 
that the usual string operations work correctly on the UTF-16 version of 
the string, as long as indexes are adjusted correctly.
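
To make the false positives concrete, here is a minimal sketch for 
CPython 3.4 or later that mimics the "obvious" implementation by 
comparing raw UTF-16 code units. The helper names (utf16_units, 
naive_contains, naive_endswith) are hypothetical and are not Jython 
code; they only illustrate the effect.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a tuple of ints.

    'surrogatepass' lets lone surrogates through unchanged.
    """
    data = s.encode('utf-16-le', 'surrogatepass')
    return struct.unpack('<%dH' % (len(data) // 2), data)

def naive_contains(haystack, needle):
    """Substring test on code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

def naive_endswith(haystack, suffix):
    """Suffix test on code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(suffix)
    return len(n) <= len(h) and h[len(h) - len(n):] == n

s = 'abc\U00010083xyz'

# Code-unit view: the lone trailing surrogate matches half of the pair.
unit_contains = naive_contains(s, '\udc83')        # True
unit_endswith = naive_endswith(s, '\udc83xyz')     # True

# Code-point view (what a wide build should report).
point_contains = '\udc83' in s                     # False
point_endswith = s.endswith('\udc83xyz')           # False
```

The two views disagree only because the needle contains a lone 
surrogate; with well-formed needles the code-unit comparison gives the 
same answers as the code-point one.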

If we think it is ok that code using such methods gives the wrong answer 
when fed strings containing smuggled bytes, then isolated (trailing) 
surrogates could be allowed. It's the user's fault for calling the 
method on that data.  But I think it kinder that our implementation 
defend users from these wrong answers. In the latest state of Jython, we 
do this by rigorously preventing the construction of a PyUnicode 
containing a lone surrogate, so we can just use UTF-16 operations 
without further checks.
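
As an illustration of the kind of check this implies, the sketch below 
rejects any sequence of UTF-16 code units that contains an unpaired 
surrogate. It is an assumption about the general technique, not 
Jython's actual PyUnicode code; has_lone_surrogate and strict_units are 
hypothetical names.

```python
def has_lone_surrogate(units):
    """Return True if the UTF-16 code units contain an unpaired surrogate."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A leading surrogate must be followed by a trailing one.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            return True
        i += 1
    return False

def strict_units(units):
    """Accept only well-formed UTF-16; raise ValueError otherwise."""
    if has_lone_surrogate(units):
        raise ValueError('lone surrogate in UTF-16 data')
    return tuple(units)
```

Once every string has passed such a check at construction time, a 
code-unit search can never match half of a surrogate pair, because no 
needle can begin or end in the middle of one.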

I'm not sure that rigour will be universally welcomed, and clearly it 
precludes PEP-383 byte smuggling.
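
For reference, this is what PEP 383 byte smuggling looks like in 
CPython 3, using the standard 'surrogateescape' error handler. An 
undecodable byte 0x83 becomes the lone trailing surrogate U+DC83, which 
is exactly the kind of code point a strict implementation would refuse 
to store. The sample bytes are made up for the example.

```python
raw = b'caf\x83'                                   # 0x83 is not valid ASCII

# Decoding smuggles the bad byte in as a lone trailing surrogate.
text = raw.decode('ascii', 'surrogateescape')
smuggled = text[-1]                                # '\udc83'

# Encoding with the same handler restores the original byte.
round_trip = text.encode('ascii', 'surrogateescape')
```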

Jeff

