[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)
ja.py at farowl.co.uk
Wed Sep 17 09:29:20 CEST 2014
This feels like a jython-dev discussion. But anyway ...
On 17/09/2014 00:57, Stephen J. Turnbull wrote:
> The CPython representation uses trailing surrogates only, so it's
> never possible to interpret them as anything but non-characters -- as
> soon as you encounter them you know that it's a lone surrogate.
> Surely you can do the same.
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation. Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters. But the same is true for CPython, AFAIK.
They don't check. I agree that since only the trailing surrogate code
points are allowed, you can tell that you have one, even in the UTF-16
form. The problem is that, if strings containing lone trailing
surrogates are allowed, then:
u'\udc83' in u'abc\U00010083xyz'
are both True, if implemented in the obvious way on the UTF-16
representation. And this should not be so in Jython, which claims to be
a wide build. (I can't actually type the second one, but I can get the
same effect in Jython 2.7b3 via a java.lang.StringBuilder.) I believe
that the usual string operations work correctly on the UTF-16 version of
the string, as long as indexes are adjusted correctly.
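
To make the containment problem concrete, here is a minimal Python 3 sketch. It is only an illustration, not Jython code: the helpers `utf16_units` and `naive_contains` are hypothetical names introduced here, and the search is a deliberately naive scan over UTF-16 code units. It assumes Python 3.4 or later, where the UTF-16 codecs accept the `surrogatepass` error handler.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    # surrogatepass lets a lone surrogate through the encoder unchanged.
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def naive_contains(haystack, needle):
    """Substring test on raw UTF-16 code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


lone = "\udc83"              # a lone trailing surrogate
astral = "abc\U00010083xyz"  # U+10083 is the pair D800 DC83 in UTF-16

# CPython 3 compares code points, so the lone surrogate is not found.
code_point_result = lone in astral

# A search over UTF-16 code units matches the second half of the pair.
code_unit_result = naive_contains(astral, lone)
```

The two results differ: the code point comparison does not find the surrogate, while the code unit scan does, which is the wrong answer a wide build should not give.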
If we think it is ok that code using such methods gives the wrong answer
when fed strings containing smuggled bytes, then isolated (trailing)
surrogates could be allowed. It's the user's fault for calling the
method on that data. But I think it kinder that our implementation
defend users from these wrong answers. In the latest state of Jython, we
do this by rigorously preventing the construction of a PyUnicode
containing a lone surrogate, so we can just use UTF-16 operations
without further checks.
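
As a rough sketch of what "preventing the construction" could look like, the following Python 3 code checks a sequence of UTF-16 code units for unpaired surrogates before accepting it. This is not the Jython implementation: `has_lone_surrogate` and `make_text` are hypothetical names, and representing a string as a list of ints is an assumption made only to keep the example self-contained.

```python
def has_lone_surrogate(units):
    """Return True if the UTF-16 code units contain an unpaired surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A leading surrogate must be followed by a trailing one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            return True
        i += 1
    return False


def make_text(units):
    """Hypothetical constructor that refuses lone surrogates."""
    if has_lone_surrogate(units):
        raise ValueError("lone surrogate in UTF-16 data")
    return list(units)
```

With a guard of this kind at every construction point, later operations can work on the code units without re-checking for unpaired surrogates.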
I'm not sure that rigour will be universally welcomed, and clearly it
precludes PEP-383 byte smuggling.
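
For readers unfamiliar with the smuggling in question, this short sketch shows where a code point such as U+DC83 comes from in CPython 3. It uses the standard `surrogateescape` error handler defined by PEP 383. The byte string is an arbitrary example chosen here, not one from the discussion.

```python
raw = b"caf\x83"  # 0x83 cannot be decoded as UTF-8 on its own

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
smuggled = text[-1]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")
```

An implementation that refuses every lone surrogate has no way to hold `text`, which is why the strict approach and PEP 383 are incompatible.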