[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)

Thu Sep 18 06:45:36 CEST 2014

Jeff Allen writes:

 > This feels like a jython-dev discussion. But anyway ...

Well, if the same representation could be used in Jython you could
just point to PEP 383 and be done with it.
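
For readers who don't have PEP 383 in their heads, here is a minimal
sketch of the "smuggling" in question, assuming Python 3 and its
standard "surrogateescape" error handler.  The sample bytes are made
up for illustration.

```python
# PEP 383: each undecodable byte 0x80..0xFF is mapped to a lone low
# surrogate U+DC80..U+DCFF by the "surrogateescape" error handler.
raw = b"abc\x83xyz"            # 0x83 is not valid UTF-8 on its own
text = raw.decode("utf-8", errors="surrogateescape")

smuggled = text[3]
print(hex(ord(smuggled)))      # 0xdc83

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

So the byte 0x83 travels through str as U+DC83, which happens to be
the trailing surrogate of U+10083 in UTF-16.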

 > u'\udc83' in u'abc\U00010083xyz'

IMHO being able to type that is a bug.  There should be no literal
notation for surrogates in Python (that is, if you type a literal that
looks like it refers to a surrogate, you should get a UnicodeError).
The "right way" (IMHO) to spell that is

chr(0xdc83) in u'abc\U00010083xyz'

I'm not Guido, and don't claim to channel him on this.  But it seems
reasonable to me that Unicode literals should conform to Unicode in
this way.  I might even extend that to noncharacters (the last two
code points in each plane, U+xxFFFE and U+xxFFFF, and the 32-point
"hole" U+FDD0..U+FDEF in the Arabic Presentation Forms-A block).

I'll grant that chr() is an unfortunate spelling, but I would imagine
we could live with that since chr() goes back forever with these
semantics.
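
As a point of comparison, a small sketch of what the chr() spelling
gives on an implementation that compares code point by code point.
I'm assuming CPython 3.3 or later here (the PEP 393 representation);
the variable names are mine.

```python
lone = chr(0xDC83)                 # a lone trailing (low) surrogate
astral = "abc\U00010083xyz"        # U+10083 lies outside the BMP

print(len(astral))                 # 7: one code point per character
print(lone in astral)              # False: no code point equals U+DC83
print(astral.endswith(lone + "xyz"))   # False, for the same reason
```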

 > u'abc\U00010083xyz'.endswith(u'\udc83xyz')
 > 
 > are both True, if implemented in the obvious way on the UTF-16 
 > representation. And this should not be so in Jython, which claims to be 
 > a wide build. (I can't actually type the second one, but I can get the 
 > same effect in Jython 2.7b3 via a java.lang.StringBuilder.)

I agree that's very ugly, but AFAIK that's how things work in a narrow
build of CPython (which stores astral characters as UTF-16 surrogate
pairs internally).
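
To make the problem concrete, here is a rough simulation of the
"obvious" code-unit implementation.  This is a sketch only: it assumes
Python 3.4+ (where "surrogatepass" works with the UTF-16 codecs), and
utf16_units, naive_contains and naive_endswith are hypothetical
helpers, not anything taken from Jython or CPython.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a tuple of ints."""
    data = s.encode("utf-16-le", errors="surrogatepass")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def naive_contains(haystack, needle):
    """Substring test on raw code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

def naive_endswith(haystack, suffix):
    """Suffix test on raw code units, ignoring surrogate pairing."""
    h, s = utf16_units(haystack), utf16_units(suffix)
    return len(s) <= len(h) and h[len(h) - len(s):] == s

astral = "abc\U00010083xyz"        # U+10083 is stored as D800 DC83
print(naive_contains(astral, "\udc83"))      # True
print(naive_endswith(astral, "\udc83xyz"))   # True
```

Both tests match on the second half of the surrogate pair, which is
exactly the wrong answer being complained about.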

Personally I would document that comparison operations are not
supported on strings containing explicitly smuggled bytes, and leave
it at that.

 > If we think it is ok that code using such methods give the wrong answer 
 > when fed strings containing smuggled bytes, then isolated (trailing) 
 > surrogates could be allowed. It's the user's fault for calling the 
 > method on that data.  But I think it kinder that our implementation 
 > defend users from these wrong answers. In the latest state of Jython, we 
 > do this by rigorously preventing the construction of a PyUnicode 
 > containing a lone surrogate, so we can just use UTF-16 operations 
 > without further checks.

That seems like a reasonable approach.
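
For illustration, a guard of that kind might look roughly like the
following.  This is my own sketch over a sequence of UTF-16 code units
and is not Jython's actual code; has_lone_surrogate and checked_units
are hypothetical names.

```python
def has_lone_surrogate(units):
    """True if the UTF-16 code unit sequence has an unpaired surrogate."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                 # lead surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                            # well-formed pair
                continue
            return True                           # lead without trail
        if 0xDC00 <= u <= 0xDFFF:
            return True                           # trail without lead
        i += 1
    return False

def checked_units(units):
    """Refuse to build a string value from ill-formed UTF-16."""
    if has_lone_surrogate(units):
        raise ValueError("lone surrogate in UTF-16 data")
    return tuple(units)
```

With every string checked once at construction, searches and
comparisons can run on raw code units without re-checking pairing.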