[Python-Dev] Multilingual programming article on the Red Hat Developer blog
ja.py at farowl.co.uk
Sat Sep 13 00:16:30 CEST 2014
It seems like we're off topic here, but to answer all as briefly as possible:
1. Java does not really have a Unicode type, therefore not one that
validates. It has a String type that is a sequence of UTF-16 code units.
There are some String methods and Character methods that deal with code
points represented as int. I can put any 16-bit values I like in a String.
2. With proper accounting for indices, and as long as surrogates appear
in pairs, I believe operations like find or endswith give correct
answers about the Unicode text when applied to the UTF-16 code units.
This is an attractive implementation option, and mostly what we do.
3. I'm fixing some bugs where we get it wrong beyond the BMP, and the
fix involves banning lone surrogates (completely). At present you can't
type them in literals but you can sneak them in from Java.
4. I think (with Antoine) that if Jython supported PEP 383 byte smuggling,
it would have to do it the same way as CPython, because the representation
is visible to Python code. It's not impossible (I think), but it is messy.
Some are strongly against it.
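
To illustrate points 2 and 3, here is a minimal sketch in CPython 3 (not Jython's actual code) of searching over UTF-16 code units and then translating the code-unit index back to a code-point index. The helper names (utf16_units, find_units, unit_index_to_codepoint_index, find_via_utf16) are hypothetical and exist only for this example. The index translation assumes surrogates always come in pairs, which is exactly the assumption a lone surrogate breaks.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    'surrogatepass' lets lone surrogates through, so the bad case
    can be demonstrated as well.
    """
    data = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def find_units(hay, needle):
    """Naive search over code units; returns a code-unit index or -1."""
    n = len(needle)
    for i in range(len(hay) - n + 1):
        if hay[i:i + n] == needle:
            return i
    return -1


def unit_index_to_codepoint_index(units, unit_index):
    """Count the code points before unit_index.

    Assumes every high surrogate is followed by its low surrogate.
    """
    count = 0
    i = 0
    while i < unit_index:
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        count += 1
    return count


def find_via_utf16(text, sub):
    """Emulate str.find by searching the UTF-16 code units."""
    units = utf16_units(text)
    i = find_units(units, utf16_units(sub))
    return -1 if i < 0 else unit_index_to_codepoint_index(units, i)


# Well-formed text beyond the BMP: both searches agree.
text = "a\U0001F600b\U0001F601c"
print(find_via_utf16(text, "c"), text.find("c"))

# A lone low surrogate "matches" the second half of a valid pair
# in the code units, although the code point is not in the string.
print(find_via_utf16("\U00010000", "\udc00"), "\U00010000".find("\udc00"))
```

The first line printed shows the same index twice. The second shows a false match from the code-unit search where str.find reports no match, which is the kind of wrong answer that banning lone surrogates avoids.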
On 12/09/2014 16:37, Jim J. Jewett wrote:
> On September 11, 2014, Jeff Allen wrote:
>> ... "surrogateescape" is an error handler, not a codec.
> True, but I believe that is a CPython implementation detail.
> Other implementations (including jython) should implement the
> surrogateescape API, but I don't think it is important to use the
> same internal representation for the invalid bytes.
>> lone surrogates preclude a naive use of the platform string library
> Invalid input often causes problems. Are you saying that there are
> situations where the platform string library could easily handle
> invalid characters in general, but has a problem with the specific
> case of lone surrogates?
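
For reference, the byte smuggling under discussion can be sketched in CPython 3 as below. This only shows CPython's observable behaviour and says nothing about how Jython would or should implement it. The sample bytes are an assumption chosen for illustration.

```python
# Latin-1 encoded bytes, which are not valid UTF-8 (sample data).
raw = b"caf\xe9"

# The 'surrogateescape' error handler maps each undecodable byte
# 0xNN to the lone low surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")

# The smuggled byte is visible to Python code as a lone surrogate.
smuggled = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
print(smuggled)

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)

# A strict encode rejects the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```

Because the lone surrogate appears in the str itself, any program can observe which code point stands for which byte, so the mapping is part of the visible behaviour rather than purely internal.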