
Paul Moore writes:
> IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform.
> [...]
> They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics.
Unfortunately, I don't think it's all that easy to defend. Absent PEP 393 or a restriction to the characters in the BMP, this is a very expensive change, easily visible to interactive users, let alone performance-hungry applications.

I personally do advocate the "array of code points" definition, but I don't use IronPython or Jython so PEP 393 is as close to heaven as I expect to get. OTOH, I also use Emacsen with Mule, and I have to admit that there is a perceptible performance hit in any large (>1 MB) buffer containing non-ASCII characters vs. pure ASCII (the code unit in Mule is 1 byte).

I expect that if IronPython and Jython really want to retain native, code-unit-based representations, it's going to be painful to conform to an "array of code points" specification. There may need to be a compromise of the form "Implementations SHOULD provide an implementation of str that is both O(1) in indexing and an array of code points. Code that is Unicode-ly correct in Python implementing PEP 393 will need to be ported with some effort to implementations that do not satisfy this requirement, perhaps using different algorithms or extra libraries."
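
To make the cost concrete, here is a minimal sketch of what code-point indexing looks like on top of a UTF-16 code-unit array. It is not how IronPython or Jython actually implement str; the helper names (utf16_units, code_point_len, code_point_at) are hypothetical, and the input is assumed to be well-formed UTF-16. The point is only that, without an auxiliary index, both len and indexing have to scan from the start.

```python
def utf16_units(s):
    """Return s as a list of UTF-16 code units (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def code_point_len(units):
    """Count code points: O(n), since every unit must be inspected.

    Assumes well-formed UTF-16, so each low surrogate closes a pair.
    """
    return sum(1 for u in units if not is_low_surrogate(u))


def code_point_at(units, index):
    """Return the code point at position `index`, scanning from the start: O(n)."""
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    i = 0
    remaining = index
    while i < len(units):
        u = units[i]
        # A surrogate pair occupies two code units but is one code point.
        paired = (
            is_high_surrogate(u)
            and i + 1 < len(units)
            and is_low_surrogate(units[i + 1])
        )
        if remaining == 0:
            if paired:
                return 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return u
        remaining -= 1
        i += 2 if paired else 1
    raise IndexError("code point index out of range")


s = "a\U0001F600b"           # one non-BMP character between two ASCII ones
units = utf16_units(s)
print(len(units))            # number of code units
print(code_point_len(units)) # number of code points
print(hex(code_point_at(units, 1)))
```

A string restricted to the BMP never takes the paired branch, so an implementation could keep O(1) indexing in that case by recording whether any surrogates are present; that is roughly the "restriction to the characters in the BMP" escape hatch mentioned above.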