[Python-Dev] Unicode exception indexing

Nov. 3, 2011

      There is a backwards compatibility issue with PEP 393 and Unicode exceptions:
the start and end indices: are they Py_UNICODE indices, or code point indices?

On the one hand, these indices are used in formatting error messages such as
"codec can't encode character \u%04x in position %d", suggesting they  
are regular
indices into the string (counting code points).

On the other hand, they are used by error handlers to lookup the character,
and existing error handlers (including the ones we have now) use
PyUnicode_AsUnicode to find the character. This suggests that the indices
should be Py_UNICODE indices, for compatibility (and they currently do
work in this way).

The indices can only be different if the string is an UCS-4 string, and
Py_UNICODE is a two-byte type (i.e. on Windows).

So what should it be?

As a compromise, it would be possible to convert between these indices,
by counting the non-BMP characters that precede the index if the indices
might differ. That would be expensive to compute, but provide backwards
compatibility to the C API. It's less clear what backwards compatibility
to Python code would require - most likely, people would use the indices
for slicing operations (rather than performing an UTF-16 conversion and
performing indexing on that).

Regards,
Martin

[Python-Dev] Unicode exception indexing

martin＠v.loewis.de