[Python-Dev] Unicode exception indexing

Thu Nov 3 20:16:21 CET 2011

On Thursday 3 November 2011 18:14:42, martin at v.loewis.de wrote:
> There is a backwards compatibility issue with PEP 393 and Unicode
> exceptions: the start and end indices: are they Py_UNICODE indices, or
> code point indices?

Uh oh. That's exactly why I didn't want to start working on this issue:
http://bugs.python.org/issue13064

In a Python error handler, exc.object[exc.start:exc.end] should be used to get 
the unencodable/undecodable substring.
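
For illustration, here is a minimal sketch of such a Python handler. The handler
name "xml_char_ref_sketch" and the function xml_char_ref are hypothetical and
not taken from any real project; only codecs.register_error and the exception
attributes (object, start, end) are real API. It assumes Python 3.3 or later,
where the indices are code point indices.

```python
import codecs


def xml_char_ref(exc):
    """Hypothetical handler: replace unencodable characters with &#N; references."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    # The slice below is the unencodable substring.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding should resume.
    return replacement, exc.end


# "xml_char_ref_sketch" is a made-up name used only for this example.
codecs.register_error("xml_char_ref_sketch", xml_char_ref)

encoded = "caf\u00e9 \u20ac".encode("ascii", "xml_char_ref_sketch")
non_bmp = "a\U0001F600b".encode("ascii", "xml_char_ref_sketch")
```

Because the handler only slices exc.object and returns exc.end, it never has to
know how the string is stored internally.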

In a C error handler, it depends on whether you use a Py_UNICODE* pointer or 
PyUnicode_Substring() / PyUnicode_READ.

Using google.fr/codesearch, I found some user error handlers implemented in 
Python:
 * straw: "html_replace"
 * Nuxeo: "latin9_fallback"
 * peerscape: "htmlentityescape"
 * pymt: "cssescape"
 * ....

I found no error handler implemented in C (no call to PyCodec_RegisterError).

> So what should it be?

I suggest using code point indices. Code point indices are also more 
"natural" now with PEP 393.

Because it is an incompatible change, it should be documented in the PEP and 
in the "What's new in Python 3.3" document.

> As a compromise, it would be possible to convert between these indices,
> by counting the non-BMP characters that precede the index if the indices
> might differ.

I started such a hack for the UTF-8 codec... It is really tricky, we should not 
do that!

> That would be expensive to compute

Yeah, O(n) should be avoided whenever possible.
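
To make the cost concrete, here is a rough sketch of the conversion described
in the quoted proposal, written in Python rather than C. Both function names
are hypothetical, and the sketch assumes that the "Py_UNICODE indices" are
UTF-16 code unit indices, as on a narrow build. Each call has to walk the
string up to the index, which is the O(n) part.

```python
def code_point_index_to_utf16_index(text, cp_index):
    """Hypothetical helper: code point index -> UTF-16 code unit index."""
    # Every non-BMP character before the index takes two code units.
    non_bmp = sum(1 for ch in text[:cp_index] if ord(ch) > 0xFFFF)
    return cp_index + non_bmp


def utf16_index_to_code_point_index(text, utf16_index):
    """Hypothetical helper: UTF-16 code unit index -> code point index.

    An index that falls in the middle of a surrogate pair is rounded up
    to the next code point.
    """
    units = 0
    for cp_index, ch in enumerate(text):
        if units >= utf16_index:
            return cp_index
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


text = "a\U0001F600b"  # one non-BMP character between two ASCII letters
utf16_pos = code_point_index_to_utf16_index(text, 2)
cp_pos = utf16_index_to_code_point_index(text, 3)
```

The two indices only differ after the first non-BMP character, so a real
implementation could skip the scan for strings that contain none, but the
worst case stays linear.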

--

FYI I implemented a proof-of-concept in Python of the surrogateescape error 
handler for Python 2 (for Mercurial):
https://bitbucket.org/haypo/misc/src/tip/python/surrogateescape.py

Victor