On Thursday 3 November 2011 18:14:42, martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
Oh oh. That's exactly why I didn't want to start working on this issue.

http://bugs.python.org/issue13064

In a Python error handler, exc.object[exc.start:exc.end] should be used to get the unencodable/undecodable substring. In a C error handler, it depends on whether you use a Py_UNICODE* pointer or PyUnicode_Substring() / PyUnicode_READ.

Using google.fr/codesearch, I found some user error handlers implemented in Python:

 * straw: "html_replace"
 * Nuxeo: "latin9_fallback"
 * peerscape: "htmlentityescape"
 * pymt: "cssescape"
 * ....

I found no error handler implemented in C (no call to PyCodec_RegisterError at all).
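To make the Python case concrete, here is a minimal sketch of an error handler that relies on exc.object[exc.start:exc.end]. It is not the code of any of the projects listed above: the handler name "html_replace_sketch" and the choice of HTML numeric entities as replacement are assumptions made for illustration only.

```python
import codecs


def html_replace_sketch(exc):
    """Hypothetical handler: replace unencodable characters with
    HTML numeric character references."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    # The slice below is where the meaning of start/end matters:
    # it must select exactly the characters the codec rejected.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding resumes.
    return replacement, exc.end


# The registered name is an assumption, chosen to avoid clashing
# with real handler names.
codecs.register_error("html_replace_sketch", html_replace_sketch)

encoded = "caf\u00e9 \u20ac".encode("ascii", "html_replace_sketch")
print(encoded)
```

A handler written this way keeps working as long as exc.start and exc.end index exc.object consistently, which is why the choice of index unit is visible to user code.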
So what should it be?
I suggest using code point indices. Code point indices are also more "natural" now with PEP 393. Because it is an incompatible change, it should be documented in the PEP and in the "What's new in Python 3.3" document.
As a compromise, it would be possible to convert between these indices, by counting the non-BMP characters that precede the index if the indices might differ.
I started such a hack for the UTF-8 codec... It is really tricky, we should not do that!
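For reference, the conversion described in the compromise can be sketched in pure Python as below. This is only an illustration of the idea of counting non-BMP characters, not the UTF-8 codec hack mentioned above; both function names are hypothetical, and it assumes a Python 3.3+ str where each item is one code point.

```python
def utf16_index_to_codepoint_index(text, utf16_index):
    """Convert an index counted in UTF-16 code units (the old narrow-build
    Py_UNICODE view) into a code point index in text.

    An index pointing into the middle of a surrogate pair is rounded up
    to the next code point.
    """
    units = 0
    for cp_index, ch in enumerate(text):
        if units >= utf16_index:
            return cp_index
        # A non-BMP character takes two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


def codepoint_index_to_utf16_index(text, cp_index):
    """Convert a code point index into a UTF-16 code unit index by adding
    one extra unit for each non-BMP character before it."""
    return cp_index + sum(1 for ch in text[:cp_index] if ord(ch) > 0xFFFF)


sample = "a\U0001F600b"  # 3 code points, 4 UTF-16 code units
print(utf16_index_to_codepoint_index(sample, 3))
print(codepoint_index_to_utf16_index(sample, 2))
```

Both directions have to scan the string up to the index, so each conversion is linear in the position of the error.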
That would be expensive to compute.
Yeah, O(n) should be avoided when it is possible.

--

FYI I implemented a proof-of-concept in Python of the surrogateescape error handler for Python 2 (for Mercurial):
https://bitbucket.org/haypo/misc/src/tip/python/surrogateescape.py

Victor