[Python-Dev] Unicode exception indexing
victor.stinner at haypocalc.com
Thu Nov 3 20:16:21 CET 2011
On Thursday 3 November 2011 18:14:42, martin at v.loewis.de wrote:
> There is a backwards compatibility issue with PEP 393 and Unicode
> exceptions: the start and end indices: are they Py_UNICODE indices, or
> code point indices?
Oh oh. That's exactly why I didn't want to start to work on this issue.
In a Python error handler, exc.object[exc.start:exc.end] should be used to get
the unencodable/undecodable substring.
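
To illustrate the slicing above, here is a minimal sketch of a Python error handler in the spirit of the "html_replace" handlers listed below. The handler name "html_replace_sketch" and the function are made up for this example, and it assumes Python 3.3 or later, where str indices are code point indices:

```python
import codecs

def html_replace(exc):
    # Hypothetical handler: only encoding errors are handled in this sketch.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # The unencodable substring, as described above.
    bad = exc.object[exc.start:exc.end]
    # Replace each character with a numeric HTML character reference.
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding should resume.
    return replacement, exc.end

# "html_replace_sketch" is an arbitrary name chosen for this example.
codecs.register_error("html_replace_sketch", html_replace)

encoded = "caf\u00e9 \u20ac".encode("ascii", "html_replace_sketch")
```

With code point indices, the slice always covers whole characters, including non-BMP ones, so the handler never sees half of a surrogate pair.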
In a C error handler, it depends on whether you use a Py_UNICODE* pointer or
PyUnicode_Substring() / PyUnicode_READ.
Using google.fr/codesearch, I found some user error handlers implemented in
Python:
* straw: "html_replace"
* Nuxeo: "latin9_fallback"
* peerscape: "htmlentityescape"
* pymt: "cssescape"
I found no error handler implemented in C (no call to PyCodec_RegisterError).
> So what should it be?
I suggest using code point indices. Code point indices are also more
"natural" now with PEP 393.
Because it is an incompatible change, it should be documented in the PEP and
in the "What's new in Python 3.3" document.
> As a compromise, it would be possible to convert between these indices,
> by counting the non-BMP characters that precede the index if the indices
> might differ.
I started such a hack for the UTF-8 codec... It is really tricky, we should not
> That would be expensive to compute
Yeah, O(n) should be avoided whenever possible.
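
To make the cost concrete, here is a rough sketch of the conversion discussed above, written in Python rather than C. The function names are hypothetical, and it assumes Python 3.3 or later (str indexed by code point). Both directions have to scan the text before the index, hence the O(n):

```python
def codepoint_to_utf16_index(text, index):
    # Hypothetical helper: each non-BMP character (code point > 0xFFFF)
    # before `index` takes two UTF-16 code units instead of one.
    extra = sum(1 for ch in text[:index] if ord(ch) > 0xFFFF)
    return index + extra

def utf16_to_codepoint_index(text, index):
    # Hypothetical helper: walk the string, counting UTF-16 code units,
    # until at least `index` units have been consumed.
    units = 0
    for i, ch in enumerate(text):
        if units >= index:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

text = "a\U0001F600b"
utf16_pos = codepoint_to_utf16_index(text, 2)   # position of "b" in UTF-16 units
cp_pos = utf16_to_codepoint_index(text, utf16_pos)
```

For strings without non-BMP characters the two indices are identical, which is the case where the scan could be skipped.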
FYI I implemented a proof-of-concept in Python of the surrogateescape error
handler for Python 2 (for Mercurial):