Unicode exception indexing

There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).
On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).
The indices can only be different if the string is a UCS-4 string, and Py_UNICODE is a two-byte type (i.e. on Windows).
So what should it be?
As a compromise, it would be possible to convert between these indices, by counting the non-BMP characters that precede the index if the indices might differ. That would be expensive to compute, but would provide backwards compatibility for the C API. It's less clear what backwards compatibility for Python code would require - most likely, people would use the indices for slicing operations (rather than performing a UTF-16 conversion and indexing into that).
Regards, Martin
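To make the difference between the two kinds of index concrete, here is a minimal Python sketch of the conversion described above: counting the non-BMP characters that precede a code point index gives the matching UTF-16 code unit index. The helper name code_point_index_to_utf16_index is hypothetical and is not part of CPython; the sample string is made up for illustration.

```python
def code_point_index_to_utf16_index(text, index):
    """Return the UTF-16 code unit offset matching a code point index.

    Hypothetical helper for illustration only; not a CPython API.
    """
    # Every non-BMP character (above U+FFFF) before `index` occupies two
    # UTF-16 code units (a surrogate pair) instead of one.
    non_bmp_before = sum(1 for ch in text[:index] if ord(ch) > 0xFFFF)
    return index + non_bmp_before


text = "x\U0001F600y\u20ac"          # 4 code points, 5 UTF-16 code units
cp_index = text.index("\u20ac")      # code point index of the euro sign
utf16_index = code_point_index_to_utf16_index(text, cp_index)
print(cp_index, utf16_index)         # prints: 3 4
```

The scan over text[:index] is what makes the conversion O(n) in the position of the error, which is the cost discussed in this thread.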

On Thursday, 3 November 2011 18:14:42, martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
Oh oh. That's exactly why I didn't want to start working on this issue: http://bugs.python.org/issue13064
In a Python error handler, exc.object[exc.start:exc.end] should be used to get the unencodable/undecodable substring. In a C error handler, it depends on whether you use a Py_UNICODE* pointer or PyUnicode_Substring() / PyUnicode_READ.
Using google.fr/codesearch, I found some user error handlers implemented in Python:
* straw: "html_replace"
* Nuxeo: "latin9_fallback"
* peerscape: "htmlentityescape"
* pymt: "cssescape"
* ....
I found no error handler implemented in C (not a single call to PyCodec_RegisterError).
So what should it be?
I suggest using code point indices. Code point indices are also more "natural" now with PEP 393. Because it is an incompatible change, it should be documented in the PEP and in the "What's new in Python 3.3" document.
As a compromise, it would be possible to convert between these indices, by counting the non-BMP characters that precede the index if the indices might differ.
I started such a hack for the UTF-8 codec... It is really tricky; we should not do that!
That would be expensive to compute
Yeah, O(n) should be avoided whenever possible.
--
FYI, I implemented a proof-of-concept in Python of the surrogateescape error handler for Python 2 (for Mercurial): https://bitbucket.org/haypo/misc/src/tip/python/surrogateescape.py
Victor
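For readers unfamiliar with Python-level error handlers, the following is a minimal sketch of the exc.object[exc.start:exc.end] pattern mentioned in this message. The handler name "html_replace_sketch" is made up for this example; it is only loosely modelled on the handler names listed above and is not the code of any of those projects. It assumes Python 3.3 or later, where start and end are code point indices.

```python
import codecs


def html_replace_sketch(exc):
    """Replace unencodable characters with numeric character references."""
    # This sketch only handles encoding errors; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Slice the original string with the indices stored on the exception.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end


codecs.register_error("html_replace_sketch", html_replace_sketch)

encoded = "caf\u00e9 \U0001F600".encode("ascii", "html_replace_sketch")
print(encoded)  # prints: b'caf&#233; &#128512;'
```

Because the handler only slices exc.object, it works unchanged when the indices count code points, which is the case the Python-level handlers found above rely on.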

On 11/3/2011 3:16 PM, Victor Stinner wrote:
On Thursday, 3 November 2011 18:14:42, martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
I had the impression that we were abolishing the wide versus narrow build difference and that this issue would disappear. I must have missed something.
So what should it be?
I suggest using code point indices. Code point indices are also more "natural" now with PEP 393.
I think we should look forward, not backwards. Error messages are defined as undefined ;-). So I think we should do what is right for the new implementation. I suspect that means that I am agreeing with both Victor and Antoine.
Because it is an incompatible change, it should be documented in the PEP and in the "What's new in Python 3.3" document. ... Yeah, O(n) should be avoided whenever possible.
Definitely to both. -- Terry Jan Reedy

On 03.11.2011 22:19, Terry Reedy wrote:
On 11/3/2011 3:16 PM, Victor Stinner wrote:
On Thursday, 3 November 2011 18:14:42, martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
I had the impression that we were abolishing the wide versus narrow build difference and that this issue would disappear. I must have missed something.
Most certainly. The Py_UNICODE type continues to exist for backwards compatibility. It is now always a typedef for wchar_t, which makes it a 16-bit type on Windows. Regards, Martin

On 11/3/2011 5:43 PM, "Martin v. Löwis" wrote:
I had the impression that we were abolishing the wide versus narrow build difference and that this issue would disappear. I must have missed something.
Most certainly. The Py_UNICODE type continues to exist for backwards compatibility. It is now always a typedef for wchar_t, which makes it a 16-bit type on Windows.
Thank you for answering. My revised impression now is that any string I create with Python code in Python 3.3+ (as distributed, without extensions or ctypes calls) will use the new implementation and will index and slice correctly, even with extended chars. So indexing is only an issue for those writing or using C-coded extensions with the old unicode C-API on systems with a 16-bit wchar_t. Correct?
--- Terry Jan Reedy
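The Python-level behaviour described in this message can be checked with a short snippet. This is a minimal sketch that assumes a Python 3.3 or later interpreter; the sample string is made up for illustration.

```python
s = "a\U0001F600b"     # the middle character lies outside the BMP

length = len(s)        # counts code points, not UTF-16 code units
middle = s[1]          # indexing returns the whole non-BMP character
tail = s[2:]           # slicing never splits a character in two

print(length, hex(ord(middle)), tail)  # prints: 3 0x1f600 b
```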

I started such a hack for the UTF-8 codec... It is really tricky; we should not do that!
With the proper encapsulation, it's not that tricky. I have written functions PyUnicode_IndexToWCharIndex and PyUnicode_WCharIndexToIndex, and PyUnicodeEncodeError_GetStart and friends would use those functions. I'd also need new functions like PyUnicodeEncodeError_GetStartIndex to access the "true" start field.
That would be expensive to compute
Yeah, O(n) should be avoided whenever possible.
Ok. I'll wait half a day or so for people to reconsider (now knowing that it's actually feasible to be fully backwards compatible); if nobody speaks up, I'll go ahead and accept the breakage.
Regards, Martin
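The reverse mapping that a function such as PyUnicode_WCharIndexToIndex would have to perform can be sketched in Python as follows. This is only an illustration of the idea under stated assumptions: wchar_index_to_index is a hypothetical Python stand-in, not the C code from the patch, and a 16-bit wchar_t is modelled with UTF-16 code units.

```python
def wchar_index_to_index(text, wchar_index):
    """Map a UTF-16 code unit index back to a code point index.

    Hypothetical stand-in for illustration; not a CPython API.
    An index that points into the middle of a surrogate pair is
    rounded up to the next code point.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= wchar_index:
            return index
        # Non-BMP characters take two code units, all others take one.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


text = "x\U0001F600y\u20ac"
round_trip = []
for i in range(len(text) + 1):
    # Reference UTF-16 offset of code point index i, via a real encoding.
    wchar_index = len(text[:i].encode("utf-16-le")) // 2
    round_trip.append((i, wchar_index, wchar_index_to_index(text, wchar_index)))
print(round_trip)
```

Both directions need a linear scan of the string, so the compatibility layer would only be cheap when the string contains no non-BMP characters and the scan can be skipped.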

Your approach (doing the right thing for both Python and C, new API to avoid the C performance problem) sounds good to me.
-- Nick Coghlan (via Gmail on Android, so likely to be more terse than usual)
On Nov 4, 2011 7:58 AM, Martin v. Löwis <martin@v.loewis.de> wrote:
I started such a hack for the UTF-8 codec... It is really tricky; we should not do that!
With the proper encapsulation, it's not that tricky. I have written functions PyUnicode_IndexToWCharIndex and PyUnicode_WCharIndexToIndex, and PyUnicodeEncodeError_GetStart and friends would use those functions. I'd also need new functions like PyUnicodeEncodeError_GetStartIndex to access the "true" start field.
That would be expensive to compute
Yeah, O(n) should be avoided whenever possible.
Ok. I'll wait half a day or so for people to reconsider (now knowing that it's actually feasible to be fully backwards compatible); if nobody speaks up, I'll go ahead and accept the breakage.
Regards, Martin

On Thu, 03 Nov 2011 18:14:42 +0100 martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).
On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).
But what about error handlers written in Python?
The indices can only be different if the string is a UCS-4 string, and Py_UNICODE is a two-byte type (i.e. on Windows).
So what should it be?
I'd say let's do the Right Thing and accept the small compatibility breach (surrogates on UCS-2 builds). Regards Antoine.

On Thu, Nov 3, 2011 at 12:29 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Thu, 03 Nov 2011 18:14:42 +0100 martin@v.loewis.de wrote:
There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?
On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).
On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).
But what about error handlers written in Python?
The indices can only be different if the string is a UCS-4 string, and Py_UNICODE is a two-byte type (i.e. on Windows).
So what should it be?
I'd say let's do the Right Thing and accept the small compatibility breach (surrogates on UCS-2 builds).
+1
--
--Guido van Rossum (python.org/~guido)

On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).
On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).
But what about error handlers written in Python?
I'm working on a patch where a C error handler using PyUnicodeEncodeError_GetStart gets a different value than a Python error handler accessing .start. The _GetStart/_GetEnd functions would take the value from the exception object, and adjust it before returning it. The implementation is fairly straightforward, just a little expensive (in the case of non-BMP strings on Windows).
Regards, Martin

On Thu, 03 Nov 2011 22:47:00 +0100 "Martin v. Löwis" <martin@v.loewis.de> wrote:
On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).
On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).
But what about error handlers written in Python?
I'm working on a patch where a C error handler using PyUnicodeEncodeError_GetStart gets a different value than a Python error handler accessing .start. The _GetStart/_GetEnd functions would take the value from the exception object, and adjust it before returning it.
Is it worth the hassle? We can just port our existing error handlers, and I guess the few third-party error handlers written in C (if any) can bear the transition. Regards Antoine.

Is it worth the hassle? We can just port our existing error handlers, and I guess the few third-party error handlers written in C (if any) can bear the transition.
That was my question exactly. As the author of PEP 393, I was leaning towards full backwards compatibility, but you, Victor, and Guido tell me not to worry - so I won't :-) Regards, Martin

"Martin v. Löwis", 04.11.2011 08:39:
Is it worth the hassle? We can just port our existing error handlers, and I guess the few third-party error handlers written in C (if any) can bear the transition.
That was my question exactly. As the author of PEP 393, I was leaning towards full backwards compatibility, but you, Victor, and Guido tell me not to worry - so I won't :-)
+1, FWIW. Stefan

On 11/4/2011 3:39 AM, "Martin v. Löwis" wrote:
Is it worth the hassle? We can just port our existing error handlers, and I guess the few third-party error handlers written in C (if any) can bear the transition.
That was my question exactly. As the author of PEP 393, I was leaning towards full backwards compatibility, but you, Victor, and Guido tell me not to worry - so I won't :-)
While we need to keep the old API, I do not think we need to encourage its continued use by actively supporting it with new code. When 3.3 comes out, I think it should be socially OK to write C code only for 3.3+ by only using the new API.
-- Terry Jan Reedy
participants (8)
- "Martin v. Löwis"
- Antoine Pitrou
- Guido van Rossum
- martin@v.loewis.de
- Nick Coghlan
- Stefan Behnel
- Terry Reedy
- Victor Stinner