[Python-Dev] Unicode exception indexing

Thu Nov 3 22:09:37 CET 2011

On Thu, Nov 3, 2011 at 12:29 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 03 Nov 2011 18:14:42 +0100
> martin at v.loewis.de wrote:
>> There is a backwards compatibility issue with PEP 393 and Unicode exceptions:
>> the start and end indices: are they Py_UNICODE indices, or code point indices?
>>
>> On the one hand, these indices are used in formatting error messages such as
>> "codec can't encode character \u%04x in position %d", suggesting they are
>> regular indices into the string (counting code points).
>>
>> On the other hand, they are used by error handlers to look up the character,
>> and existing error handlers (including the ones we have now) use
>> PyUnicode_AsUnicode to find the character. This suggests that the indices
>> should be Py_UNICODE indices, for compatibility (and they currently do
>> work in this way).
>
> But what about error handlers written in Python?
>
>> The indices can only be different if the string is a UCS-4 string, and
>> Py_UNICODE is a two-byte type (i.e. on Windows).
>>
>> So what should it be?
>
> I'd say let's do the Right Thing and accept the small compatibility
> breach (surrogates on UCS-2 builds).

+1

-- 
--Guido van Rossum (python.org/~guido)
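
To make the two index conventions in the thread concrete, here is a minimal
sketch in Python. It assumes Python 3.3 or later (where PEP 393 is in effect
and the exception's start/end count code points), and it stands in for
"Py_UNICODE indices on a two-byte build" by counting UTF-16 code units. The
helper name utf16_offset is hypothetical, not part of any Python API.

```python
def utf16_offset(text, cp_index):
    """Return how many UTF-16 code units precede code point index cp_index.

    Hypothetical helper: it approximates what an index would be if the
    string were stored as two-byte Py_UNICODE units.
    """
    return len(text[:cp_index].encode("utf-16-le")) // 2


# One ASCII character followed by one non-BMP character (U+1F600),
# which needs a surrogate pair in UTF-16.
text = "a\U0001F600"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Indices as reported by the exception (code points under PEP 393).
    cp_span = (exc.start, exc.end)
    # The same span expressed in UTF-16 code units.
    utf16_span = (utf16_offset(text, exc.start), utf16_offset(text, exc.end))

print("code point span:", cp_span)
print("UTF-16 code unit span:", utf16_span)
```

The two spans agree on the start but not on the end, because the failing
character occupies one code point but two UTF-16 code units. That is the
only situation in which the conventions diverge: a string containing
non-BMP characters, viewed through a two-byte Py_UNICODE.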