[Python-ideas] Add "has_surrogates" flags to string object
Serhiy Storchaka
storchaka at gmail.com
Tue Oct 8 13:43:51 CEST 2013
08.10.13 14:38, Masklinn wrote:
> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote:
>
>> Here is an idea: add a mark to the PyUnicode object that allows a fast answer to the question of whether a string contains surrogate code points. This mark has one of three possible states:
>>
>> * String doesn't contain surrogates.
>> * String contains surrogates.
>> * It is still unknown.
>>
>> We can combine this with the "is_ascii" flag in a 2-bit value:
>>
>> * String is ASCII-only (and doesn't contain surrogates).
>> * String is not ASCII-only and doesn't contain surrogates.
>> * String is not ASCII-only and contains surrogates.
>> * String is not ASCII-only and it is still unknown if it contains surrogates.
>
> Isn't that redundant with the kind under shortest form representation?
No, it isn't redundant. '\udc80' is a UCS2 string with a surrogate code
point, and '\udc80\U00010000' is a UCS4 string with a surrogate code
point, so the kind alone doesn't tell you whether surrogates are present.
A UCS2 string without surrogate code points can be encoded to UTF-16 by
memcpy().
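
To illustrate the point, here is a rough pure-Python sketch. The helper
names (storage_kind, has_surrogates, can_encode_utf16) are made up for
this example and are not part of CPython; storage_kind only approximates
the internal kind by looking at the largest code point, and
has_surrogates is the linear scan that the proposed flag would let us
skip.

```python
def storage_kind(s):
    """Approximate the internal kind from the largest code point (hypothetical helper)."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 'UCS1'
    if top <= 0xFFFF:
        return 'UCS2'
    return 'UCS4'


def has_surrogates(s):
    """Linear scan for surrogate code points U+D800..U+DFFF (hypothetical helper)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def can_encode_utf16(s):
    """True if the strict UTF-16 encoder accepts the string."""
    try:
        s.encode('utf-16-le')
    except UnicodeEncodeError:
        return False
    return True


# Same kind, different answers: the kind does not determine surrogate presence.
for s in ['\u20ac', '\udc80', '\U00010000', '\udc80\U00010000']:
    print(ascii(s), storage_kind(s), has_surrogates(s), can_encode_utf16(s))
```

Both UCS2 samples have the same kind, but only the one without surrogates
can be encoded strictly; for that one the UTF-16 code units are the same
values as the stored code points, which is why a plain copy would do.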