[Python-ideas] Add "has_surrogates" flags to string object

Serhiy Storchaka storchaka at gmail.com
Tue Oct 8 13:43:51 CEST 2013


08.10.13 14:38, Masklinn wrote:
> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote:
>
>> Here is an idea about adding a mark to the PyUnicode object that allows a fast answer to the question of whether a string contains surrogate code points. This mark has one of three possible states:
>>
>> * String doesn't contain surrogates.
>> * String contains surrogates.
>> * It is still unknown.
>>
>> We can combine this with the "is_ascii" flag in a 2-bit value:
>>
>> * String is ASCII-only (and doesn't contain surrogates).
>> * String is not ASCII-only and doesn't contain surrogates.
>> * String is not ASCII-only and contains surrogates.
>> * String is not ASCII-only and it is still unknown if it contains surrogates.
>
> Isn't that redundant with the kind under shortest form representation?

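No, it isn't redundant. '\udc80' is a UCS2 string that contains a
surrogate code point, and '\udc80\U00010000' is a UCS4 string that
contains a surrogate code point, so the kind alone doesn't tell you
whether surrogates are present. A UCS2 string without surrogate code
points can be encoded to UTF-16 (in native byte order) by memcpy().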
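
To illustrate the distinction, here is a rough pure-Python sketch of the
two independent properties (kind and presence of surrogates) and of the
proposed 2-bit state. The names StrFlag, kind(), has_surrogates() and
classify() are made up for this example and are not part of CPython; the
real flag would live in the C struct and could also be left "unknown"
until someone asks.

```python
from enum import IntEnum


class StrFlag(IntEnum):
    """Hypothetical 2-bit state combining is_ascii and has_surrogates."""
    ASCII = 0            # ASCII-only, hence no surrogates
    NO_SURROGATES = 1    # not ASCII-only, no surrogates
    HAS_SURROGATES = 2   # not ASCII-only, contains surrogates
    UNKNOWN = 3          # not ASCII-only, not scanned yet


def kind(s):
    """Bytes per code point in the shortest-form representation."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def has_surrogates(s):
    """Linear scan for code points in the range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def classify(s):
    """Compute the flag eagerly; a lazy version would start at UNKNOWN."""
    if all(ord(ch) < 0x80 for ch in s):
        return StrFlag.ASCII
    if has_surrogates(s):
        return StrFlag.HAS_SURROGATES
    return StrFlag.NO_SURROGATES


if __name__ == "__main__":
    for s in ("abc", "\u20ac", "\udc80", "\udc80\U00010000"):
        print(ascii(s), kind(s), classify(s).name)
```

Both '\udc80' (kind 2) and '\udc80\U00010000' (kind 4) contain a
surrogate, while '\u20ac' (kind 2) does not, so the kind cannot stand in
for the flag. Only in the last case is a straight copy of the 2-byte
buffer valid UTF-16; the strict 'utf-16-le' codec rejects the strings
with lone surrogates.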


