[Python-ideas] Add "has_surrogates" flags to string object

Antoine Pitrou solipsis at pitrou.net
Tue Oct 8 13:43:43 CEST 2013


On Tue, 08 Oct 2013 14:17:59 +0300,
Serhiy Storchaka <storchaka at gmail.com> wrote:
> Here is an idea about adding a mark to the PyUnicode object that allows
> a fast answer to the question of whether a string contains surrogate
> code points. This mark has one of three possible states:
> 
> * String doesn't contain surrogates.
> * String contains surrogates.
> * It is still unknown.
> 
> We can combine this with the "is_ascii" flag in a 2-bit value:
> 
> * String is ASCII-only (and doesn't contain surrogates).
> * String is not ASCII-only and doesn't contain surrogates.
> * String is not ASCII-only and contains surrogates.
> * String is not ASCII-only and it is still unknown whether it contains
> surrogates.
> 
> By default a string is created in the "unknown" state (if it is UCS2 or
> UCS4). After the first request it can be switched to "has surrogates" or
> "has no surrogates". The state of the result of concatenation or slicing
> can be determined from the states of the input strings.
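
As a rough illustration of the proposal quoted above, here is a minimal
pure-Python sketch of the combined 2-bit state with lazy resolution. The
names SurrogateState, FlaggedStr and contains_surrogates, and the numeric
values of the states, are hypothetical; the real change would live in the
C-level PyUnicode object, not in a Python wrapper class.

```python
from enum import IntEnum


class SurrogateState(IntEnum):
    # Hypothetical 2-bit encoding; the numeric values are illustrative only.
    ASCII = 0            # ASCII-only, therefore no surrogates
    NO_SURROGATES = 1    # not ASCII-only, known to have no surrogates
    HAS_SURROGATES = 2   # not ASCII-only, known to contain surrogates
    UNKNOWN = 3          # not ASCII-only, not scanned yet


def contains_surrogates(s: str) -> bool:
    # Surrogate code points occupy the range U+D800..U+DFFF.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


class FlaggedStr:
    """Toy stand-in for a string object carrying the proposed mark."""

    def __init__(self, value: str):
        self.value = value
        # Non-ASCII strings start in the "unknown" state.
        if value.isascii():
            self.state = SurrogateState.ASCII
        else:
            self.state = SurrogateState.UNKNOWN

    def has_surrogates(self) -> bool:
        # The first request scans the string once and caches the answer.
        if self.state is SurrogateState.UNKNOWN:
            if contains_surrogates(self.value):
                self.state = SurrogateState.HAS_SURROGATES
            else:
                self.state = SurrogateState.NO_SURROGATES
        return self.state is SurrogateState.HAS_SURROGATES
```

Later calls to has_surrogates() on the same object return the cached
state without rescanning the string.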

Not true for slicing (you can take a surrogate-free slice of a string
that contains surrogates). Other than that, this sounds reasonable to me,
provided that the patch isn't too complex and the perf improvements are
worth it.
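
To make the slicing objection concrete, here is a small Python sketch of
how the state could propagate. It models the state as True (has
surrogates), False (no surrogates) or None (unknown); the function names
concat_state and slice_state are hypothetical and only illustrate the
reasoning, not an actual patch.

```python
from typing import Optional


def has_surrogates(s: str) -> bool:
    # Surrogate code points occupy the range U+D800..U+DFFF.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def concat_state(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    # Concatenation: one known surrogate on either side settles the result.
    if a is True or b is True:
        return True
    if a is False and b is False:
        return False
    return None  # at least one side is unknown and neither is known True


def slice_state(parent: Optional[bool]) -> Optional[bool]:
    # Slicing: only "no surrogates" carries over to the slice.
    if parent is False:
        return False
    # The parent has surrogates or is unknown: the slice may or may not
    # include them, so its state has to be treated as unknown.
    return None


s = "ab\udc80cd"
assert has_surrogates(s)            # the whole string contains a surrogate
assert not has_surrogates(s[:2])    # but this slice of it does not
```

So a slice can inherit "no surrogates" from its parent, but in every
other case it would have to fall back to "unknown".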

Regards

Antoine.



