[Python-ideas] Add "has_surrogates" flags to string object
Victor Stinner
victor.stinner at gmail.com
Fri Oct 11 14:12:37 CEST 2013
2013/10/8 Serhiy Storchaka <storchaka at gmail.com>:
> Here is an idea: add a mark to the PyUnicode object that gives a fast
> answer to the question of whether a string contains surrogate code points.
> The mark has one of three possible states:
>
> * String doesn't contain surrogates.
> * String contains surrogates.
> * It is still unknown.
>
> We can combine this with the "is_ascii" flag in a 2-bit value:
>
> * String is ASCII-only (and doesn't contain surrogates).
> * String is not ASCII-only and doesn't contain surrogates.
> * String is not ASCII-only and contains surrogates.
> * String is not ASCII-only and it is still unknown whether it contains
>   surrogates.
>
> By default a string is created in the "unknown" state (if it is UCS2 or
> UCS4). After the first request it can be switched to "has surrogates" or
> "has no surrogates". The state of the result of concatenating or slicing
> can be determined from the states of the input strings.
>
> This will allow faster UTF-16 and UTF-32 encoding (and perhaps even slightly
> faster UTF-8 encoding) and faster conversion to wchar_t* when the string has
> no surrogates (which is true in most cases).
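A minimal sketch of the state propagation described in the quoted proposal, written in pure Python rather than as a C-level flag on the PyUnicode object. The names SurrogateState, scan, concat_state and slice_state are hypothetical and exist only for this illustration; the rules for concatenation and slicing are one reasonable reading of "can be determined from states of input strings", not a confirmed design.

```python
from enum import Enum


class SurrogateState(Enum):
    """Hypothetical three-state mark (names invented for this sketch)."""
    NO = "no surrogates"
    YES = "has surrogates"
    UNKNOWN = "unknown"


def scan(text):
    """Resolve the state with one O(n) pass over the code points."""
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in text):
        return SurrogateState.YES
    return SurrogateState.NO


def concat_state(left, right):
    """State of left + right, derived from the operand states alone."""
    if SurrogateState.YES in (left, right):
        return SurrogateState.YES
    if left is SurrogateState.NO and right is SurrogateState.NO:
        return SurrogateState.NO
    return SurrogateState.UNKNOWN


def slice_state(source):
    """State of any slice of a string whose state is `source`."""
    # A slice of a surrogate-free string is surrogate-free. Otherwise the
    # slice may or may not include a surrogate, so it stays unknown.
    if source is SurrogateState.NO:
        return SurrogateState.NO
    return SurrogateState.UNKNOWN
```

Under these assumed rules, only the "no surrogates" state survives slicing, and a single known "has surrogates" operand is enough to settle the state of a concatenation without rescanning.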
Knowing whether a string contains any surrogate characters would also
speed up the marshal and pickle modules:
http://bugs.python.org/issue19219#msg199465
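For illustration only, and not taken from the linked issue: the sketch below shows why a serializer that writes strings as UTF-8 cares about surrogates. The strict UTF-8 codec rejects lone surrogates, while the "surrogatepass" error handler writes them out. The helpers has_surrogates and encode_utf8 are hypothetical names, and the O(n) scan stands in for the work that a cached flag would avoid.

```python
def has_surrogates(text):
    """O(n) scan that a cached flag would make unnecessary (hypothetical)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def encode_utf8(text):
    """Choose the error handler based on the surrogate check."""
    if has_surrogates(text):
        # Lone surrogates are invalid in strict UTF-8; 'surrogatepass'
        # encodes them anyway as three-byte sequences.
        return text.encode("utf-8", "surrogatepass")
    # Surrogate-free strings can take the plain strict path.
    return text.encode("utf-8")
```

With a flag already set to "no surrogates", the check in encode_utf8 would be a constant-time test instead of a pass over the whole string.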
Victor