[Python-ideas] Add "has_surrogates" flags to string object

Victor Stinner victor.stinner at gmail.com
Tue Oct 8 14:23:09 CEST 2013


I like the idea. I would prefer to add another flag (1 bit), instead of
having a complex flag with 4 different values.

Your idea looks specific to PEP 393, so I would prefer to keep the flag
private. Otherwise it would be hard for other implementations of
Python to implement a function that returns the flag value.

Victor

2013/10/8 Serhiy Storchaka <storchaka at gmail.com>:
> Here is an idea: add a mark to the PyUnicode object that allows a fast
> answer to the question of whether a string contains surrogate code points.
> This mark has one of three possible states:
>
> * String doesn't contain surrogates.
> * String contains surrogates.
> * It is still unknown.
>
> We can combine this with the "is_ascii" flag in a 2-bit value:
>
> * String is ASCII-only (and doesn't contain surrogates).
> * String is not ASCII-only and doesn't contain surrogates.
> * String is not ASCII-only and contains surrogates.
> * String is not ASCII-only and it is still unknown if it contains surrogates.
>
> By default a string is created in the "unknown" state (if it is UCS2 or UCS4).
> After the first request it can be switched to "has surrogates" or "has no
> surrogates". The state of the result of concatenating or slicing can be
> determined from the states of the input strings.
>
> This will allow faster UTF-16 and UTF-32 encoding (and perhaps even slightly
> faster UTF-8 encoding) and conversion to wchar_t* if the string has no
> surrogates (which is true in most cases).
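
To make the quoted scheme concrete, here is a rough pure-Python sketch of
the three-state mark with lazy evaluation and state propagation. It is only
an illustration under assumptions: FlaggedStr, SurrogateState and
scan_for_surrogates are hypothetical names invented for this sketch, and the
real flag would live in the C-level PyUnicode structure, not in a wrapper
class.

```python
from enum import Enum


class SurrogateState(Enum):
    """Hypothetical three-state mark from the proposal."""
    UNKNOWN = 0
    NO_SURROGATES = 1
    HAS_SURROGATES = 2


def scan_for_surrogates(text):
    """Full scan: True if any code point is in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


class FlaggedStr:
    """Hypothetical wrapper standing in for a PyUnicode object with the mark."""

    def __init__(self, text, state=SurrogateState.UNKNOWN):
        self.text = text
        # An ASCII-only string can never contain surrogates, so its
        # state is known at creation time without a scan.
        if text.isascii():
            state = SurrogateState.NO_SURROGATES
        self.state = state

    def has_surrogates(self):
        # The scan runs at most once; later calls reuse the cached state.
        if self.state is SurrogateState.UNKNOWN:
            if scan_for_surrogates(self.text):
                self.state = SurrogateState.HAS_SURROGATES
            else:
                self.state = SurrogateState.NO_SURROGATES
        return self.state is SurrogateState.HAS_SURROGATES

    def __add__(self, other):
        states = (self.state, other.state)
        if SurrogateState.HAS_SURROGATES in states:
            # A surrogate in either operand survives concatenation.
            result = SurrogateState.HAS_SURROGATES
        elif SurrogateState.UNKNOWN in states:
            result = SurrogateState.UNKNOWN
        else:
            result = SurrogateState.NO_SURROGATES
        return FlaggedStr(self.text + other.text, result)

    def __getitem__(self, index):
        # A slice of a surrogate-free string is surrogate-free. A slice of
        # any other string may have dropped the surrogates, so it is unknown.
        if self.state is SurrogateState.NO_SURROGATES:
            state = SurrogateState.NO_SURROGATES
        else:
            state = SurrogateState.UNKNOWN
        return FlaggedStr(self.text[index], state)
```

An encoder could then call has_surrogates() once and take a fast path that
skips per-character surrogate checks whenever it returns False.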

