[Python-ideas] Add "has_surrogates" flags to string object
M.-A. Lemburg
mal at egenix.com
Tue Oct 8 13:58:00 CEST 2013
On 08.10.2013 13:17, Serhiy Storchaka wrote:
> Here is an idea: add a mark to the PyUnicode object that quickly answers whether a string
> contains surrogate code points. This mark has one of three possible states:
>
> * String doesn't contain surrogates.
> * String contains surrogates.
> * It is still unknown.
>
> We can combine this with "is_ascii" flag in 2-bit value:
>
> * String is ASCII-only (and doesn't contain surrogates).
> * String is not ASCII-only and doesn't contain surrogates.
> * String is not ASCII-only and contains surrogates.
> * String is not ASCII-only and it is still unknown whether it contains surrogates.
>
> By default a string is created in the "unknown" state (if it is UCS2 or UCS4). After the first
> request it can be switched to "has surrogates" or "has no surrogates". The state of the result of
> concatenating or slicing can be determined from the states of the input strings.
>
> This will allow faster UTF-16 and UTF-32 encoding (and perhaps even slightly faster UTF-8 encoding)
> and faster conversion to wchar_t* if the string has no surrogates (which is true in most cases).
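A minimal sketch of the proposal quoted above, written in Python rather than C so it can be run directly. The names SurrogateState, MarkedStr, scan and concat_state are hypothetical and exist only for this illustration; the numeric values of the four states are likewise an assumption. Only lazy resolution and concatenation are modeled, not slicing.

```python
from enum import IntEnum


class SurrogateState(IntEnum):
    """Hypothetical 2-bit value combining "is_ascii" with the surrogate mark."""
    ASCII = 0            # ASCII-only, hence no surrogates
    NO_SURROGATES = 1    # not ASCII-only, known to contain no surrogates
    HAS_SURROGATES = 2   # not ASCII-only, known to contain surrogates
    UNKNOWN = 3          # not ASCII-only, not scanned yet


def scan(text):
    """Resolve the state with one pass over the string."""
    if text.isascii():  # str.isascii requires Python 3.7+
        return SurrogateState.ASCII
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in text):
        return SurrogateState.HAS_SURROGATES
    return SurrogateState.NO_SURROGATES


def concat_state(a, b):
    """Derive the state of a concatenation from the states of its inputs."""
    S = SurrogateState
    if S.HAS_SURROGATES in (a, b):
        return S.HAS_SURROGATES      # one known surrogate is enough
    if S.UNKNOWN in (a, b):
        return S.UNKNOWN             # cannot decide without scanning
    if a is S.ASCII and b is S.ASCII:
        return S.ASCII
    return S.NO_SURROGATES


class MarkedStr:
    """Toy stand-in for a string object that carries the mark."""

    def __init__(self, text, state=None):
        self.text = text
        if state is None:
            # Non-ASCII strings start out "unknown", as in the proposal.
            state = (SurrogateState.ASCII if text.isascii()
                     else SurrogateState.UNKNOWN)
        self.state = state

    def has_surrogates(self):
        # The first request resolves the state; later requests are O(1).
        if self.state is SurrogateState.UNKNOWN:
            self.state = scan(self.text)
        return self.state is SurrogateState.HAS_SURROGATES

    def __add__(self, other):
        return MarkedStr(self.text + other.text,
                         concat_state(self.state, other.state))
```

In this model an encoder would call has_surrogates() once and take the fast path whenever it returns False; the cached state then carries over to concatenations without another scan.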
I guess you could use one bit from the kind field
for that:
    /* Character size:

       - PyUnicode_WCHAR_KIND (0):

         * character type = wchar_t (16 or 32 bits, depending on the
           platform)

       - PyUnicode_1BYTE_KIND (1):

         * character type = Py_UCS1 (8 bits, unsigned)
         * all characters are in the range U+0000-U+00FF (latin1)
         * if ascii is set, all characters are in the range U+0000-U+007F
           (ASCII), otherwise at least one character is in the range
           U+0080-U+00FF

       - PyUnicode_2BYTE_KIND (2):

         * character type = Py_UCS2 (16 bits, unsigned)
         * all characters are in the range U+0000-U+FFFF (BMP)
         * at least one character is in the range U+0100-U+FFFF

       - PyUnicode_4BYTE_KIND (4):

         * character type = Py_UCS4 (32 bits, unsigned)
         * all characters are in the range U+0000-U+10FFFF
         * at least one character is in the range U+10000-U+10FFFF
     */
    unsigned int kind:3;
For some reason, it allocates 3 bits for only four kinds; 2 bits
would be enough if the kinds were numbered 0-3 instead of 0, 1, 2, 4.
Then again, the state struct is unsigned int, so there's still plenty
of room for extra flags.
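As a rough illustration of the point about spare room, here is a Python model of packing an extra field next to kind in one unsigned int. It assumes a 32-bit unsigned int; the bit offsets, the 2-bit surrogate field, and the helper names kind_for, set_field and get_field are hypothetical and do not reflect the real struct layout.

```python
# Kind values as listed in the comment above.
KIND_VALUES = {"WCHAR": 0, "1BYTE": 1, "2BYTE": 2, "4BYTE": 4}

STATE_BITS = 32                       # assumption: unsigned int is 32 bits wide
KIND_SHIFT = 0                        # hypothetical offset, for this sketch only
KIND_WIDTH = 3                        # from "unsigned int kind:3"
SURR_SHIFT = KIND_SHIFT + KIND_WIDTH  # hypothetical surrogate-state field
SURR_WIDTH = 2


def kind_for(text):
    """Smallest kind (1, 2 or 4) able to hold every code point of text."""
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return KIND_VALUES["1BYTE"]
    if top <= 0xFFFF:
        return KIND_VALUES["2BYTE"]
    return KIND_VALUES["4BYTE"]


def set_field(state, shift, width, value):
    """Return state with the bit field at (shift, width) set to value."""
    mask = (1 << width) - 1
    if value & ~mask:
        raise ValueError("value does not fit in the field")
    return (state & ~(mask << shift)) | (value << shift)


def get_field(state, shift, width):
    """Read the bit field at (shift, width) from state."""
    return (state >> shift) & ((1 << width) - 1)


# 4 == 0b100, so the kind values as listed occupy three bits.
bits_for_kind = max(v.bit_length() for v in KIND_VALUES.values())

state = set_field(0, KIND_SHIFT, KIND_WIDTH, kind_for("\U00010000"))
state = set_field(state, SURR_SHIFT, SURR_WIDTH, 2)  # 2 = some surrogate state
```

Even with both fields set, the word uses only its low five bits here, which leaves the rest of the unsigned int free for further flags.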
--
Marc-Andre Lemburg
eGenix.com