[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 13:58:00 CEST 2013

On 08.10.2013 13:17, Serhiy Storchaka wrote:
> Here is an idea about adding a mark to the PyUnicode object that allows a fast answer to the
> question of whether a string contains surrogate code points. This mark has one of three possible
> states:
> 
> * String doesn't contain surrogates.
> * String contains surrogates.
> * It is still unknown.
> 
> We can combine this with "is_ascii" flag in 2-bit value:
> 
> * String is ASCII-only (and doesn't contain surrogates).
> * String is not ASCII-only and doesn't contain surrogates.
> * String is not ASCII-only and contains surrogates.
> * String is not ASCII-only and it is still unknown whether it contains surrogates.
> 
> By default a string is created in the "unknown" state (if it is UCS2 or UCS4). After the first
> request it can be switched to "has surrogates" or "has no surrogates". The state of the result of
> concatenating or slicing can be determined from the states of the input strings.
> 
> This will allow faster UTF-16 and UTF-32 encoding (and perhaps even slightly faster UTF-8 encoding)
> and conversion to wchar_t* if the string has no surrogates (which is true in most cases).
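
The proposed three-state mark and its propagation rules could be modelled roughly as follows.
This is only a sketch in pure Python, not CPython code: the names Surrogates, scan,
concat_state and slice_state are hypothetical, and the rule that a slice of a string with
surrogates falls back to "unknown" is an assumption (the slice may or may not have kept them).

```python
from enum import IntEnum


class Surrogates(IntEnum):
    """Hypothetical tri-state mark; the names are illustrative only."""
    UNKNOWN = 0
    ABSENT = 1
    PRESENT = 2


def scan(text: str) -> Surrogates:
    """Resolve the state with one full pass (the "first request")."""
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in text):
        return Surrogates.PRESENT
    return Surrogates.ABSENT


def concat_state(left: Surrogates, right: Surrogates) -> Surrogates:
    """State of left + right, derived from the operand states alone."""
    if Surrogates.PRESENT in (left, right):
        return Surrogates.PRESENT
    if left is Surrogates.ABSENT and right is Surrogates.ABSENT:
        return Surrogates.ABSENT
    return Surrogates.UNKNOWN


def slice_state(parent: Surrogates) -> Surrogates:
    """State of a slice: only "no surrogates" is inherited for certain."""
    if parent is Surrogates.ABSENT:
        return Surrogates.ABSENT
    return Surrogates.UNKNOWN
```

Under these rules, concatenation never needs a rescan when either operand is known to contain
surrogates or both are known to be free of them; only slicing a string with surrogates loses
information.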

I guess you could use one bit from the kind field of the
state structure for that:

        /* Character size:

           - PyUnicode_WCHAR_KIND (0):

             * character type = wchar_t (16 or 32 bits, depending on the
               platform)

           - PyUnicode_1BYTE_KIND (1):

             * character type = Py_UCS1 (8 bits, unsigned)
             * all characters are in the range U+0000-U+00FF (latin1)
             * if ascii is set, all characters are in the range U+0000-U+007F
               (ASCII), otherwise at least one character is in the range
               U+0080-U+00FF

           - PyUnicode_2BYTE_KIND (2):

             * character type = Py_UCS2 (16 bits, unsigned)
             * all characters are in the range U+0000-U+FFFF (BMP)
             * at least one character is in the range U+0100-U+FFFF

           - PyUnicode_4BYTE_KIND (4):

             * character type = Py_UCS4 (32 bits, unsigned)
             * all characters are in the range U+0000-U+10FFFF
             * at least one character is in the range U+10000-U+10FFFF
         */
        unsigned int kind:3;

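The field allocates 3 bits, although only four distinct values
(0, 1, 2 and 4) are used. Those would fit into 2 bits, but only
if the kinds were encoded differently, since the value 4 needs
the third bit.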

Then again, the state struct is an unsigned int, so there's still plenty
of room for extra flags.
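
To illustrate the point about spare room, here is a rough model of such a flags word using
ctypes bit fields. Only the 3-bit kind field is taken from the header comment above; the
surrogates field and its values are hypothetical, and the real state struct contains other
fields that are left out here.

```python
import ctypes


class State(ctypes.Structure):
    """Rough model of a flags word; `surrogates` is not part of CPython."""
    _fields_ = [
        ("kind", ctypes.c_uint, 3),        # 0, 1, 2 or 4, as in the header comment
        ("surrogates", ctypes.c_uint, 2),  # hypothetical: 0=unknown, 1=no, 2=yes
    ]


state = State(kind=4, surrogates=1)

# Both bit fields share a single unsigned int, so the struct does not grow.
assert ctypes.sizeof(State) == ctypes.sizeof(ctypes.c_uint)

# The kind value 4 (binary 100) occupies all three bits of its field.
assert (4).bit_length() == 3
```

Adding the flag as a separate bit field, rather than taking a bit away from kind, leaves the
existing kind values untouched.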

-- 
Marc-Andre Lemburg
eGenix.com
