[Python-3000] string C API

Sat Sep 16 20:51:33 CEST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> Nick Coghlan schrieb:
> > If an 8-bit encoding other than latin-1 is used for the internal buffer,
> > then every comparison operation would have to decode the string to
> > Unicode in order to compare code points.
> > 
> > It seems much simpler to me to ensure that what is stored internally is
> > *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> > determined by the largest code point in the string.
> 
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

One algorithm.  Each character can be "decoded" (widened to a common
integer type) at runtime, as it is read.

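long expand(void* buffer, Py_ssize_t posn, int shift) {
    /* shift is log2 of the character width in bytes: 0, 1 or 2 */
    char* p = (char*)buffer + (posn << shift);
    switch (shift) {
    case 0:  return ((unsigned char*)p)[0];
    case 1:  return ((unsigned short*)p)[0];
    /* assumes unsigned int is 32 bits; long may be 64 */
    case 2:  return ((unsigned int*)p)[0];
    default: return -1;
    }
}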
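
To make the "one algorithm" point concrete, here is a rough sketch of a
comparison loop built on that kind of per-character read.  The names
expand_sketch, compare_sketch and ssize_sketch_t are made up for this
illustration (ssize_sketch_t stands in for Py_ssize_t so the snippet
compiles without Python.h), and the fixed-width types from <stdint.h>
are an assumption about how the three buffer widths would be declared.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_ssize_t, so this compiles on its own. */
typedef ptrdiff_t ssize_sketch_t;

/* Read the character at index posn; shift is log2 of the width in bytes. */
static long expand_sketch(const void *buffer, ssize_sketch_t posn, int shift)
{
    const unsigned char *p =
        (const unsigned char *)buffer + ((size_t)posn << shift);
    switch (shift) {
    case 0:  return *p;
    case 1:  return *(const uint16_t *)p;
    case 2:  return (long)*(const uint32_t *)p;
    default: return -1;
    }
}

/* Compare two strings of possibly different widths, code point by
   code point.  Returns -1, 0 or 1. */
static int compare_sketch(const void *a, ssize_sketch_t alen, int ashift,
                          const void *b, ssize_sketch_t blen, int bshift)
{
    ssize_sketch_t n = alen < blen ? alen : blen;
    for (ssize_sketch_t i = 0; i < n; i++) {
        long ca = expand_sketch(a, i, ashift);
        long cb = expand_sketch(b, i, bshift);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    /* Common prefix is equal, so the shorter string sorts first. */
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

The cost is a switch per character read, which is the trade against
writing out each width pair by hand.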

Alternatively, with a little work, the 9 variants can be defined with a
prototype system, using macros or otherwise.
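
A minimal sketch of what the macro route might look like, assuming
the loop body is written once and stamped out per width pair.  The
names DEFINE_COMPARE, cmp_W_W, cmp_fn and cmp_table are invented for
this example, as is the idea of dispatching through a table indexed
by the two shift values.

```c
#include <stddef.h>
#include <stdint.h>

/* Stamp out one comparison function for a given pair of element types.
   All variants share one signature so they can sit in a table. */
#define DEFINE_COMPARE(NAME, TA, TB)                                  \
    static int NAME(const void *pa, size_t alen,                      \
                    const void *pb, size_t blen)                      \
    {                                                                 \
        const TA *a = (const TA *)pa;                                 \
        const TB *b = (const TB *)pb;                                 \
        size_t n = alen < blen ? alen : blen;                         \
        for (size_t i = 0; i < n; i++) {                              \
            uint32_t ca = a[i], cb = b[i];                            \
            if (ca != cb)                                             \
                return ca < cb ? -1 : 1;                              \
        }                                                             \
        return (alen > blen) - (alen < blen);                         \
    }

/* The nine (width, width) variants. */
DEFINE_COMPARE(cmp_1_1, uint8_t,  uint8_t)
DEFINE_COMPARE(cmp_1_2, uint8_t,  uint16_t)
DEFINE_COMPARE(cmp_1_4, uint8_t,  uint32_t)
DEFINE_COMPARE(cmp_2_1, uint16_t, uint8_t)
DEFINE_COMPARE(cmp_2_2, uint16_t, uint16_t)
DEFINE_COMPARE(cmp_2_4, uint16_t, uint32_t)
DEFINE_COMPARE(cmp_4_1, uint32_t, uint8_t)
DEFINE_COMPARE(cmp_4_2, uint32_t, uint16_t)
DEFINE_COMPARE(cmp_4_4, uint32_t, uint32_t)

typedef int (*cmp_fn)(const void *, size_t, const void *, size_t);

/* Indexed by [shift of a][shift of b], where shift is 0, 1 or 2. */
static const cmp_fn cmp_table[3][3] = {
    { cmp_1_1, cmp_1_2, cmp_1_4 },
    { cmp_2_1, cmp_2_2, cmp_2_4 },
    { cmp_4_1, cmp_4_2, cmp_4_4 },
};
```

A caller would pick the variant once per comparison, e.g.
cmp_table[ashift][bshift](a, alen, b, blen), so the width test is
paid per call rather than per character.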

 - Josiah