[Python-3000] string C API
Josiah Carlson
jcarlson at uci.edu
Sat Sep 16 20:51:33 CEST 2006
"Martin v. Löwis" <martin at v.loewis.de> wrote:
>
> Nick Coghlan wrote:
> > If an 8-bit encoding other than latin-1 is used for the internal buffer,
> > then every comparison operation would have to decode the string to
> > Unicode in order to compare code points.
> >
> > It seems much simpler to me to ensure that what is stored internally is
> > *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> > determined by the largest code point in the string.
>
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.
One algorithm. Each character can be "decoded" at runtime.
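long expand(void* buffer, Py_ssize_t posn, int shift) {
    /* shift is log2 of the character width in bytes: 0, 1 or 2 */
    char* p = (char*)buffer + (posn << shift);
    switch (shift) {
        case 0: return ((unsigned char*)p)[0];
        case 1: return ((unsigned short*)p)[0];
        /* assumes a 32-bit unsigned int; long is 8 bytes on LP64 */
        case 2: return ((unsigned int*)p)[0];
        default: return -1;
    }
}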
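
To make the "one algorithm" idea concrete, here is a minimal sketch of a comparison that decodes each character as it goes. It is an illustration under assumptions, not code from CPython: compare_any is a hypothetical name, size_t stands in for Py_ssize_t so the block compiles on its own, and the fixed-width types from <stdint.h> replace short and long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the code point at index posn from a buffer whose characters are
   (1 << shift) bytes wide. Returns -1 for an unsupported width. */
static long expand(const void *buffer, size_t posn, int shift)
{
    const unsigned char *p = (const unsigned char *)buffer + (posn << shift);
    switch (shift) {
    case 0: return *p;
    case 1: return *(const uint16_t *)p;
    case 2: return (long)*(const uint32_t *)p;
    default: return -1;
    }
}

/* Hypothetical comparison: one loop serves every width pair, because
   each side is widened to a long before the code points are compared.
   Returns -1, 0 or 1. */
static int compare_any(const void *a, size_t alen, int ashift,
                       const void *b, size_t blen, int bshift)
{
    assert(ashift >= 0 && ashift <= 2 && bshift >= 0 && bshift <= 2);
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        long ca = expand(a, i, ashift);
        long cb = expand(b, i, bshift);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    /* Common prefix is equal: the shorter string sorts first. */
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

The cost is a switch per character on each side, which is the price of not writing a separate loop for every width pair.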
Alternatively, with a little work, the 9 variants can be defined with a
prototype system, using macros or otherwise.
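
A rough sketch of what the macro route could look like follows. Every name here (DEFINE_COMPARE, compare_1_2, compare_dispatch, and so on) is hypothetical and chosen only for illustration; the point is that the loop is written once and the preprocessor stamps out the nine typed variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: defines one comparison function for a given pair
   of character types. The loop body is written only once. */
#define DEFINE_COMPARE(NAME, TA, TB)                                    \
    static int NAME(const TA *a, size_t alen, const TB *b, size_t blen) \
    {                                                                   \
        size_t n = alen < blen ? alen : blen;                           \
        for (size_t i = 0; i < n; i++) {                                \
            uint32_t ca = a[i], cb = b[i];                              \
            if (ca != cb)                                               \
                return ca < cb ? -1 : 1;                                \
        }                                                               \
        return alen == blen ? 0 : (alen < blen ? -1 : 1);               \
    }

/* The nine (left width, right width) variants. */
DEFINE_COMPARE(compare_1_1, uint8_t,  uint8_t)
DEFINE_COMPARE(compare_1_2, uint8_t,  uint16_t)
DEFINE_COMPARE(compare_1_4, uint8_t,  uint32_t)
DEFINE_COMPARE(compare_2_1, uint16_t, uint8_t)
DEFINE_COMPARE(compare_2_2, uint16_t, uint16_t)
DEFINE_COMPARE(compare_2_4, uint16_t, uint32_t)
DEFINE_COMPARE(compare_4_1, uint32_t, uint8_t)
DEFINE_COMPARE(compare_4_2, uint32_t, uint16_t)
DEFINE_COMPARE(compare_4_4, uint32_t, uint32_t)

/* Pick the variant once per call; shifts are log2 of the byte width. */
static int compare_dispatch(const void *a, size_t alen, int ashift,
                            const void *b, size_t blen, int bshift)
{
    assert(ashift >= 0 && ashift <= 2 && bshift >= 0 && bshift <= 2);
    switch (ashift * 3 + bshift) {
    case 0: return compare_1_1(a, alen, b, blen);
    case 1: return compare_1_2(a, alen, b, blen);
    case 2: return compare_1_4(a, alen, b, blen);
    case 3: return compare_2_1(a, alen, b, blen);
    case 4: return compare_2_2(a, alen, b, blen);
    case 5: return compare_2_4(a, alen, b, blen);
    case 6: return compare_4_1(a, alen, b, blen);
    case 7: return compare_4_2(a, alen, b, blen);
    case 8: return compare_4_4(a, alen, b, blen);
    default: return 0; /* not reached when the assertion holds */
    }
}
```

Compared with decoding each character at runtime, this moves the width test out of the inner loop and pays for it with nine generated function bodies.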
- Josiah