[Python-3000] string C API

Tue Oct 3 12:29:14 CEST 2006

Jim Jewett wrote:
>>> Interning may get awkward if multiple encodings are allowed within a
>>> program, regardless of whether they're allowed for single strings.  It
>>> might make sense to intern only strings that are in the same encoding
>>> as the source code.  (Or whose values are limited to ASCII?)
> 
>> Why?  If the text hash function is defined on *code points*, then
>> interning, or really any arbitrary dictionary lookup is the same as it
>> has always been.
> 
> The problem isn't the hash; it is the equality.  Which encoding do you
> keep interned?

Are you using the verb "to intern" here in the sense of the intern()
builtin? If so: intern the representation of the string that gets
interned first. Python currently interns the entire string object
(not just the character data); I see no reason to change that.

>> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
>> method I previously described is that each of the encodings offer a
>> minimal representation of the code points that the text object contains.
> 
> There may be some thrashing as
> 
>     s+= (larger char)

This creates a new string, in any case (remember, strings are not
mutable). Assume the old value A was ucs-2, and (larger char) B is
ucs-4. The code to perform the addition would be

   C = PyString_New(A->ob_size + B->ob_size, UCS4);
   UCS2 *a_data = PyString_AsUCS2(A);
   UCS4 *b_data = PyString_AsUCS4(B);
   UCS4 *c_data = PyString_AsUCS4(C);
   for (int k = 0; k < A->ob_size; k++)
      *c_data++ = *a_data++;   /* widens UCS2 to UCS4 by plain assignment */
   for (int k = 0; k < B->ob_size; k++)
      *c_data++ = *b_data++;
   *c_data = 0;                /* terminating null character */

Notice that this code is independent of whether A and B have
different representations or not.
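
For illustration only, here is a self-contained sketch of the same
idea that picks the result width from the operands. The Text struct,
the Kind enum and the text_* helpers are hypothetical stand-ins
invented for this example; they are not an existing CPython API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy type: the enum value is the bytes per character. */
typedef enum { KIND_LATIN1 = 1, KIND_UCS2 = 2, KIND_UCS4 = 4 } Kind;

typedef struct {
    Kind kind;
    size_t size;  /* number of characters, excluding the terminator */
    void *data;   /* size + 1 units of `kind` bytes each */
} Text;

/* Read character i as a code point, whatever the storage width. */
static uint32_t text_get(const Text *t, size_t i) {
    switch (t->kind) {
    case KIND_LATIN1: return ((const uint8_t *)t->data)[i];
    case KIND_UCS2:   return ((const uint16_t *)t->data)[i];
    default:          return ((const uint32_t *)t->data)[i];
    }
}

/* Store code point c at index i; the caller guarantees that it fits. */
static void text_set(Text *t, size_t i, uint32_t c) {
    switch (t->kind) {
    case KIND_LATIN1: ((uint8_t *)t->data)[i] = (uint8_t)c; break;
    case KIND_UCS2:   ((uint16_t *)t->data)[i] = (uint16_t)c; break;
    default:          ((uint32_t *)t->data)[i] = c; break;
    }
}

/* Allocate a zero-filled text, so the terminator is already in place. */
static Text *text_new(size_t size, Kind kind) {
    Text *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->data = calloc(size + 1, (size_t)kind);
    if (t->data == NULL) {
        free(t);
        return NULL;
    }
    t->kind = kind;
    t->size = size;
    return t;
}

static void text_free(Text *t) {
    if (t != NULL) {
        free(t->data);
        free(t);
    }
}

/* Concatenate: the result uses the wider of the two operand kinds,
   and every character is copied by value, with no codec involved. */
static Text *text_concat(const Text *a, const Text *b) {
    Kind kind = a->kind > b->kind ? a->kind : b->kind;
    Text *c = text_new(a->size + b->size, kind);
    if (c == NULL)
        return NULL;
    for (size_t i = 0; i < a->size; i++)
        text_set(c, i, text_get(a, i));
    for (size_t i = 0; i < b->size; i++)
        text_set(c, a->size + i, text_get(b, i));
    return c;
}
```

The per-character switch is slower than the specialized loops above;
it is only meant to show that the copy never needs to know more than
the width of each operand.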

>     s[:6]

This would require two iterations over the string: one to find
the maximum character, and the second to perform the actual
copying.
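
A rough sketch of those two passes, assuming a UCS-4 source buffer
(width_for and slice_ucs4 are names made up for this example, not
proposed API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Pass 1: find the maximum character and derive the bytes per
   character needed to store the slice (1, 2 or 4). */
static int width_for(const uint32_t *src, size_t n) {
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++)
        if (src[i] > max)
            max = src[i];
    if (max < 0x100)
        return 1;
    if (max < 0x10000)
        return 2;
    return 4;
}

/* Pass 2: copy src[start:stop] into a freshly allocated buffer of
   the narrowest width. Returns NULL on allocation failure; the
   caller frees the result. */
static void *slice_ucs4(const uint32_t *src, size_t start, size_t stop,
                        int *width) {
    assert(start <= stop);
    size_t n = stop - start;
    *width = width_for(src + start, n);
    void *dst = calloc(n + 1, (size_t)*width);  /* zero terminator */
    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[start + i];
        if (*width == 1)
            ((uint8_t *)dst)[i] = (uint8_t)c;
        else if (*width == 2)
            ((uint16_t *)dst)[i] = (uint16_t)c;
        else
            ((uint32_t *)dst)[i] = c;
    }
    return dst;
}
```

If the source is already in the narrowest representation, the first
pass can be skipped, since a slice can never need wider characters
than the string it was taken from.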

> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits.  (Why
> not the local 8-bit?  Why not UTF-8, if that is the system encoding?)

Take a look at the code above. Here, I never invoke a codec routine (which
would be quite expensive). Instead, I rely on the fact that the
characters have the same numeric values in all three representations.
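
To make the contrast with UTF-8 concrete, here is a small sketch
(widen_latin1 and utf8_decode2 are hypothetical helpers written for
this example; the decoder handles only 1- and 2-byte sequences and
does no validation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latin-1 to UCS-4: every byte already is its code point, so
   widening is a plain assignment per character. */
static void widen_latin1(const uint8_t *src, size_t n, uint32_t *dst) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* UTF-8 needs real decoding. This handles only 1- and 2-byte
   sequences, without validation, which is enough to show that the
   bytes are not the code points. Returns the number of bytes
   consumed. */
static size_t utf8_decode2(const uint8_t *src, uint32_t *out) {
    if (src[0] < 0x80) {
        *out = src[0];
        return 1;
    }
    *out = ((uint32_t)(src[0] & 0x1F) << 6) | (uint32_t)(src[1] & 0x3F);
    return 2;
}
```

U+00E9 is the single byte 0xE9 in latin-1, and assignment gives the
right UCS-4 value. In UTF-8 the same character is the byte pair
0xC3 0xA9, so a UTF-8 representation would need a decoding step on
every such copy.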

> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

Not sure which implementors you are talking about: anybody who wants
to clone Python is free to do whatever they want. We *are* the
implementors of CPython, and if we don't want to do more, then
we just don't want it.

Regards,
Martin