[Python-3000] string C API

Tue Oct 3 21:33:33 CEST 2006

Jim Jewett wrote:
> By knowing that there is only one possible representation for a given
> string, he skips the equivalency cache.  On the other hand, he also
> loses the equivalency cache.

What is an equivalency cache, and why would one like to have one?

> When Python 2.x chooses the unicode
> width, it tries to match Tcl; under a "minimal size possible" scheme,
> strings that fit in ASCII will have to be recoded twice on every round
> trip.  The same problem pops up with other extension modules, and with
> system encodings.

In _tkinter, strings *always* have to be copied, whether they use the
same representation or a different one. Tcl requires strings to be
represented in a Tcl_Obj; you cannot pass a Python string object directly
into Tcl. Since you have to copy anyway, it doesn't matter whether you
also do size conversions in the process.
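
To make the point concrete, here is a rough sketch in Python. It is not
_tkinter's actual code; the two helper names are hypothetical. Both
helpers make exactly one pass over the source and fill a freshly
allocated buffer. The only difference is the width of the items in the
destination.

```python
from array import array


def copy_same_width(src: bytes) -> array:
    """Hypothetical helper: copy 1-byte code units into a new 1-byte buffer."""
    dst = array("B")          # unsigned char, 1 byte per item
    for unit in src:          # one pass over the source
        dst.append(unit)
    return dst


def copy_widening(src: bytes) -> array:
    """Hypothetical helper: copy 1-byte code units into a new, wider buffer."""
    dst = array("H")          # unsigned short, at least 2 bytes per item
    for unit in src:          # still one pass over the source
        dst.append(unit)
    return dst


text = b"hello"
same = copy_same_width(text)
wide = copy_widening(text)
```

Both results hold the same code units; the widening happens as a side
effect of a copy that had to be made in any case.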

> By exposing the full object instead of the abstract interface,
> compilers can do pointer addition instead of calling a get_data
> function.  But they still don't know (until run time) how wide the
> data at that pointer will be, and we're locked into binary
> compatibility.

That's not true; we are not locked into binary compatibility. The
internal representation of objects can change, and has changed, across
releases. People have to, and will, recompile their extension modules
for a new feature release.

>> I doubt any kind of "pluggable" representation could work in a
>> reasonable way. With that generality, you lose any information
>> as to what the internal representation is, and then code becomes
>> tedious to write and slow to run.
> 
> Instead of working with ((string)obj).data directly, you work with
> string.recode(object, desired)

... causing a copy of the data, right? This is expensive.

> If you're saying this will be slow because it is a C function call,
> then I can't really argue; I just think it will be a good trade for
> all the times we don't recode at all (or recode only once/encoding).

It's not the function call that makes it slow. It's the copying of
potentially large string data that a recoding requires. In addition,
for some encodings, the algorithm to do the transformation is
fairly slow.
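
A minimal sketch of what I mean, in Python. The recode() function here
is a hypothetical stand-in for the proposed string.recode(object,
desired), not an existing API. Each call builds a new buffer whose size
is proportional to the input, so the cost grows with the length of the
string no matter how cheap the call itself is.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical stand-in for string.recode(object, desired).

    This version goes through an intermediate str object, so it actually
    copies the data twice: once to decode and once to encode.
    """
    return data.decode(source).encode(target)


original = ("x" * 1000).encode("latin-1")              # 1 byte per character
recoded = recode(original, "latin-1", "utf-16-le")     # 2 bytes per character
round_trip = recode(recoded, "utf-16-le", "latin-1")   # and back again
```

The round trip yields equal bytes, but only after every character has
been read and written again in each direction.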

> I'll admit that I'm not sure what sort of data would make a real-world
> (as opposed to contrived) benchmark.

Any kind of text application will suffer if strings are constantly
being recoded.

Regards,
Martin