[Python-3000] string C API

Jim Jewett jimjjewett at gmail.com
Tue Oct 3 22:29:13 CEST 2006


On 10/3/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > By knowing that there is only one possible representation for a given
> > string, he skips the equivalency cache.  On the other hand, he also
> > loses the equivalency cache.

> What is an equivalency cache, and why would one like to have one?

Same string, different encoding.

The Py 2.x unicode implementation saves a cached copy of the string
encoded in the default encoding, but

    (1) it always creates the UCS4 (or UCS2) representation, even
though it isn't always needed.
    (2) any 3rd encoding -- no matter how frequent -- requires either
a fresh copy every time, or manual caching.

An equivalency cache would save all input/output encodings that the
string was recoded to/from.  (Possibly only with weak references --
the mapping itself might benefit from tuning based on various
applications.)
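
Concretely, such a cache could be as simple as a per-string mapping
from encoding name to the already-computed bytes.  A sketch (the
`recode` helper is hypothetical, not any real CPython structure):

```python
# Sketch: an equivalency cache is just a mapping
#   encoding name -> bytes of this string in that encoding,
# filled in lazily as conversions happen.

def recode(cache, target):
    """Return this string's bytes in `target`, reusing any
    representation already present in the cache."""
    if target not in cache:
        # Convert from whichever representation we already have.
        src_enc, src_bytes = next(iter(cache.items()))
        cache[target] = src_bytes.decode(src_enc).encode(target)
    return cache[target]

# A string that arrived as Latin-1:
cache = {"latin-1": b"na\xefve"}
recode(cache, "utf-8")   # converts once, stores the result
recode(cache, "utf-8")   # second call: cache hit, no conversion
assert set(cache) == {"latin-1", "utf-8"}
```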

Today

    Get a string in Latin-1 (or UTF-8, or ...)
    recode to UCS4 (and throw out the Latin-1?)
    process...

    Get another string in Latin-1
    recode it to UCS4 (and throw out the Latin-1?)

    compare them (in UCS4)

    convert first string to UTF-8 for output.
    ...
    reconvert first string to UTF-8 again, because it wasn't saved

With my proposal

    Get a string in Latin-1 (or UTF-8, or ...)
    XXX nope, delay (or skip) recoding
    process...

    Get another string in Latin-1
    XXX delay or skip recoding

    compare them (in the original Latin-1).

    convert first string to UTF-8 for output.  (and save this in the
encodings cache)
    ...
    reconvert first string to UTF-8 again -- this time just resend the
earlier copy.
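
The proposed behavior can be sketched in Python.  The `EquivString`
class and its methods are hypothetical (nothing like this exists in
CPython); it just shows the lazy recoding, the cached UTF-8 output,
and the compare-in-the-original-encoding fast path:

```python
# Hypothetical sketch of a string type with an equivalency cache.
# Nothing is recoded up front; each representation is created on
# first use and then saved for reuse.

class EquivString:
    def __init__(self, data, encoding):
        # Cache maps encoding name -> bytes in that encoding.
        # The arrival representation is simply the first entry;
        # no canonical UCS4 copy is made eagerly.
        self._cache = {encoding: data}
        self.recode_count = 0  # instrumentation for this demo only

    def as_encoding(self, encoding):
        """Return this string's bytes in `encoding`, recoding (and
        caching the result) only if no copy exists yet."""
        if encoding not in self._cache:
            src_enc, src_bytes = next(iter(self._cache.items()))
            self._cache[encoding] = src_bytes.decode(src_enc).encode(encoding)
            self.recode_count += 1
        return self._cache[encoding]

    def __eq__(self, other):
        # Fast path: if both strings already hold a representation
        # in a common encoding, compare those bytes directly.
        common = self._cache.keys() & other._cache.keys()
        if common:
            enc = next(iter(common))
            return self._cache[enc] == other._cache[enc]
        # Otherwise recode one side (which is then cached).
        enc = next(iter(self._cache))
        return self._cache[enc] == other.as_encoding(enc)


s1 = EquivString("caf\xe9".encode("latin-1"), "latin-1")
s2 = EquivString("caf\xe9".encode("latin-1"), "latin-1")

s1 == s2                        # compared in the original Latin-1
out1 = s1.as_encoding("utf-8")  # first UTF-8 output: recode once, cache
out2 = s1.as_encoding("utf-8")  # second output: reuse the earlier copy
assert out1 is out2
assert s1.recode_count == 1
```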



> > By exposing the full object instead of the abstract interface,
> > compilers can do pointer addition instead of calling a get_data
> > function.  But they still don't know (until run time) how wide the
> > data at that pointer will be, and we're locked into binary
> > compatibility.

> That's not true. The internal representation of objects can and did
> change across releases. People have to and will recompile their
> extension modules for a new feature release.

Recompiling is different from rewriting; if you have said "the data
will be at this location" instead of "you can get a pointer to the
data from this method", you can't really change that later.

> >> I doubt any kind of "pluggable" representation could work in a
> >> reasonable way. With that generality, you lose any information
> >> as to what the internal representation is, and then code becomes
> >> tedious to write and slow to run.

> > Instead of working with ((string)obj).data directly, you work with
> > string.recode(object, desired)

> ... causing a copy of the data, right? This is expensive.

Only if
    (1)  You insist on a specific encoding
    (2)  That encoding is not already available, either as the way it
started, or through the equivalency cache.

> > If you're saying this will be slow because it is a C function call,
> > then I can't really argue; I just think it will be a good trade for
> > all the times we don't recode at all (or recode only once/encoding).

> It's not the function call that makes it slow. It's the copying of
> potentially large string data that a recoding requires. In addition,
> for some encodings, the algorithm to do the transformation is
> fairly slow.

Which is why I would like the equivalency cache to save each of the
UCS4, Latin-1, and UTF-8 byte patterns once they're created, but not
to create any of them until needed.

-jJ

