[Python-3000] string C API
Jim Jewett
jimjjewett at gmail.com
Tue Oct 3 22:29:13 CEST 2006
On 10/3/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > By knowing that there is only one possible representation for a given
> > string, he skips the equivalency cache. On the other hand, he also
> > loses the equivalency cache.
> What is an equivalency cache, and why would one like to have one?
Same string, different encoding.
The Py 2.x unicode implementation saves a cached copy of the string
encoded in the default encoding, but
(1) it always creates the UCS4 (or UCS2) representation, even though it
isn't always needed.
(2) any third encoding -- no matter how frequently it is needed --
requires either a fresh copy every time, or manual caching.
An equivalency cache would save all input/output encodings that the
string was recoded to/from. (Possibly only with weak references --
the mapping itself might benefit from tuning based on various
applications.)
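
To make the idea concrete, here is a minimal sketch in Python (rather than
C, where the real work would happen); the class name CachedString and its
recode method are hypothetical, chosen only to pin down the intended
behavior:

    import codecs

    class CachedString:
        """Hypothetical string object: keeps the bytes it started with and
        lazily caches every other encoding it is asked to produce."""

        def __init__(self, data, encoding):
            encoding = codecs.lookup(encoding).name   # normalize the name
            self._original = encoding
            self._encodings = {encoding: data}        # the equivalency cache

        def recode(self, encoding):
            """Return the bytes in `encoding`, copying only on a cache miss."""
            encoding = codecs.lookup(encoding).name
            if encoding not in self._encodings:
                text = self._encodings[self._original].decode(self._original)
                self._encodings[encoding] = text.encode(encoding)
            return self._encodings[encoding]

        def __eq__(self, other):
            if not isinstance(other, CachedString):
                return NotImplemented
            # If both sides already hold a common encoding, compare those
            # bytes directly (assumes a canonical encoding such as Latin-1
            # or UTF-8, where equal text means equal bytes).
            shared = self._encodings.keys() & other._encodings.keys()
            if shared:
                enc = next(iter(shared))
                return self._encodings[enc] == other._encodings[enc]
            # Otherwise recode one side once; the copy stays in the cache.
            return (self.recode(other._original) ==
                    other._encodings[other._original])

(A real implementation might hold the cached copies weakly or prune them,
as suggested above.)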
Today:

    Get a string in Latin-1 (or UTF-8, or ...)
    Recode it to UCS4 (and throw out the Latin-1?)
    Process...
    Get another string in Latin-1
    Recode it to UCS4 (and throw out the Latin-1?)
    Compare them (in UCS4)
    Convert the first string to UTF-8 for output.
    ...
    Reconvert the first string to UTF-8 again, because it wasn't saved.
With my proposal:

    Get a string in Latin-1 (or UTF-8, or ...)
    XXX nope -- delay (or skip) the recoding
    Process...
    Get another string in Latin-1
    XXX delay or skip the recoding
    Compare them (in the original Latin-1).
    Convert the first string to UTF-8 for output (and save this in the
    encodings cache).
    ...
    Reconvert the first string to UTF-8 again -- this time just resend
    the earlier copy.
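
As a rough walkthrough of that flow, using the hypothetical CachedString
sketch above (names and data purely illustrative):

    # Get two strings in Latin-1; nothing is recoded yet.
    a = CachedString("pâté".encode("latin-1"), "latin-1")
    b = CachedString("pâté".encode("latin-1"), "latin-1")

    # Compare them in the original Latin-1 -- no UCS4 copy is ever built.
    assert a == b

    # Convert the first string to UTF-8 for output; the copy lands in the cache.
    out1 = a.recode("utf-8")

    # "Reconvert" later: the cache just hands back the earlier copy.
    out2 = a.recode("utf-8")
    assert out1 is out2      # same bytes object, no second transformation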
> > By exposing the full object instead of the abstract interface,
> > compilers can do pointer addition instead of calling a get_data
> > function. But they still don't know (until run time) how wide the
> > data at that pointer will be, and we're locked into binary
> > compatibility.
> That's not true. The internal representation of objects can and did
> change across releases. People have to and will recompile their
> extension modules for a new feature release.
Recompiling is different from rewriting. If the API has said "the data
will be at this location" instead of "you can get a pointer to the data
from this method", you can't really change that later.
> >> I doubt any kind of "pluggable" representation could work in a
> >> reasonable way. With that generality, you lose any information
> >> as to what the internal representation is, and then code becomes
> >> tedious to write and slow to run.
> > Instead of working with ((string)obj).data directly, you work with
> > string.recode(object, desired)
> ... causing a copy of the data, right? This is expensive.
Only if
(1) you insist on a specific encoding, and
(2) that encoding is not already available, either as the form the
string started in, or through the equivalency cache.
> > If you're saying this will be slow because it is a C function call,
> > then I can't really argue; I just think it will be a good trade for
> > all the times we don't recode at all (or recode only once/encoding).
> It's not the function call that makes it slow. It's the copying of
> potentially large string data that a recoding requires. In addition,
> for some encodings, the algorithm to do the transformation is
> fairly slow.
Which is why I would like the equivalency cache to save each of the
UCS4, Latin-1, and UTF-8 byte patterns once they're created, but not
to create any of them until needed.
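
With the same hypothetical CachedString sketch, that last point -- create
nothing until it is asked for, and create each form at most once -- would
look like this (peeking at the private cache purely for illustration):

    s = CachedString("naïve".encode("latin-1"), "latin-1")
    assert set(s._encodings) == {"iso8859-1"}   # only the original so far

    s.recode("utf-8")    # first request: transform and cache
    s.recode("utf-32")   # standing in here for the UCS4 form
    s.recode("utf-8")    # second request: served from the cache, no new copy

    assert len(s._encodings) == 3               # Latin-1, UTF-8, UTF-32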
-jJ