[Python-3000] PyUnicodeObject implementation

Guido van Rossum guido at python.org
Mon Sep 8 00:55:32 CEST 2008


On Sun, Sep 7, 2008 at 2:23 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Guido van Rossum wrote:
>> All in all, given the advantage (half the number of allocations) of
>> the proposal I think there would have to be *very* good arguments
>> against before we reject this outright. I'd like to understand
>> Marc-Andre's reasons too.
>
> As Stefan notes, because of the frequency with which strings are
> manipulated in C code via PyString_* / PyUnicode_* calls, it is a data
> type where "accept no substitutes" prevails.
>
> MAL's primary concern appears to be that having Unicode as a plain
> PyObject leaves the type more open to subclass-based optimisations that
> have been rejected for the builtin types themselves.

Hm. I can't readily imagine what those optimizations might be. Have
any been implemented (in third-party code) in the 8 years that the
Unicode object has existed?

> Having
> PyString/PyBytes as PyVarObjects means that subclasses are more limited
> in what they can do.

True.

> One possibility that occurs to me is to use a PyVarObject variant that
> allocates space for an additional void pointer before the variable sized
> section of the object. The builtin type would leave that pointer NULL,
> but subtypes could perform the second allocation needed to populate it.
>
> The question is whether the 4-8 bytes wasted per object would be worth
> the fact that only one memory allocation would be needed.

I believe that 4-8 bytes is more than the overhead of an extra memory
allocation from the obmalloc heap. It is probably about the same as
the overhead for a memory allocation from the regular malloc heap. So
for short strings (of which there are often a lot) the extra pointer
would be more expensive than a second allocation; for longer strings
it would probably work out just about the same.
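
To make the cost concrete, here is a minimal sketch of the layout Nick
describes next to the plain variable-sized layout. All names (VarHead,
PlainStr, SlotStr, slotstr_new, slot_overhead) are hypothetical and are
not CPython's actual structs; VarHead only stands in for the
PyVarObject header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the PyVarObject header (refcount, type, size). */
typedef struct {
    size_t refcnt;
    void  *type;
    size_t length;
} VarHead;

/* Layout A: header followed directly by the characters. */
typedef struct {
    VarHead head;
    char    data[];
} PlainStr;

/* Layout B: an extra pointer slot before the characters.
   The builtin type would leave it NULL; a subtype could point it
   at a second, separately allocated block. */
typedef struct {
    VarHead head;
    void   *extra;
    char    data[];
} SlotStr;

/* Allocate a SlotStr holding a copy of `text` with a single malloc call. */
static SlotStr *slotstr_new(const char *text)
{
    assert(text != NULL);
    size_t n = strlen(text);
    SlotStr *s = malloc(offsetof(SlotStr, data) + n + 1);
    if (s == NULL)
        return NULL;
    s->head.refcnt = 1;
    s->head.type = NULL;
    s->head.length = n;
    s->extra = NULL;               /* builtin behaviour: slot unused */
    memcpy(s->data, text, n + 1);  /* copy characters and terminator */
    return s;
}

/* Bytes added to every object by the pointer slot. */
static size_t slot_overhead(void)
{
    return offsetof(SlotStr, data) - offsetof(PlainStr, data);
}
```

On common 32-bit and 64-bit platforms slot_overhead() should come out
as sizeof(void *), i.e. the 4-8 bytes under discussion, paid by every
string whether or not a subtype ever uses the slot.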

There could be a different approach though, whereby the offset from
the start of the object to the start of the character array wasn't a
constant but a value stored in the class object. (In fact,
tp_basicsize could probably be used for this.) It would slow down
access to the characters a bit though -- a classic time-space
trade-off that would require careful measurement in order to decide
which is better.
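
A rough sketch of that idea, under the assumption that each type
records the size of its fixed part and the characters follow
immediately after it. The names (StrType, StrHead, HashedStrHead,
str_chars, str_new) are made up for illustration; `basicsize` merely
plays the role suggested above for tp_basicsize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type record; basicsize is the size of an instance's
   fixed part, and therefore the offset of its character array. */
typedef struct {
    const char *name;
    size_t      basicsize;
} StrType;

/* Fixed part shared by every instance. */
typedef struct {
    size_t         refcnt;
    const StrType *type;
    size_t         length;
} StrHead;

/* A subtype that adds its own field after the shared header. */
typedef struct {
    StrHead head;
    long    cached_hash;   /* illustrative extra field only */
} HashedStrHead;

static const StrType base_type   = { "str",       sizeof(StrHead) };
static const StrType hashed_type = { "hashedstr", sizeof(HashedStrHead) };

/* The characters start at a per-type offset, so each access needs an
   extra load of type->basicsize instead of a compile-time constant. */
static char *str_chars(void *obj)
{
    StrHead *h = obj;
    return (char *)obj + h->type->basicsize;
}

/* Single allocation: fixed part, then characters, then terminator. */
static void *str_new(const StrType *type, const char *text)
{
    assert(type != NULL && text != NULL);
    size_t n = strlen(text);
    StrHead *h = calloc(1, type->basicsize + n + 1);
    if (h == NULL)
        return NULL;
    h->refcnt = 1;
    h->type = type;
    h->length = n;
    memcpy(str_chars(h), text, n + 1);
    return h;
}
```

Both the base type and the subtype get by with one allocation and no
wasted pointer slot; the price is the indirection in str_chars(), which
is the time-space trade-off that would need measuring.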

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

