[Python-3000] PyUnicodeObject implementation

Sun Sep 7 16:46:29 CEST 2008

Antoine Pitrou wrote:
> Stefan Behnel <stefan_ml <at> behnel.de> writes:
>> From a Cython perspective, I find the lack of efficient subclassing after such
>> a change particularly striking. That seriously bit me in Py2 when I tried
>> making XML text content a bit more intelligent in lxml (i.e. make it remember
>> what XML element it originated from).
> 
> I've used a library which had adopted this kind of behaviour (I think it was
> BeautifulSoup). After using it several times in a row I noticed memory
> consumption of my program exploded. The problem was that the library was
> returning objects which looked innocently like strings, but internally kept a
> reference to a multi-megabyte HTML tree. The solution was to convert them
> explicitly to str before storing them for later use, which defeated the point of
> having an str-derived type.

I'm aware of that problem.

> In these cases I think it's much friendlier to the user of the API to use
> composition rather than inheritance. Or, simply, just return a raw string and
> let the user keep the context separately if he wants to.

That's not that easy for the result of an arbitrary XPath query. But you can
switch the behaviour off when you build the query, so that it gives you a
straight string as result.
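
As a rough illustration of the trade-off discussed above, here is a minimal Python 3 sketch. It is not lxml's implementation: the names Tree, OriginStr and text_result, and the "smart" flag, are all made up for the example. It shows a string subclass keeping its origin alive, and a plain str copy dropping that reference.

```python
import gc
import weakref


class Tree:
    """Stand-in for a large parsed document (hypothetical)."""

    def __init__(self, payload):
        self.payload = payload


class OriginStr(str):
    """str subclass that remembers the object it was extracted from."""

    def __new__(cls, value, origin):
        self = super().__new__(cls, value)
        self.origin = origin  # strong reference: keeps the whole tree alive
        return self


def text_result(tree, smart=True):
    """Return the first word of the payload, optionally tagged with its origin."""
    value = tree.payload.split()[0]
    return OriginStr(value, tree) if smart else value


def demo():
    tree = Tree("hello world " * 1000)
    tree_ref = weakref.ref(tree)

    smart = text_result(tree)               # remembers the tree
    plain = text_result(tree, smart=False)  # behaviour switched off

    del tree
    gc.collect()
    kept_by_smart = tree_ref() is not None  # the subclass instance still holds it

    copy = str(smart)  # exact str, carries no reference to the tree
    del smart
    gc.collect()
    freed = tree_ref() is None

    return kept_by_smart, freed, plain, type(copy)
```

Calling demo() reports whether the tree survived while only the subclass instance referenced it, and whether it was released once only the plain copy remained.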

> PS: what do you call "efficient subclassing"? if you look at the current
> implementation of unicode_subtype_new() in unicodeobject.c, it isn't very
> efficient (everything including the raw data buffer is allocated twice).

That's something that may be optimised one day without affecting user code. A
different memory layout that prevents C-level subclassing is a very different
kind of change.

Plus, even with the double-allocation, a C-level subclass is still faster than
a Python-level subclass for me. Setup for timeit:

        s = b"abcdef ghijk";
        from lxml.etree import _ElementUnicodeResult;
        u = type("u", (unicode,), {})

$ python2.6 -m timeit -s ... 'unicode(s)'
1000000 loops, best of 3: 0.623 usec per loop

$ python2.6 -m timeit -s ... '_ElementUnicodeResult(s)'
1000000 loops, best of 3: 0.822 usec per loop

$ python2.6 -m timeit -s ... 'u(s)'
1000000 loops, best of 3: 0.849 usec per loop

$ python2.6 -m timeit -s ... 'unicode(s, "utf-8")'
1000000 loops, best of 3: 0.622 usec per loop

$ python2.6 -m timeit -s ... '_ElementUnicodeResult(s, "utf-8")'
1000000 loops, best of 3: 0.806 usec per loop

$ python2.6 -m timeit -s ... 'u(s, "utf-8")'
1000000 loops, best of 3: 0.844 usec per loop

Doing the same with a unicode string as input gives me lower numbers, but a similar picture.
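
For anyone who wants to repeat a comparison of this kind, the following is a sketch of a similar measurement using the timeit module from Python 3. It only covers the built-in type and a Python-level subclass, since the C-level case needs lxml installed. The names U and bench are assumptions for the example, and the numbers will not match the Python 2.6 figures above.

```python
import timeit


class U(str):
    """Python-level subclass, analogous to type("u", (unicode,), {})."""


def bench(number=100000, repeat=3):
    """Return the best per-call time in microseconds for each constructor."""
    namespace = {"s": b"abcdef ghijk", "U": U}
    statements = {
        "str": 'str(s, "utf-8")',
        "U": 'U(s, "utf-8")',
    }
    results = {}
    for name, stmt in statements.items():
        # Take the minimum over several runs, as the timeit CLI does.
        best = min(timeit.repeat(stmt, globals=namespace,
                                 number=number, repeat=repeat))
        results[name] = best / number * 1e6
    return results


if __name__ == "__main__":
    for name, usec in bench().items():
        print("%-4s %.3f usec per loop" % (name, usec))
```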

Stefan