[Python-3000] Allocation of unicode objects

Sun Feb 10 13:53:53 CET 2008

Hi,

Since there are ongoing discussions about allocation algorithms for
various built-in types, I thought I'd mention there's a patch for turning
unicode objects into variable-sized objects (rather than using a
separately-allocated buffer). The aim is to make allocation of those objects
lighter, and to relieve cache and memory pressure a bit.

http://bugs.python.org/issue1943

Marc-André Lemburg expressed skepticism, on the grounds that the patch makes
subclassing unicode objects as part of C extensions more difficult.

Here is a microbenchmark of the patch:

Splitting a small string:
./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
-> Unpatched py3k: 26.4 usec per loop
-> PyVarObject patch: 20.2 usec per loop

Splitting a medium-sized string:
./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
-> Unpatched py3k: 458 usec per loop
-> PyVarObject patch: 316 usec per loop

Splitting a long string:
./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
-> Unpatched py3k: 31.3 msec per loop
-> PyVarObject patch: 17.8 msec per loop
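
For anyone who wants to repeat this kind of measurement without the exact
input files, here is a minimal sketch using the timeit module directly. The
sample strings are synthetic stand-ins (an assumption; they are not INTBENCH,
LICENSE or Misc/HISTORY), and the helper name time_split is made up for this
example. The figures it prints will depend on the machine and interpreter
build, so only compare runs made the same way on both builds.

```python
import timeit


def time_split(text, number=1000, repeat=3):
    """Return the best time per s.split() call, in microseconds.

    Hypothetical helper: it mirrors `python -m timeit "s.split()"` by
    taking the minimum over several repeats.
    """
    timer = timeit.Timer("s.split()", globals={"s": text})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


# Synthetic inputs of three rough sizes (assumed, not the original files).
SAMPLES = {
    "small": ("1234 5678 " * 50, 10000),
    "medium": ("The quick brown fox jumps over the lazy dog. " * 300, 1000),
    "long": ("The quick brown fox jumps over the lazy dog. " * 5000, 20),
}

if __name__ == "__main__":
    for name, (text, number) in SAMPLES.items():
        usec = time_split(text, number=number)
        print(f"{name}: {usec:.1f} usec per loop")
```

Run the same script with the unpatched and the patched interpreter and
compare the three lines of output.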

Even if the patch is rejected, I think it is important to remember that
implementation characteristics of the unicode type will be crucial for Py3k
performance :-)

Regards

Antoine.