patch: speed up name access by up to 80%

Problem: Python name lookup in dictionaries is relatively slow.

Possible solutions:

1. Find ways to bypass dictionary lookup.
2. Optimize the hell out of dictionary lookup.

Examples of approach #1 are the existing fastlocals mechanism, PEP 266, PEP 267 and the recent proposal by GvR for speeding up instance attribute access. These proposals all face major difficulties resulting from the dynamic nature of Python's namespace and require tricky code analysis techniques to check for assignments that may change the visible namespace.

I have chosen to try #2. The biggest difficulty with this approach is that dictionaries are already very well optimized :-) But I have found that Python dictionaries are mostly optimized for general-purpose use, not for use as a namespace for running code. There are some characteristics unique to namespace access which have not been used for optimization:

* Lookup keys are always interned strings.
* The local/global/builtin fallback means that most accesses fail.

The patch adds the function PyDict_GetItem_Fast to dictobject. This function is equivalent to PyDict_GetItem but is much faster for lookups using interned string keys and for lookups with a negative result. LOAD_NAME and LOAD_GLOBAL have been converted to use this function.

Here are the timings for 200,000,000 accesses to fastlocals, locals, globals and builtins for Python 2.2 with and without the fastnames patch:

                    2.2        2.2+fastnames
    ---------------------------------------
    builtin.py      68.540s    37.900s
    global.py       54.570s    34.020s
    local.py        41.210s    32.780s
    fastlocal.py    24.530s    24.540s

Fastlocals are still significantly faster than locals, but not by such a wide margin. You can notice that the speed differences between locals, globals and builtins are almost gone.

The machine is a Pentium III at 866MHz running Linux. Both interpreters were compiled with "./configure ; make". The test code is at the bottom of this message.
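[Editor's sketch] The "lookup keys are always interned strings" observation can be demonstrated from pure Python (using `sys.intern`, spelled as the builtin `intern()` in the 2.2 era): equal interned strings are the very same object, so a key comparison can be a pointer comparison.

```python
import sys

# Build two equal strings at runtime, then intern them: intern()
# returns one canonical object for equal string values.
a = sys.intern("".join(["sp", "am"]))
b = sys.intern("".join(["spa", "m"]))

print(a == b)   # True: equal by value
print(a is b)   # True: the same object, so a pointer test suffices
```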
You will not see any effect on the results of pybench because it uses only fastlocals inside the test loops. Real code that uses a lot of globals and builtins should see a significant improvement.

Get the patch: http://www.tothink.com/python/fastnames/fastnames.patch

The optimization techniques used:

* Inline code. PyDict_GetItem_Fast is actually an inline macro. If the item is stored at the first hash location it will be returned without any function calls. It also requires that the key used for the entry is the same interned string as the lookup key. This fast inline version is possible because the first argument is known to be a valid dictionary and the second argument is known to be a valid interned string with a valid cached hash.

* Key comparison by pointer. If the inline macro fails, PyDict_GetItem_Fast2 is called. This function searches for an entry with a key identical to the requested key. This search is faster than lookdict or lookdict_string because there are no expensive calls to external compare-by-value functions. Another small speedup is gained by not checking for free slots, since this function is never used for setting items. If this search fails, the dictionary's ma_lookup function is called.

* Interning of entry keys. One reason the quick search could fail is that the entry was set using direct access to __dict__ instead of standard name assignment, and therefore the entry key is not interned. In this case the entry key is replaced with the interned lookup key. The next fast search for the same key will succeed, and there is a very good chance that it will be handled by the inline macro.

* Negative entries. In name lookup most accesses fail. In order to speed them up, negative entries can mark a name as "positively not there", usually detected by the macro without requiring any function calls. Negative entries have the interned key as their me_key and a NULL me_value. Negative entries occupy real space in the hash table and cannot be reused as empty slots.
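[Editor's sketch] The shape of the two-level lookup can be modeled in Python: an identity test against the "first hash slot", then a fallback search. The class name `FastNamespace` is invented here for illustration; the real patch is C inside dictobject.c.

```python
import sys

class FastNamespace:
    """Toy model of the two-level lookup: an identity test against a
    one-entry 'first hash slot', then the ordinary dict as fallback."""
    def __init__(self, mapping):
        # Namespace keys are always interned, as in the patch.
        self._d = {sys.intern(k): v for k, v in mapping.items()}
        self._slot_key = None          # models the first hash position
        self._slot_val = None

    def get(self, key):
        if key is self._slot_key:      # the 'inline macro': one pointer test
            return self._slot_val
        v = self._d.get(key)           # the slower fallback search
        self._slot_key, self._slot_val = key, v   # prime the fast path
        return v

ns = FastNamespace({"hex": 5})
k = sys.intern("hex")
print(ns.get(k))   # → 5, via the fallback search, which primes the slot
print(ns.get(k))   # → 5, identity hit: no hashing, no compare-by-value
```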
This optimization technique is not practical for general-purpose dictionaries because some types of code would quickly overload the dictionary with many negative entries. For name lookup the number of negative entries is bounded by the number of global and builtin names referenced by the code that uses the dictionary as a namespace.

This new type of slot has a surprisingly small impact on the rest of dictobject.c. Only one assertion had to be removed to accommodate it. All other code treats it as either an active entry (key != NULL and key != dummy) or a deleted entry (value == NULL, key != NULL) and just happens to do the Right Thing for each case.

If an entry with the same key as a negative entry is subsequently inserted into the dictionary, it will overwrite the negative entry and be reflected immediately in the namespace. There is no caching and therefore no cache coherency issues.

Known bugs:

* Negative entries do not resize the table. If there is not enough free space in the table they are simply not inserted.
* Assumes CACHE_HASH and INTERN_STRINGS without checking.

Future directions:

It should be possible to apply this to more than just LOAD_NAME and LOAD_GLOBAL: attributes, modules, setting items, etc.

The hit rate for the inline macro varies from 100% in most simple cases to 0% in cases where the first hash position in the table happens to be occupied by another entry. Even in these cases it is still very fast, but I want to get more consistent performance. I am starting to experiment with probabilistic techniques that shuffle entries in the hash table and try to ensure that the entries accessed most often are kept in the first hash positions as much as possible.
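[Editor's sketch] The behaviour of negative entries can be modeled with a dict subclass that records misses under a sentinel. This is a simplified illustration; the real patch stores the NULL value directly in a hash table slot, so there is no separate cache to keep coherent.

```python
_NEGATIVE = object()   # plays the role of me_value == NULL

class NamespaceDict(dict):
    """Toy model of negative entries: a failed lookup is recorded as
    'positively not there', so the next failing lookup is a plain hit."""
    def lookup(self, key):
        try:
            v = dict.__getitem__(self, key)
        except KeyError:
            # First miss: insert a negative entry for this name.
            dict.__setitem__(self, key, _NEGATIVE)
            return None
        return None if v is _NEGATIVE else v

ns = NamespaceDict()
print(ns.lookup("hex"))   # → None; a negative entry is created
ns["hex"] = 5             # a real assignment overwrites the negative entry
print(ns.lookup("hex"))   # → 5; visible immediately, no coherency issue
```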
Test code:

* builtin.py

class f:
    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

* global.py

hex = 5
class f:
    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

* local.py

class f:
    hex = 5
    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

* fastlocal.py

def f():
    hex = 5
    for i in xrange(10000000):
        hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
f()

Oren
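[Editor's sketch] For anyone re-running these measurements today, a rough modern equivalent can be written with the timeit module (Python 3, so no xrange; the iteration count here is arbitrary and much smaller than the original 10,000,000):

```python
import timeit

# Twenty uses of the name per loop pass, as in the original scripts,
# so loop overhead is amortized across many lookups.
stmt_builtin = " ; ".join(["hex"] * 20)   # builtin name lookup
stmt_fast    = " ; ".join(["h"] * 20)     # fast local lookup

t_builtin = timeit.timeit(stmt_builtin, number=100_000)
# Assignments in setup become locals of timeit's timing function,
# so 'h' is resolved through the fastlocals mechanism.
t_fast = timeit.timeit(stmt_fast, setup="h = hex", number=100_000)

print("builtin: %.3fs  fastlocal: %.3fs" % (t_builtin, t_fast))
```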

Oren> Problem: Python name lookup in dictionaries is relatively slow.
... [ lots of interesting stuff elided ] ...

This looks pretty much like a no-brainer to me, assuming it stands up to close scrutiny. The only thing I'd change is that if PyDict_GetItem_Fast is a macro that only works for interned strings, I'd change its name to reflect its use: PyDict_GET_ITEM_INTERNED or something similar.

As a further test, I suggest you give the pystone benchmark a whirl, not because it's such a kickass benchmark, but because it occasionally fiddles global variable values. I imagine it will do just fine and probably run a bit faster to boot.

Skip

The only thing I'd change is that if PyDict_GetItem_Fast is a macro that only works for interned strings, I'd change its name to reflect its use: PyDict_GET_ITEM_INTERNED or something similar.

One other naming change is to prefix PyDict_GetItem_Fast2 with an underscore, since the comments about its use make it clear that it's an internal function.

Skip

On Mon, Feb 11, 2002 at 02:52:30PM -0600, Skip Montanaro wrote:
The only thing I'd change is that if PyDict_GetItem_Fast is a macro that only works for interned strings, I'd change its name to reflect its use: PyDict_GET_ITEM_INTERNED or something similar.
One other naming change is to prefix PyDict_GetItem_Fast2 with an underscore, since the comments about its use make it clear that it's an internal function.
Done and up on the same URL.

    PyDict_GetItem_Fast  -> PyDict_GETITEM_INTERNED
    PyDict_GetItem_Fast2 -> _PyDict_GetItem_Interned

Oren

Oren Tirosh wrote:
Problem: Python name lookup in dictionaries is relatively slow.
I tried this patch (*) by running the regression tests:

    make && time ./python -E -tt Lib/test/regrtest.py

All the expected tests passed and there were no failures, this is good. The bad news is that it was slower. It took 42 user seconds longer with the patch than without.

Before patch:
    real    3m1.031s
    user    1m19.480s
    sys     0m2.400s

After patch:
    real    3m38.071s
    user    1m51.760s
    sys     0m2.790s

The box is Linux 2.4, Athlon 650, 256 MB.

(*) Pretty sure this was patch #1, running sum yields: 53200 10. But it shouldn't matter, since it was only a name change, right?

Neal

On Mon, Feb 11, 2002 at 05:26:01PM -0500, Neal Norwitz wrote:
Oren Tirosh wrote:
Problem: Python name lookup in dictionaries is relatively slow.
I tried this patch (*) by running the regression tests:
make && time ./python -E -tt Lib/test/regrtest.py
All the expected tests passed and there were no failures, this is good. The bad news is that it was slower. It took 42 user seconds longer with the patch than without.
I have tried this and got the same results for the patched and unpatched versions (±1 second). The regression tests spend most of their time on things like threads, sockets, signals, etc. that have a lot of variance and are not really affected by name lookup speed.

I got strange results comparing to the python2.2 RPM package (some faster, some slower). I didn't start to get consistent results until I used two freshly compiled interpreters. I wonder with what options this package was compiled.

Oren

Oren Tirosh wrote:
On Mon, Feb 11, 2002 at 05:26:01PM -0500, Neal Norwitz wrote:
Oren Tirosh wrote:
Problem: Python name lookup in dictionaries is relatively slow.
I tried this patch (*) by running the regression tests:
make && time ./python -E -tt Lib/test/regrtest.py
All the expected tests passed and there were no failures, this is good. The bad news is that it was slower. It took 42 user seconds longer with the patch than without.
I have tried this and got the same results for the patched and unpatched versions (+-1 second). The regression tests spend most of their time on things like threads, sockets, signals, etc that have a lot of variance and are not really affected by name lookup speed.
I rebuilt everything from scratch and got results similar to Oren's, i.e., roughly the same. This time I took off the test-coverage flags. (Sorry, I must have had them off for stock, but on with Oren's patch.)

Before patch:
    real    2m57.416s
    user    1m12.830s
    sys     0m2.580s

After patch:
    real    2m56.017s
    user    1m14.960s
    sys     0m2.380s

I still have inlines turned off.

Neal

So simple benchmark programs are a lot more interesting. I'd pick pystone, test_hmac, and test_htmlparser. Jeremy

On Mon, Feb 11, 2002 at 07:14:25PM -0500, Jeremy Hylton wrote:
So simple benchmark programs are a lot more interesting.
I'd pick pystone, test_hmac, and test_htmlparser.
    test_htmlparser (x100):  0m29.950s  0m29.730s
    test_hmac (x1000):       0m16.480s  0m15.720s   (lower is better)
    pystone:                 11261.3    11494.3     (higher is better)

A small but measurable improvement. You can see below that most accesses are still to fastlocals and, of course, the code has some real work to do other than looking up names.

test_htmlparser:
    362331 fastlocal non-dictionary lookups
     60106 inline dictionary lookups
     10554 fast dictionary lookups
       151 slow dictionary lookups

test_hmac:
     13959 fastlocal non-dictionary lookups
      9920 inline dictionary lookups
      7548 fast dictionary lookups
       240 slow dictionary lookups

pystone:
   1447094 fastlocal non-dictionary lookups
    502190 inline dictionary lookups
    111549 fast dictionary lookups
       111 slow dictionary lookups

Does anyone have an example of a program that relies on a lot of global and builtin name accesses? Meanwhile I'm going to start working on LOAD_ATTR.

if-the-evidence-doesn't-fit-the-theory...-ly yours,

Oren

On Mon, Feb 11, 2002 at 05:55:38PM -0500, Oren Tirosh wrote:
I got strange results comparing to the python2.2 RPM package (some faster, some slower). I didn't start to get consistent results until I used two freshly compiled interpreters.
The 2.2-3 RPM from python.org does "./configure --prefix=/usr" (unless you enable the pymalloc or ipv6 flags before building, the RPMs up there do not). It then does "make". About the only thing that may be unusual would be that RPM may automatically strip the resulting binary. I'm not doing it manually. Interesting that you're seeing oddness.

Note that if you download the SRPM, you can install the .src.rpm and then build it by doing:

    rpm -bc /usr/src/redhat/SPECS/python-2.2.spec

(or other similar location where the spec file would be installed, depending on your distribution). Also note that you can have it build a patched version by adding the patch below the "Patch1:" line as "Patch2:", and also adding a line below "%patch1" which reads "%patch2" (unless you have to give options such as "-p1" -- see the example for "%patch0"). You can then do a fresh build of the code using the above command. You can also build an RPM by using "rpm -ba [...]".

One of the big wins with a packaging system -- reproducibility...

Sean
-- 
Let us live!!! Let us love!!! Let us share the deepest secrets of our souls!!! You first.
Sean Reifschneider, Inimitably Superfluous <jafo@tummy.com>
tummy.com - Linux Consulting since 1995. Qmail, KRUD, Firewalls, Python

Some other things you might want to try:

* Inline small dictionary tables in the PyObject struct and only revert to external tables for larger ones. (I have an old patch for this one which you might want to update.)

* Optimize for perfect hashing. Sometimes (hopefully most of the time) Python will generate a perfect hashing for a set of attributes. In that case, it could set a flag in the dictionary object to be able to use a faster lookup function.

BTW, could you run pybench against your patch?

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/

On Tue, Feb 12, 2002 at 12:40:11PM +0100, M.-A. Lemburg wrote:
* Inline small dictionary tables in the PyObject struct and only revert to external tables for larger ones. (I have an old patch for this one which you might want to update)
* Optimize perfect hashings. Sometimes (hopefully most of the times) Python will generate a perfect hashing for a set of attributes. In that case, it could set a flag in the dictionary object to be able to use a faster lookup function.
Interesting, but I am exploring other directions now: attribute access, hints associated to negative entries that should speed up the next lookup in the chain and getting the inline/fast ratio from 3:1 up to 10:1 or higher.
BTW, could you run pybench against your patch ?
  18331152 fastlocal non-dictionary lookups
    416661 inline dictionary lookups
    131509 fast dictionary lookups
       200 slow dictionary lookups

With 97% of accesses using fastlocals it's not going to have any significant effect.

Oren

* Inline small dictionary tables in the PyObject struct and only revert to external tables for larger ones. (I have an old patch for this one which you might want to update)
I may be missing some context, but AFAIK we already do this. See dictobject.h: the last item in struct _dictobject is

    PyDictEntry ma_smalltable[PyDict_MINSIZE];

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
* Inline small dictionary tables in the PyObject struct and only revert to external tables for larger ones. (I have an old patch for this one which you might want to update)
I may be missing some context, but AFAIK we already do this. See dictobject.h: the last item in struct _dictobject is
PyDictEntry ma_smalltable[PyDict_MINSIZE];
Nice :-) I must have missed that addition (or simply forgotten about it).

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/
participants (7)

- Guido van Rossum
- Jeremy Hylton
- M.-A. Lemburg
- Neal Norwitz
- Oren Tirosh
- Sean Reifschneider
- Skip Montanaro