Problem: Python name lookup in dictionaries is relatively slow.
Examples of approach #1 are the existing fastlocals mechanism, PEP 266, PEP 267 and the recent proposal by GvR for speeding up instance attribute access. These proposals all face major difficulties arising from the dynamic nature of Python's namespaces, and require tricky code-analysis techniques to check for assignments that may change the visible namespace.
I have chosen to try #2. The biggest difficulty with this approach is that dictionaries are already very well optimized :-) But I have found that Python dictionaries are mostly optimized for general-purpose use, not for use as namespaces for running code. There are some characteristics unique to namespace access which have not yet been exploited for optimization.
The patch adds the function PyDict_GetItem_Fast to dictobject.c. This function is equivalent to PyDict_GetItem but is much faster for lookups using interned string keys and for lookups with a negative result. LOAD_NAME and LOAD_GLOBAL have been converted to use this function.
Here are the timings for 200,000,000 accesses to fastlocals, locals, globals and builtins for Python 2.2 with and without the fastnames patch:
                  Python 2.2    with fastnames
    builtin.py       68.540s           37.900s
    global.py        54.570s           34.020s
    local.py         41.210s           32.780s
    fastlocal.py     24.530s           24.540s
Fastlocals are still significantly faster than locals, but not by such a wide margin. Note that the speed differences between locals, globals and builtins have almost disappeared.
The machine is a Pentium III at 866MHz running Linux. Both interpreters were compiled with "./configure ; make". The test code is at the bottom of this message. You will not see any effect on the results of pybench because it uses only fastlocals inside its test loops. Real code that uses a lot of globals and builtins should see a significant improvement.
Get the patch:
The optimization techniques used:
Inline code: PyDict_GetItem_Fast is actually an inline macro. If the item is stored at the first hash location it is returned without any function calls. The macro also requires that the entry's key be the very same interned string object as the lookup key. This fast inline version is possible because the first argument is known to be a valid dictionary and the second argument is known to be a valid interned string with a valid cached hash.
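The fast path can be modeled in Python (a toy sketch of the idea, not the patch's actual C code; the class and method names here are made up for illustration):

    import sys

    class ToyNamespace:
        """Toy model of the lookup: the 'inline macro' checks only the key's
        first hash slot and compares keys by pointer identity; anything else
        falls through to a slower path."""
        SIZE = 8  # fixed power-of-two table; the toy does not resize,
                  # so keep fewer than SIZE entries

        def __init__(self):
            self.slots = [None] * self.SIZE   # each slot: (key, value) or None

        def _first_slot(self, key):
            return hash(key) & (self.SIZE - 1)

        def set(self, key, value):
            key = sys.intern(key)             # name assignment interns its key
            i = self._first_slot(key)
            while self.slots[i] is not None and self.slots[i][0] is not key:
                i = (i + 1) % self.SIZE       # toy linear probing on collision
            self.slots[i] = (key, value)

        def get(self, key):
            # "Inline macro": one slot check, one pointer comparison.
            # Callers must pass an interned key.
            entry = self.slots[self._first_slot(key)]
            if entry is not None and entry[0] is key:
                return entry[1]
            return self._get_slow(key)        # models PyDict_GetItem_Fast2

        def _get_slow(self, key):
            for entry in self.slots:
                if entry is not None and entry[0] is key:  # still by pointer
                    return entry[1]
            # a real lookup would finally fall back to the dict's ma_lookup
            raise KeyError(key)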
Key comparison by pointer: If the inline macro fails, PyDict_GetItem_Fast2 is called. This function searches for an entry with a key identical to the requested key. This search is faster than lookdict or lookdict_string because there are no expensive calls to external compare-by-value functions. Another small speedup is gained by not checking for free slots, since this function is never used for setting items. If this search fails, the dictionary's ma_lookup function is called.
Interning of entry keys: One reason the quick search could fail is that the entry was set using direct access to __dict__ instead of standard name assignment, and therefore the entry key is not interned. In this case the entry key is replaced with the interned lookup key. The next fast search for the same key will succeed, and there is a very good chance that it will be handled by the inline macro.
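The effect of re-interning a stored key can be shown in a few lines of present-day Python (Python 2.2 spelled sys.intern as the intern() builtin):

    import sys

    interned = sys.intern("some_name")           # canonical interned object
    runtime_key = "".join(["some", "_name"])     # equal string built at runtime
    # The two compare equal by value, but the fast path compares by identity,
    # so an entry stored under runtime_key would miss the inline check.
    # Replacing the stored key with its interned twin fixes later lookups:
    assert runtime_key == interned
    assert sys.intern(runtime_key) is interned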
Negative entries: In name lookup, most accesses fail. In order to speed them up, negative entries can mark a name as "positively not there", usually detected by the macro without requiring any function calls. Negative entries have the interned key as their me_key and me_value is NULL. Negative entries occupy real space in the hash table and cannot be reused as empty slots. This optimization technique is not practical for general-purpose dictionaries because some types of code would quickly overload the dictionary with many negative entries. For name lookup the number of negative entries is bounded by the number of global and builtin names referenced by the code that uses the dictionary as a namespace.
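A toy Python model of the idea (illustration only, not the patch's C code; all names here are hypothetical) shows how a globals dict could record misses so repeated lookups of builtin names fail immediately:

    _NEGATIVE = object()   # models a negative entry: me_key set, me_value NULL

    class GlobalsWithNegativeEntries:
        def __init__(self):
            self.d = {}

        def get(self, name):
            """Return the value, or None for a miss. (Toy caveat: a real dict
            can store None as a value; the C code signals this differently.)"""
            v = self.d.get(name)
            if v is _NEGATIVE:
                return None              # cached miss: "positively not there"
            if v is not None:
                return v
            self.d[name] = _NEGATIVE     # first miss leaves a negative entry
            return None

        def set(self, name, value):
            self.d[name] = value         # assignment overwrites a negative
                                         # entry, so there is no stale cache

    def load_global(g, builtins, name):
        """Model of LOAD_GLOBAL: try globals, then fall back to builtins."""
        v = g.get(name)
        return v if v is not None else builtins.get(name)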
This new type of slot has a surprisingly small impact on the rest of dictobject.c. Only one assertion had to be removed to accommodate it. All other code treats a negative entry as either an active entry (key != NULL and key != dummy) or a deleted entry (value == NULL, key != NULL) and just happens to do the Right Thing for each case.
If an entry with the same key as a negative entry is subsequently inserted into the dictionary it will overwrite the negative entry and be reflected immediately in the namespace. There is no caching and therefore no cache coherency issues.
Negative entries do not resize the table. If there is not enough free space in the table they are simply not inserted.
The patch assumes CACHE_HASH and INTERN_STRINGS are defined, without checking.
It should be possible to apply this to more than just LOAD_NAME and LOAD_GLOBAL: attribute access, modules, setting items, etc.
The hit rate for the inline macro varies from 100% in most simple cases to 0% in cases where the first hash position in the table happens to be occupied by another entry. Even in these cases it is still very fast, but I want to get more consistent performance. I am starting to experiment with probabilistic techniques that shuffle entries in the hash table and try to ensure that the entries accessed most often are kept in the first hash positions as much as possible.
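One such shuffle policy is the classic move-to-front heuristic. The patch's table is open-addressed, which makes shuffling trickier, but the idea can be sketched with separate chaining instead (a hypothetical class written for this post, not code from the patch):

    class MTFChainedDict:
        """Move-to-front sketch: a hot entry migrates to the head of its
        collision chain, so the next lookup finds it on the first probe."""

        def __init__(self, size=8):
            self.size = size
            self.buckets = [[] for _ in range(size)]

        def set(self, key, value):
            chain = self.buckets[hash(key) % self.size]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def get(self, key):
            chain = self.buckets[hash(key) % self.size]
            for i, (k, v) in enumerate(chain):
                if k == key:
                    if i:                             # found later in the chain:
                        chain.insert(0, chain.pop(i)) # promote it to the front
                    return v
            raise KeyError(key)

In an open-addressed table the analogous move is swapping an entry into its first hash slot, but the swap must not make the displaced entry unreachable along its own probe sequence, which is why the real experiment is probabilistic rather than unconditional.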
builtin.py:

    class f:
        for i in xrange(10000000):
            hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

global.py:

    hex = 5
    class f:
        for i in xrange(10000000):
            hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

local.py:

    class f:
        hex = 5
        for i in xrange(10000000):
            hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex

fastlocal.py:

    def f():
        hex = 5
        for i in xrange(10000000):
            hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex