[Python-ideas] Exporting dict Items for direct lookups of specific keys

Eyal Lotem eyal.lotem at gmail.com
Mon Jun 11 03:43:28 CEST 2007


On 6/10/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Eyal Lotem" <eyal.lotem at gmail.com> wrote:
> > On 6/10/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > "Eyal Lotem" <eyal.lotem at gmail.com> wrote:
> > > > B. Some lookups use static "literal" keys, and could benefit from
> > > > accessing a dict item directly (very fast O(1) _worst_ case, without
> > > > even a function call when done from the C side).
> > >
> > > [snip]
> > >
> > > What you are proposing is to add a level of indirection between some
> > > pointer in the dictionary to some special PyDictItem object. This will
> > > slow Python execution when such a thing is used.  Why? The extra level
> > > of indirection requires another pointer following, as well as a
> > > necessary check on the bitmap you are proposing, nevermind the
> > > additional memory overhead of having the item (generally doubling the
> > > size of dictionary objects that use such 'exported' items).
> > >
> > > You don't mention what algorithm/structure will allow for the accessing
> > > of your dictionary items in O(1) time, only that after you have this
> > > bitmap and dictionary item thingy that it will be O(1) time (where
> > > dictionaries weren't already fast enough natively). I don't believe you
> > > have fully thought through this, but feel free to post C or Python
> > > source that describes your algorithm in detail to prove me wrong.
> >
> > Only access of exported items is O(1) time (when accessed via your
> > PyDictItem_obj->value), other items must be accessed normally and they
> > take just as much time (or as I explained and you reiterated, a tad
> > longer, as it requires a bitmap check and in the case of exported
> > items another dereference).
>
> But you still don't explain *how* these exported keys are going to be
> accessed.  Walk me through the steps required to improve access times in
> the following case:
>
> def foo(obj):
>     return obj.foo
>
>
I think you missed my point - the functionality should probably not be
exposed to Python at all, as Python code has little to gain from it
(it would have to getattr a C method just to request the exported
item -- which would nullify the speed benefit).
It is C code that gains direct access to the exported dict items -
not Python code.
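To make the idea concrete, here is a rough pure-Python model of the
exported-item indirection (ExportedItem and ExportingDict are made-up
names for illustration; the real thing would be C-level PyDictItem
structures inside dictobject.c, and the `key in self._exported` test
stands in for the proposed bitmap check):

```python
class ExportedItem:
    """Models the proposed PyDictItem: a stable cell whose 'value'
    field tracks one specific key in one specific dict."""
    __slots__ = ('value',)

    def __init__(self, value=None):
        self.value = value


class ExportingDict(dict):
    """A dict that keeps exported items up to date on every write."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._exported = {}  # key -> ExportedItem

    def export(self, key):
        """Return a stable item for 'key'; C code would then read
        item.value directly, with no hashing or probing."""
        return self._exported.setdefault(key, ExportedItem(self.get(key)))

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if key in self._exported:  # stands in for the bitmap check
            self._exported[key].value = value

    def __delitem__(self, key):
        super().__delitem__(key)
        if key in self._exported:
            self._exported[key].value = None  # NULL == name not bound


d = ExportingDict(x=1)
item = d.export('x')
d['x'] = 42
# item.value now reflects the update with no fresh dict lookup
```

A real implementation would also have to hook every other mutation
path (update, pop, clear, ...); this sketch only covers item
assignment and deletion.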

> > > You should note that Python's dictionary implementation has been tuned
> > > to work *quite well* for the object attribute/namespace case, and I
> > > would be quite surprised if anyone managed to improve upon Raymond's
> > > work (without writing platform-specific assembly).
> >
> > Of course - the idea is not to improve dict performance for the
> > normal way it is accessed, but to change the way it is accessed for
> > the specific use case of accessing static values in a static dict -
> > which can be faster than even a fast dict lookup.
>
> If I have a dictionary X, and X has exported keys, then whenever I
> access exported values in the dictionary via X[key], your proposed
> indirection will necessarily be slower than the current implementation.
That is true, I acknowledged that. It is also true that access to
X[key] is slower even when key is not exported.  When I have a few
spare moments, I'll try to benchmark how much slower it is.
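For what it's worth, the baseline side of such a benchmark is easy to
sketch with timeit (the dict contents and iteration count here are
illustrative only; this measures the current lookup cost that the
indirection scheme would have to stay close to):

```python
import timeit

d = {'foo': 1, 'bar': 2}

# Baseline: the current dict lookup of a literal key.  Any
# exported-item implementation adds a bit-check plus an extra
# dereference to this path, so its per-lookup overhead needs to be
# small relative to this figure.
plain = timeit.timeit("d['foo']", globals={'d': d}, number=1_000_000)
print(f"1e6 literal-key lookups: {plain:.3f}s")
```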

> > The dict lookups in globals, builtins are all looking for literal
> > static keys in a literal static dict. In this specific case, it is
> > better to outdo the existing dict performance, by adding a special way
> > to access such static keys in dicts - which insignificantly slows down
> > access to the dict, but significantly speeds up this very common use
> > pattern.
>
> Please benchmark before you claim "insignificant" performance
> degradation in the general case. I claim that adding a level of
> indirection and the accessing of a bit array (which in C is technically
> a char array with every bit getting a full char, unless you use masks
> and shifts, which will be slower still) is necessarily slower than the
> access characteristics of current dictionaries.  We can see this as a
> combination of an increase in the number of operations necessary to do
> arbitrary dictionary lookups, increased cache overhead of those lookups,
> as well as the delay and cache overhead of accessing the bit array.
You are right that we disagree there, but until I benchmark, all words are moot.

> > Attribute lookups in the class dict are all literal/static key lookups
> > in a static dict (though in order for a code object to know that it is
> > a static dict, a psyco-like system is required. If such a system is
> > used, all of those dict lookups can be made faster as well).
>
> No, attribute lookups are not all literal/static key lookups.  See
> getattr/setattr/delattr, operations on cls.__dict__, obj.__dict__, etc.

I may have oversimplified a bit for the sake of explanation.  I was
referring to the operations performed by LOAD_ATTR, as an example.
Let's analyze the LOAD_ATTR bytecode instruction:
 * It calls PyObject_GetAttr for the requested attribute name on the
requested object.
 * PyObject_GetAttr redirects to the type's tp_getattr[o] slot.
 * When tp_getattr[o] is not overridden, this calls PyObject_GenericGetAttr.
 * PyObject_GenericGetAttr first looks for a descriptor in the dicts
of every class in the entire __mro__. If it finds a getter/setter
(data) descriptor, it uses that. If it doesn't, it tries the
instance dict, and then falls back to the class descriptor/attribute.
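The steps above can be sketched in pure Python (a simplified model of
PyObject_GenericGetAttr; it ignores metaclasses, __getattr__, and
__slots__ corner cases):

```python
def generic_getattr(obj, name):
    """Simplified model of PyObject_GenericGetAttr."""
    tp = type(obj)
    _MISSING = object()
    meta_attr = _MISSING

    # 1. Walk the mro, scanning each class dict for the name.
    for klass in tp.__mro__:
        if name in klass.__dict__:
            meta_attr = klass.__dict__[name]
            break

    # 2. A data descriptor (defines both __get__ and __set__, e.g. a
    #    property) wins over the instance dict.
    if meta_attr is not _MISSING:
        descr_type = type(meta_attr)
        if hasattr(descr_type, '__get__') and hasattr(descr_type, '__set__'):
            return descr_type.__get__(meta_attr, obj, tp)

    # 3. Otherwise, the instance dict is tried next.
    inst_dict = getattr(obj, '__dict__', None)
    if inst_dict is not None and name in inst_dict:
        return inst_dict[name]

    # 4. Finally, fall back to a non-data descriptor or plain class attr.
    if meta_attr is not _MISSING:
        if hasattr(type(meta_attr), '__get__'):
            return type(meta_attr).__get__(meta_attr, obj, tp)
        return meta_attr
    raise AttributeError(name)
```

Note how the mro walk in step 1 happens on *every* attribute access;
that is the cost the mro cache below is meant to remove.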

I believe this implementation to be very wasteful (specifically the
last step), and I have posted separately about this on python-dev.

There is work being done to create an mro cache for types, which
would allow the mro lookup to be converted to a single lookup in most
cases. I believe that this mro cache should become a single dict
object inside each type object, holding a merge (in mro order) of all
the dicts in its mro.  If this modification is done, then
PyObject_GenericGetAttr can become a lookup in the instance dict
(which, btw, can also disappear when __slots__ is used in the class)
and a lookup in the mro cache dict.  If this is the case, then
LOAD_ATTR, which is most often used with literal names, can (if the
type of the object being accessed is known [via a psyco-like system])
become a regular lookup in the instance dict and a "static lookup" in
the class's mro cache dict (which would use an exported dict item).

If the psyco-like system can create code objects that are specific
not just to one type but to a specific instance, even the instance
lookup of the literal attribute name can be converted to a "static
lookup" in the instance's dict.

Since virtually all LOAD_ATTRs use literal strings, virtually all of
the class-side "dynamic lookups" can be converted to "static
lookups".  Since a "static lookup" costs a dereference and a
conditional, while a dynamic lookup entails at least 4 C function
calls (including C stack setup/unwind), a few C assignments and C
conditionals, I believe it is likely that this will pay off as a
serious improvement in Python's performance when combined with a
psyco-like system (one that is not architecture-dependent).
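The merged mro cache I describe can be sketched in pure Python
(build_mro_cache is a made-up name for illustration; a real
implementation would live in typeobject.c and be invalidated whenever
a class along the mro is mutated):

```python
def build_mro_cache(tp):
    """Merge the dicts of all classes in tp.__mro__ into one dict,
    with more-derived classes taking precedence, so that a single
    dict lookup replaces the per-access mro walk."""
    cache = {}
    for klass in reversed(tp.__mro__):  # base classes first...
        cache.update(klass.__dict__)    # ...derived classes overwrite
    return cache


class Base:
    def ping(self):
        return 'base'


class Derived(Base):
    def ping(self):
        return 'derived'


cache = build_mro_cache(Derived)
# cache['ping'] is Derived's method: one lookup instead of an mro scan
```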

> From what I can gather (please describe the algorithm now that I have
> asked twice), the only place where there exists improvement potential is
> in the case of global lookups in a module.  That is to say, if a
> function/method in module foo is accessing some global variable bar, the
> compiler can replace LOAD_GLOBAL/STORE_GLOBAL/DEL_GLOBAL with an opcode
> to access a special PyDictItem object that sits in the function/method
> cell variables, rather than having to look in the globals dictionary
> (that is attached to every function and method).
As I explained above, there is room for improvement in normal
attribute lookups, but that improvement requires a psyco-like system
in order to deduce which dicts are going to be accessed by the
GetAttr mechanism, so that static lookups can be used to access them
directly.
With access to globals and builtins, this optimization is indeed
easier, and your description is correct. I can be a little more
specific:
* Each code object will contain "co_globals_dict_items" and
"co_builtins_dict_items" attributes that refer to the exported dict
items for each literal name in the globals and builtins dicts.
* At compilation time, each access to a literal name in the
globals/builtins namespace will request that the globals dict and
builtins dict create an exported item for that literal name. This
exported item will be put into the
co_globals_dict_items/co_builtins_dict_items of the code object.
* LOAD_GLOBAL will not be used when a literal name is accessed.
Instead, a new bytecode instruction, "LOAD_LITERAL_GLOBAL", will be
emitted with an index into the
"co_globals_dict_items"/"co_builtins_dict_items" tuples in the code
object.
* LOAD_LITERAL_GLOBAL will use the index to find the
PyExportedDictItem in those tuples; its implementation would look
like this (with more verbose naming for clarity):
  case LOAD_LITERAL_GLOBAL:
      exported_dict_item = GETITEM(co->co_globals_dict_items, oparg);
      x = exported_dict_item->value;
      if (NULL == x) {
          exported_dict_item = GETITEM(co->co_builtins_dict_items, oparg);
          x = exported_dict_item->value;
          if (NULL == x) {
              format_exc_check_arg(PyExc_NameError, MSG,
                                   GETITEM(co->co_globals_names, oparg));
              break;
          }
      }
      Py_INCREF(x);
      PUSH(x);
      continue;
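In Python terms, the opcode above behaves like this (Cell and
load_literal_global are illustrative names modelling
PyExportedDictItem and the opcode's logic; a value of None stands in
for the NULL that marks an unbound name):

```python
class Cell:
    """Stands in for PyExportedDictItem: value is None when unbound."""
    def __init__(self, value=None):
        self.value = value


def load_literal_global(globals_items, builtins_items, oparg, names):
    """Python model of LOAD_LITERAL_GLOBAL: try the exported globals
    cell first, fall back to the builtins cell, else NameError."""
    x = globals_items[oparg].value
    if x is None:
        x = builtins_items[oparg].value
        if x is None:
            raise NameError(f"name {names[oparg]!r} is not defined")
    return x
```

Note that both lookups are plain array indexing plus a dereference -
no hashing, probing, or string comparison on the fast path.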

I hope that with these explanations and code snippets my intentions
are clearer.

>  - Josiah