Mailman 3 Contiguous-array-based ordering for OrderedDict - Python-ideas

Dec. 12, 2015

      Here's Tim Peters's 2013 explanation of OrderedDict:
http://stackoverflow.com/a/18951209/2963903

He said:
...
But if [the order list] were a Python list, deleting a key would take O(n) time twice over: O(n) time to find the key in the list, and O(n) time to remove the key from the list.
I didn't see why this would need to be the case.
1. If you start searching from the front, then each element would be
inspected or shifted, never both. This would mean only one full
iteration.
2. You don't have to shift for each deletion. You can wait until some
threshold is reached before shifting, and maybe that will spread out
the cost of shift among deletions enough to make an impact.
3. He then mentions that the linkedlist implementation uses a second
dict to map keys to their list nodes. Well, that would solve the cost
of lookup for lists, too.

So I ended up taking 3.5.0's OrderedDict and implemented the idea. I'm
not really sure what to do next, in terms of testing and comparing it
against the existing implementation, and whether it's worth pushing
for such a design for OrderedDict (current Python version or future C
cersion). It's enough for me if I can figure out how this compares in
performance, both algorithmically and actually, to the current
implementation.

--------------------------------------------------------------------

First idea: (link: http://pastebin.com/LESRktJw)

    Data members (with ListDict() as od):
    - super() : dict[key] => value
    - od.__map : dict[key] => index where od.__list[index] = key
    - od.__list : list[index] => key or `_sentinel`
    - od.__size : int - number of non-sentinel elements in list (aka len(dict))
        - This might be replaceable by len(self).

    Some algorithms to note:
    - .__setitem__: Add it to the end of __list if it's new.
    - .__delitem__: Replace the __list entry with `sentinel`, and
attempt to compact the list.
    - .__compact: Called when items are removed. Naive threshold: If
the list is 50% empty, then make a new list and update the indices for
each key.
    - (iterator views): Simply yield things that aren't `sentinel`.
    - .move_to_end(last=False): The biggest loss. Adding to the front
of the list will shift potentially many things up. Maybe if the list
were a dequeue, this would be faster.

    The actual length of __list will slow down iteration through the
list, but the factor of O(n) is dependent on the threshold value. And
the code itself MIGHT be faster per-element, and maybe (big maybe)
easier for the hardware.

    Needs special review:
    - __sizeof__: I have no idea if I did this right.
    - (iterators): Can they be made to sometimes detect
additions/removals during iteration?
    - Exception safety: E.g. in __setitem__, the order of updates for
the metadata might leave the dict in an invalid state.

    Also needs review:
    - I put a lot of global calls as default args, but the existing
use of default args for static name lookup seemed inconsistent.

--------------------------------------------------------------------

Second idea: thicker wrapper.

    Store a key,value,index triple in the internal dictionary, instead
of just the value.

    This will have a wrapper cost on __getitem__ (have to get the
value from the triple), but will save many lookups for adding and
removing elements (which is probably the less-common use).

    (At the C level, it should be possible to keep multiple storage
dictionaries in lockstep. Also, the triple doesn't need to be exposed
to the Python interface, right?)

    Type:
        class _DictStorage(object):
            __slots__ = ['key', 'value', 'index']

    Members:
        (external): ListDict[key] => value
        super() : dict[key] => [key, value, index]
        .__list : list[i] => [key, value, index] (same as above) or None
        .__size : as in the other implementation. Probably just as unnecessary.
        .__map : (removed; unneeded)

    Some method implementations:
        .__getitem__:
            Grab the [k,v,i] from the inner dict and unwrap it.
        .__setitem__:
            If the triple exists, update its value.
            If it doesn't exist, add in [k, v, len(__list)-1]
                and append it to __list.
            (This saves a lookup in the internal dict.)
        .__delitem__:
            inner.pop(key), and use the index to None it on the list.
        (iteration):
            Same as the first ListDict implementation, except using None
            instead of _sentinel.
        .move_to_end, .__compact:
            Same as the first ListDict, except I won't have to
            dict.get the keys to update their indices.

Third idea: Well, I have to learn PyPy's implementation of OrderedDict.

Contiguous-array-based ordering for OrderedDict

tags

participants (10)