[Python-ideas] Contiguous-array-based ordering for OrderedDict

Serhiy Storchaka storchaka at gmail.com
Mon Dec 14 09:58:17 EST 2015


On 13.12.15 06:57, Andrew Barnert via Python-ideas wrote:
> Let's go back to the original message. He started off talking about Tim Peters' post, which explains the need for a linked list because in an array, searching for the value takes linear time, and then removing it in-place takes linear time.

Searching a linked list for a value takes linear time too. To avoid 
that, an auxiliary hashtable is used. In the current Python 
implementation it maps a key to a list node; in the proposed 
implementation it maps a key to an index in a contiguous array. There 
is no linear search anymore, this problem is already solved.
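
The idea is easy to sketch in pure Python (a rough illustration only; 
the class and names below are mine, not the actual C data structures):

    class ArrayOrderedDict:
        """Illustrative sketch: an ordered mapping backed by a contiguous
        array of slots plus an auxiliary key -> index hashtable."""

        _DELETED = object()          # sentinel marking a vacated slot

        def __init__(self):
            self._slots = []   # contiguous array of [key, value] pairs
            self._index = {}   # auxiliary hashtable: key -> position in _slots

        def __setitem__(self, key, value):
            i = self._index.get(key)
            if i is not None:
                self._slots[i][1] = value          # overwrite in place
            else:
                self._index[key] = len(self._slots)
                self._slots.append([key, value])   # new keys go at the end

        def __getitem__(self, key):
            return self._slots[self._index[key]][1]   # O(1), no linear search

        def __delitem__(self, key):
            i = self._index.pop(key)       # O(1) lookup of the slot
            self._slots[i] = self._DELETED # leave a hole, repack later

        def __iter__(self):
            for slot in self._slots:
                if slot is not self._DELETED:
                    yield slot[0]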

> He suggested fixing that by batching up the deletes. That solves the cost of removing, but it doesn't solve the problem with the linear search at all--in fact, it makes things worse, because you're now searching a sparse array. That's what I was commenting on here.
>
> He later suggested maybe using a second hash for indices, as the current design does. This of course does solve the problem of the search, but then you have to change half the hash values for each shift, which makes that part even worse. I commented on that further below.

What do you mean by "change half the hash values for each shift", and 
why is this a problem?

> What about combining the two? He didn't mention that, but if you do that, every time you compact the array, you have to do a hash rebuild. Yes, a normal dict does this, but only logarithmically often, not every N/2 operations.

Let's distinguish between two scenarios. In the first scenario, the 
dictionary is only growing. Every O(N) additions you need to resize and 
rebuild a hashtable of size O(N). This gives amortized constant time 
per addition. In the proposed OrderedDict design you also need to 
resize an array and resize and rebuild an auxiliary hashtable; both 
have linear complexity. The resulting amortized time is constant again.
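
For concreteness, with geometric growth the total rebuild work over N 
additions is bounded by a geometric series. A quick sanity check in 
Python (the doubling factor and initial size are just examples):

    def total_resize_work(n, factor=2, initial=8):
        """Sum the O(size) rebuild costs incurred while growing to n items."""
        work, size = 0, initial
        while size < n:
            size *= factor
            work += size   # rebuilding a table/array of this size costs ~size
        return work

    # total_resize_work(10**6) is about 2 * 10**6,
    # i.e. O(1) amortized per addition.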

In the second scenario, the dictionary has an equal number of additions 
and deletions, and its size doesn't change. For a dict this doesn't add 
anything to the base cost of addition and deletion. In the proposed 
OrderedDict design, since almost every addition appends an item to an 
array, periodic repacking is needed. It has linear cost (repacking an 
array and updating an auxiliary hashtable). The total amortized cost of 
one addition or deletion operation is constant.
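
The repacking step itself is just one linear pass over the array. A 
sketch, continuing the illustrative model above (standalone function, 
names are mine):

    def repack(slots, index, deleted):
        """Drop the holes from the slot array and rebuild the
        key -> index map.

        Both steps are a single linear pass, so triggering this only
        after O(N) deletions keeps the amortized cost per operation
        constant.
        """
        new_slots = [slot for slot in slots if slot is not deleted]
        new_index = {key: i for i, (key, value) in enumerate(new_slots)}
        return new_slots, new_index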

The implementation with a contiguous array has about the same 
complexity as the implementation with a linked list. The difference is 
only in the constant factor, and without real C code we can't determine 
which constant is lower.

Actually, the current C implementation of OrderedDict uses a contiguous 
array for mapping an index in the base hashtable to a list node. It is 
rebuilt from the linked list when the base hashtable is rebuilt. I 
planned to experiment with getting rid of the linked list entirely.

>>>> All this looks similar to Raymond's proof-of-concept for a compact dictionary (ordering is a side effect). [1]
>>>
>>> But his design only preserves ordering if you never delete (or move, but he doesn't have an API for that).
>>
>> This is only because preserving ordering is not a goal for a dict.
>
> Sure. But it is the goal for OrderedDict. Which is exactly why a design that makes sense for dict doesn't necessarily make sense for OrderedDict, unless you add something else. The hash table of list nodes works as such a something else. But Raymond's array doesn't (and isn't intended to).
>
>> Moving the last entry to the place of deleted one is the simplest design if don't care about ordering.
>>
>> But it is not hard to make the ordering be preserved (at the cost of larger amortized constant time and memory usage). Deleted entry should be just marked as deleted, and the storage should be packed only when a half (or 1/3, or 1/4) of entries is deleted.
>
> The entire benefit of Raymond's design is that it lets you store, and iterate, a dense array instead of a sparse one. If you're going to make that array every sparser than the hash table, you've lost all of the benefits. (But you're still paying the cost of an extra index or pointer, extra dereference, etc.)

The benefit of Raymond's design is that it lets you iterate over a 
contiguous array instead of jumping back and forth between list nodes 
scattered randomly in memory. And on modern computers, iterating even a 
sparse array can be faster than iterating a linked list. [*]

This is not the only benefit of Raymond's design; the other is a 
presumably more compact representation.

[*] Of course, benchmarks are needed to prove this.
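
A rough micro-benchmark along these lines could look like the following 
(pure Python, so it only approximates the cache behaviour of the C 
code; the sparsity and sizes are arbitrary, and numbers will vary):

    import timeit

    class Node:
        __slots__ = ('key', 'value', 'next')
        def __init__(self, key, value):
            self.key, self.value, self.next = key, value, None

    N = 10**5

    # Contiguous array with every other slot a hole (50% sparse).
    sparse_array = [None if i % 2 else (i, i) for i in range(2 * N)]

    # Singly linked list of N nodes.
    head = None
    for i in reversed(range(N)):
        node = Node(i, i)
        node.next = head
        head = node

    def iter_array():
        return sum(1 for slot in sparse_array if slot is not None)

    def iter_linked():
        count, node = 0, head
        while node is not None:
            count += 1
            node = node.next
        return count

    print('sparse array:', timeit.timeit(iter_array, number=100))
    print('linked list :', timeit.timeit(iter_linked, number=100))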



