[Python-ideas] Contiguous-array-based ordering for OrderedDict

Sven R. Kunze srkunze at mail.de
Mon Dec 14 13:00:49 EST 2015


Setting aside all the nice discussion about runtime and so on and so forth:

Couldn't he just provide a working implementation to prove his point? 
Maybe he'll find out for himself that it doesn't work as intended, and in 
the course of doing so he may even find a better solution.


Best,
Sven

On 14.12.2015 18:30, Andrew Barnert via Python-ideas wrote:
> On Dec 14, 2015, at 06:58, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>> On 13.12.15 06:57, Andrew Barnert via Python-ideas wrote:
>>> Let's go back to the original message. He started off talking about Tim Peters' post, which explains the need for a linked list because in an array, searching for the value takes linear time, and then removing it in-place takes linear time.
>> Searching a linked list for the value takes linear time too. To avoid that, an auxiliary hashtable is used.
> This is already covered in the very next paragraph.
>
>> In the current Python implementation it maps a key to a list node; in the proposed implementation it maps a key to an index in a contiguous array. No linear search anymore; this problem is already solved.
>>
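A minimal pure-Python sketch of the auxiliary-hashtable idea described above (an illustration, not the actual CPython code): the linked-list design keeps a dict mapping each key to its list node, so deletion is a hash lookup plus an O(1) unlink, and no other node is touched.

    class LinkedOrderedDict:
        # Doubly linked list of [PREV, NEXT, KEY] nodes plus an auxiliary
        # hashtable mapping key -> node, so no linear search is ever needed.
        def __init__(self):
            self._root = root = [None, None, None]   # sentinel node
            root[0] = root[1] = root
            self._map = {}       # key -> list node
            self._data = {}      # key -> value

        def __setitem__(self, key, value):
            if key not in self._map:
                root = self._root
                last = root[0]
                node = [last, root, key]
                last[1] = root[0] = self._map[key] = node
            self._data[key] = value

        def __delitem__(self, key):
            prev, nxt, _ = self._map.pop(key)   # O(1) lookup via the hashtable
            prev[1], nxt[0] = nxt, prev         # O(1) unlink; no other node changes
            del self._data[key]

        def __iter__(self):
            node = self._root[1]
            while node is not self._root:
                yield node[2]
                node = node[1]

The proposed design replaces the key -> node map with a key -> array-index map; the sketches further down show what that costs on deletion.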
>>> He suggested fixing that by batching up the deletes. That solves the cost of removing, but it doesn't solve the problem with the linear search at all--in fact, it makes things worse, because you're now searching a sparse array. That's what I was commenting on here.
>>>
>>> He later suggested maybe using a second hash for indices, as the current design does. This of course does solve the problem of the search, but then you have to change half the hash values for each shift, which makes that part even worse. I commented on that further below.
>> What do you mean by "change half the hash values for each shift", and why is this a problem?
> If you delete a linked list node, none of the other list nodes change. So, if you have hash entries pointing at all of the list nodes, only one hash entry has to be changed for a delete.
>
> If you delete an array element, all of the subsequent indices change. So, if you have hash entries pointing at the indices, on average, half of the hash entries have to be changed for a delete.
>
> And if you're about to say "but combining the two solves that problem", that's already in the very next paragraph, so please read on.
>
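To make the "half of the hash entries have to be changed" point concrete, here is a sketch of a naive array-backed ordered mapping (hypothetical code, not from the thread): after deleting entry i, every later entry has shifted left by one, so each of those keys needs its stored index rewritten, which is n/2 hashtable updates on average.

    class ArrayOrderedDict:
        # Contiguous entry array plus an auxiliary hashtable key -> index.
        def __init__(self):
            self._entries = []   # contiguous array of (key, value) pairs
            self._index = {}     # key -> position in _entries

        def __setitem__(self, key, value):
            if key in self._index:
                self._entries[self._index[key]] = (key, value)
            else:
                self._index[key] = len(self._entries)
                self._entries.append((key, value))

        def __delitem__(self, key):
            i = self._index.pop(key)
            del self._entries[i]              # O(n): shifts every later entry left
            for k, _ in self._entries[i:]:    # every shifted key now has a stale
                self._index[k] -= 1           # index; on average n/2 entries change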
>>> What about combining the two? He didn't mention that, but if you do that, every time you compact the array, you have to do a hash rebuild. Yes, a normal dict does this, but only logarithmically often, not every N/2 operations.
>> Let's distinguish between two scenarios. In the first scenario, the dictionary is only growing. After every O(N) additions you need to resize and rebuild a hashtable of size O(N).
> If that were true, dicts would be amortized linear time. You do O(N) work O(N) times, divided over O(N) operations, which means amortized O(N*N/N), which is linear.
>
> The fact that dict (and list) grow exponentially rather than linearly is critical. It means that, even in the case of nothing but inserts, you only rehash O(log N) times, not O(N), and the work done each time also only grows logarithmically. The total work done for all resizes ends up being just a constant times N (think about the sum of 1 + 1/2 + 1/4 + 1/8 + ... vs. 1 + 9/10 + 8/10 + ...), so, divided by N operations, you get amortized O(C*N/N), which is constant.
>
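The effect of exponential versus linear growth is easy to measure with a toy counter (an illustration only; the real dict growth policy differs in its details):

    def copies_during_inserts(n, grow):
        # Total elements copied by resizes while inserting n items one at a time.
        capacity, size, copied = 8, 0, 0
        for _ in range(n):
            if size == capacity:
                copied += size            # a resize rehashes/copies every element
                capacity = grow(capacity)
            size += 1
        return copied

    n = 1_000_000
    geometric  = copies_during_inserts(n, lambda c: c * 2)     # exponential growth
    arithmetic = copies_during_inserts(n, lambda c: c + 1024)  # fixed-step growth

    print(geometric / n)    # stays a small constant: O(1) amortized per insert
    print(arithmetic / n)   # grows with n: O(n) amortized per insert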
>> This gives amortized constant time per addition. In the proposed OrderedDict design you also need to resize an array and to resize and rebuild an auxiliary hashtable; both have linear complexity. The resulting amortized time is constant again.
> Sure, but this just shows that the fact that deletes are linear time doesn't matter if you never do any deletes.
>
>> In the second scenario, the dictionary has an equal number of additions and deletions, and its size does not change. For a dict this doesn't add anything to the base cost of addition and deletion. In the proposed OrderedDict design, since almost every addition appends an item to the array, periodic repacking is needed. That has linear cost (repacking the array and updating the auxiliary hashtable). The total amortized cost of one addition or deletion operation is constant.
> No, deletion operations do O(N) work O(M) times amortized over M, where N is the dict size and M the number of operations. It's linear in the max size of the dict.
>
> In the special case where the dict never changes size, you could just declare that a constant--but that's pretty misleading. For one thing, it's usually going to be a very large constant. For example, if you start with 1000 elements and do 2000 deletes and 2000 inserts, your constant is N/4. More importantly, a constant-sized dict with lots of mutations is a pretty rare use case, and in the more common case where there are more inserts than deletes, so the dict is growing, the delete times are increasing with the dict size. Unless the ratio of deletes to inserts is falling at the same rate (another rare special case), this isn't going to be constant.
>
> And if you're wondering how dict gets away with this, it's kind of cheating by just not generally shrinking on deletes, but that's a cheat that works just fine for the majority of real-life use cases (and it's hard to work around in the vast majority of cases where it doesn't). That obviously won't work here--if a dict later regrows, it automatically uses up all the deleted buckets, but if an array-based OrderedDict later regrows, it will just append onto the end, leaving all the holes behind. You have no choice but to compact those holes at some point, and there's no way to make that constant time in an array.
>
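For reference, a sketch of the tombstone-and-repack scheme being analyzed here (hypothetical code following the description above): deletes are O(1) until half the array is holes, and then the whole entry array and the auxiliary hashtable are rebuilt, which is the O(N) work that has to be amortized.

    class TombstoneOrderedDict:
        # Append-only entry array; deletions leave None holes, and the array
        # is repacked once half of it is holes.
        def __init__(self):
            self._entries = []   # (key, value) pairs, or None for deleted slots
            self._index = {}     # key -> position in _entries
            self._holes = 0

        def __setitem__(self, key, value):
            if key in self._index:
                self._entries[self._index[key]] = (key, value)
            else:
                self._index[key] = len(self._entries)
                self._entries.append((key, value))

        def __delitem__(self, key):
            self._entries[self._index.pop(key)] = None   # O(1) tombstone
            self._holes += 1
            if self._holes * 2 >= len(self._entries):
                self._compact()                          # O(N) every ~N/2 deletes

        def _compact(self):
            # Rebuild both the entry array and the auxiliary hashtable.
            self._entries = [e for e in self._entries if e is not None]
            self._index = {k: i for i, (k, _) in enumerate(self._entries)}
            self._holes = 0

        def __iter__(self):
            return (e[0] for e in self._entries if e is not None)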
>> The implementation with a contiguous array has about the same complexity as the implementation with a linked list. The difference is only in the constant multiplier, and without real C code we can't determine which constant is lower.
> If your argument were correct, you could always simulate a linked list with an array with only a constant-factor cost, and that just isn't true.
>
>> Actually the current C implementation of OrderedDict uses a contiguous array for mapping an index in the base hashtable to a list node. It is rebuilt from the linked list when the base hashtable is rebuilt. I was planning to experiment with getting rid of the linked list altogether.
>>
>>>>>> All this looks similar to Raymond's proof-of-concept for a compact dictionary (ordering is a side effect). [1]
>>>>> But his design only preserves ordering if you never delete (or move, but he doesn't have an API for that).
>>>> This is only because preserving ordering is not a goal for a dict.
>>> Sure. But it is the goal for OrderedDict. Which is exactly why a design that makes sense for dict doesn't necessarily make sense for OrderedDict, unless you add something else. The hash table of list nodes works as such a something else. But Raymond's array doesn't (and isn't intended to).
>>>
>>>> Moving the last entry into the place of the deleted one is the simplest design if you don't care about ordering.
>>>>
>>>> But it is not hard to make the ordering be preserved (at the cost of a larger amortized constant time and memory usage). A deleted entry would just be marked as deleted, and the storage would be repacked only when half (or 1/3, or 1/4) of the entries are deleted.
>>> The entire benefit of Raymond's design is that it lets you store, and iterate, a dense array instead of a sparse one. If you're going to make that array even sparser than the hash table, you've lost all of the benefits. (But you're still paying the cost of an extra index or pointer, extra dereference, etc.)
>> The benefit of Raymond's design is that it lets you iterate over a contiguous array instead of jumping forward and backward between list nodes randomly distributed in memory.
> The benefit of Raymond's design over the existing dict, as he explains in the ActiveState post, is that you can iterate (and store some of the data in) the dense array rather than iterating the hash table (which is a sparse array). He can maintain that denseness even in the face of deletions because he isn't trying to preserve order, so he can always just swap the last value into the deleted slot. If you try to use the same design to preserve order, you can't do that. If you instead mark deleted slots as deleted, and recompact every time it gets to 50% deleted, then you're iterating a sparse array that's more sparse than the hash table, so you do not get the time (or space) benefits over a dict that Raymond's design offers.
>
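The order-breaking delete referred to here can be sketched as follows (an illustration of the idea in Raymond's recipe, not his actual code): the last dense entry is swapped into the hole, so the array stays dense and only one hashtable entry changes, but insertion order is lost.

    def dense_delete(entries, index, key):
        # Delete `key` from a dense (key, value) list plus a key -> position map.
        i = index.pop(key)
        last = entries.pop()          # remove the final entry
        if i < len(entries):          # unless the final entry was the one deleted,
            entries[i] = last         # move it into the hole...
            index[last[0]] = i        # ...and fix exactly one hashtable entry

    entries = [("a", 1), ("b", 2), ("c", 3)]
    index = {"a": 0, "b": 1, "c": 2}
    dense_delete(entries, index, "a")
    # entries == [("c", 3), ("b", 2)]: still dense, but no longer insertion-ordered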
>> And on modern computers iterating over even a sparse array can be faster than iterating over a linked list. [*]
> Now you're proposing a completely different theoretical benefit from the one Raymond designed for and explained, so the analogy with his design doesn't buy you anything. You'd be better off sticking to analogy with the current plain dict, which, like your design, iterates over a sparse array, and is simpler to explain and analyze.
>
>> This is not the only benefit of Raymond's design; the other is a presumably more compact representation.
> Yes, if you can move each 48-byte hash bucket into a dense array, leaving only a 4-byte integer in the sparse one in the hash table, then you save 48 * load - 4 bytes per entry. But the design under discussion doesn't get that, because you're putting the large data in an array that's even sparser than the hash table, so you're losing space.
>
> One thing that might work is a linked list of arrays (and start offsets): to delete from the middle of a node, you just split it in two at that point, so your cost is only the cost of copying half a node rather than half the structure. This obviously has the same worst-case behavior as a linked list, but in most uses the constant multiplier is closer to that of an array (especially if tuned so the nodes often fill a cache line, memory page, disk block, or whatever else is most important for the specific use), which is why it's used for C++ std::deque, many filesystem structures, database blobs, etc.
>
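The linked-list-of-arrays idea above might look roughly like this (a sketch with hypothetical names and an untuned chunk size): deleting from the middle of a chunk splits only that chunk, so the cost is bounded by the chunk size rather than by the length of the whole sequence.

    class Chunk:
        # One node of an unrolled linked list: a small contiguous array plus links.
        __slots__ = ("items", "prev", "next")
        def __init__(self, items):
            self.items = items
            self.prev = None
            self.next = None

    def delete_at(chunk, pos):
        # Delete chunk.items[pos] by splitting the chunk at that point; at most
        # half a chunk is copied, never the whole sequence.
        left, right = chunk.items[:pos], chunk.items[pos + 1:]
        chunk.items = left
        if right:
            tail = Chunk(right)                   # new node holding the right half
            tail.prev, tail.next = chunk, chunk.next
            if chunk.next is not None:
                chunk.next.prev = tail
            chunk.next = tail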
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/


