[Python-3000] ordered dict for p3k collections?

Wed Sep 26 16:25:13 CEST 2007

On 9/26/07, Mark Summerfield <mark at qtrac.eu> wrote:
> On 2007-09-26, skip at pobox.com wrote:
> >     Mark> I have put a new version (incorporating another implementation
> >     Mark> idea from Paul Hankin) on PyPI:
> >
> >     Mark> http://pypi.python.org/pypi/sorteddict
> >
> > From that:
> >
> >     The main benefit of sorteddicts is that you never have to explicitly
> >     sort.
> >
> > Surely there must be something more than that.  Wrapping sorted() around a
> > keys() or values() call is a pretty trivial operation.  I didn't see that
> > the implementation saved anything.
>
> Assuming you have a good sorteddict implementation (i.e., based on a
> balanced tree or a skiplist, not the one I've put up which is just
> showing the API) you can gain significant performance benefits.

That depends very much on the use case, and in general I strongly
doubt it. I haven't looked this up in Knuth, but I believe that in a
sorted dict implementation, the best performance you can get for
random access and random insertions is O(log N), which is always beat
by the O(1) of a hash table. This translates in O(N log N) for
inserting N elements into a sorted dict, vs. O(N) in a hash table.
Sorted traversal is O(N) for the sorted dict and O(N log N) for the
hash table. So in order to gain a "significant performance benefit"
you'd have to have one pass of insertions and two traversals with a
small number of insertions or deletions in between (otherwise the
sorted result from the hash table could just be cached).

I don't believe that this pattern is common enough, but I don't know
your application.

> For example, if you have a large dataset that you need to traverse quite
> frequently in sorted order, calling sorted() each time will be expensive
> compared to simply traversing an intrinsically sorted data structure.
>
> When I program in C++/Qt I use QMap (a sorteddict) very often; the STL
> equivalent is called map. Both the Qt and STL libraries have dict
> equivalents (QHash and unordered_map), but my impression is that the
> sorted data structures are used far more frequently than the unsorted
> versions.

Perhaps out of ignorance? Or perhaps the hash implementations have
suboptimal implementations? Or perhaps because no equivalent to
sorted() exists?

> If you primarily program in Python, using dict + sorted() is very
> natural because they are built into the language. But using a sorted
> data structure and never sorting is a very common practice in other
> languages.

Ah, now the real reason you want this so badly is finally clear:
simply because you're more familiar with it. :-)

Is the number of elements in a typical use case large enough that the
performance difference even matters?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)