[Python-3000] ordered dict for p3k collections?

Mark Summerfield mark at qtrac.eu
Wed Sep 26 17:27:25 CEST 2007


On 2007-09-26, Guido van Rossum wrote:
> On 9/26/07, Mark Summerfield <mark at qtrac.eu> wrote:
> > On 2007-09-26, skip at pobox.com wrote:
> > >     Mark> I have put a new version (incorporating another
> > > implementation Mark> idea from Paul Hankin) on PyPI:
> > >
> > >     Mark> http://pypi.python.org/pypi/sorteddict
> > >
> > > From that:
> > >
> > >     The main benefit of sorteddicts is that you never have to
> > > explicitly sort.
> > >
> > > Surely there must be something more than that.  Wrapping sorted()
> > > around a keys() or values() call is a pretty trivial operation.  I
> > > didn't see that the implementation saved anything.
> >
> > Assuming you have a good sorteddict implementation (i.e., based on a
> > balanced tree or a skiplist, not the one I've put up which is just
> > showing the API) you can gain significant performance benefits.
>
> That depends very much on the use case, and in general I strongly
> doubt it. I haven't looked this up in Knuth, but I believe that in a
> sorted dict implementation, the best performance you can get for
> random access and random insertions is O(log N), which is always beat
> by the O(1) of a hash table. This translates in O(N log N) for
> inserting N elements into a sorted dict, vs. O(N) in a hash table.
> Sorted traversal is O(N) for the sorted dict and O(N log N) for the
> hash table. So in order to gain a "significant performance benefit"
> you'd have to have one pass of insertions and two traversals with a
> small number of insertions or deletions in between (otherwise the
> sorted result from the hash table could just be cached).

I'm sure your numbers are right. It seems to me that the trade off is
this: with dict + sorted() you pay O(N log N) whenever you need to sort
(okay, Python is optimised for sorting partially ordered data so
probably is better than the theoretical best). With sorteddict you pay
O(log N) for accessing, but you pay nothing for sorting.

> I don't believe that this pattern is common enough, but I don't know
> your application.

> > For example, if you have a large dataset that you need to traverse quite
> > frequently in sorted order, calling sorted() each time will be expensive
> > compared to simply traversing an intrinsically sorted data structure.
> >
> > When I program in C++/Qt I use QMap (a sorteddict) very often; the STL
> > equivalent is called map. Both the Qt and STL libraries have dict
> > equivalents (QHash and unordered_map), but my impression is that the
> > sorted data structures are used far more frequently than the unsorted
> > versions.
>
> Perhaps out of ignorance? Or perhaps the hash implementations have
> suboptimal implementations? Or perhaps because no equivalent to
> sorted() exists?

C++ provides sorting algorithms that can be applied to STL containers
(or Qt containers which also has its own sorting algorithms), so these
do exist.

> > If you primarily program in Python, using dict + sorted() is very
> > natural because they are built into the language. But using a sorted
> > data structure and never sorting is a very common practice in other
> > languages.
>
> Ah, now the real reason you want this so badly is finally clear:
> simply because you're more familiar with it. :-)

That is true!

> Is the number of elements in a typical use case large enough that the
> performance difference even matters?

I don't know. In C++ I use QMap or map so my ordering is free and I
never notice the extra cost of lookup compared with a hash. In Python, I
can only compare theoretically, not empirically because I'd need a
sorteddict that was as well implemented as dict is.

I'll leave sorteddict on PyPI for those sad souls who want it, and I'll
try to think "dict + sorted()" for Python.

Of course this might be academic for Python 3, at least for strings
(unless you implement some kind of string comparison normalisation
method), since two strings that are the same to humans may be different
byte sequences which rather makes "sorting" a moot point.

-- 
Mark Summerfield, Qtrac Ltd., www.qtrac.eu



More information about the Python-3000 mailing list