[Python-Dev] collections.sortedtree

Fri Mar 28 22:29:51 CET 2014

On Mar 26, 2014, at 1:31 PM, Marko Rauhamaa <marko at pacujo.net> wrote:

> I have made a full implementation of a balanced tree and would like to
> know what the process is to have it considered for inclusion in Python
> 3.
> 
> To summarize, the implementation closely parallels dict() features and
> resides in _collectionsmodule.c under the name collections.sortedtree.
> It uses solely the "<" operator to compare keys. I have chosen the AVL
> tree as an implementation technique.

FWIW, I think there may be room for a sorted collection somewhere in the
standard library.

As others have said, the best place to start is by putting a module on PyPi
to let it mature and to compete with other approaches to the problem.

Here are a few random thoughts on the over-all idea:

* An AVL balanced tree isn't the only solution or necessarily the best solution
to the problem.  Tree nodes tend to take more space than denser
structures and they have awful cache locality (these are the same reasons
that deques use doubly-linked blocks rather than a plain doubly linked lists).

* Another approach I have experimented with uses lazy sorting.  That lets
insertion be an O(1) step and incurs a one-time sorting cost upon the next
lookup (and because Timsort exploits partial orderings, this can be very
cheap).  A lazy sorted list is dense and sequentially ordered in memory
(reducing space overhead, taking full advantage of cache locality and memory
controller auto-prefetch, and providing fast iteration speed by not needing
to chase pointers).  The lazy sort approach works best in applications that
spend most of the time doing lookups and have only infrequent deletions
and insertions.

* The name of the tool probably should not be sortedtree. Ideally, the tool
should be named for what it does, not how it does it (for the most part,
users don't need to know whether the underlying implementation is
a red-black tree, b-tree, judy array, sqlite database, or lazy list).  That
is why (I think) that Python dicts are called dicts rather than hash tables
(the implementation) or an associative array (the computer science term
for the abstract datatype).

* There are plenty of data structures  that have had good utility and
popularity outside of the standard library.  I that it is a good thing that
blists, numpy arrays, databases, and panda dataframes all live outside
the standard library.  Those tools are easier to maintain externally and
it is easier for you to keep control over the design.  Remember the saying,
"the standard library is where code goes to die" (and I would add that it
should already be mature (or nearly dead) by the time it gets there). 

* That said, it is a reasonable possibility that the standard library would
benefit from some kind sorted collection (the idea comes up from time
to time).

 Raymond Hettinger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20140328/aba86088/attachment.html>