[Python-3000] Two proposals for a new list-like type: one modest, one radical

Mon Apr 23 17:56:54 CEST 2007

I have two proposals for Python 3000 that I'll write PEP(s) for if
there's interest.  The first proposal is, I think, a modest one.  The
second proposal is related, but more radical.  I fully expect the
second proposal to be rejected for that alone, especially since I am a
relatively an outsider to the Python developer community (though it
was great to meet some of you at PyCon this year).  I bring up the
more radical proposal primarily for completeness.

The Modest Proposal
-------------------------------

As a pet project, I've written a container type that looks, acts, and
quacks like a Python list(), but has better asymptotic performance.
Specifically, O(log n) time for inserts and deletes in the middle of
the list.  I'd like to offer it for inclusion in the collections
module.

Probably, you have encountered some other proposals for types that
meet this description.  Here are some of the unique features of mine,
called a BList, which is based on B+Trees.  I'm open to suggestions
for a better name.

- just as fast as a Python list() when the list is small
- getslice runs in O(log n) time
- making a shallow copy runs in O(1) time
- setslice runs in O(log n + log k) time if the inserted slice is a
BList of length k
- multipling a BList by k takes O(log k) time and O(log k) memory

Example:

>>> from blist import *
>>> x = BList([0])
>>> x *= 2**29
>>> x.append(5)
>>> y = x[4:-234234]
>>> del x[3:1024]

None of the above operations have a noticeable delay, even though the
lists have over 500 million elements due to line 3.

The BList has two key features that allow it to pull this off this performance:

1) The BList type features transparent copy-on-write.  If a non-root
node needs to be copied (as part of a getslice, copy, setslice, etc.),
the node is shared between multiple parents instead of being copied.
If it needs to be modified later, it will be copied at that time.
This is completely behind-the-scenes; from the user's point of view,
the BList works just like a regular Python list.

2) A B+Tree is a wide, squat tree.  The maximum number of children per
node is defined by a compile-time parameter (LIMIT).  If the list is
shorter than LIMIT, then there is only one node, which simply contains
an array of the objects in the list.  In other words, for short lists,
a BList works just like Python's array-based list() type.  Thus, it
has the same good performance on small lists.

So you can see the performance of the BList in more detail, I've made
several performance graphs available at the following link:
   http://stutzbachenterprises.com/blist/

Each webpage shows two graphs:
 - The raw execution time for the operation as a function of the list size
 - The execution time for the operation, normalized by the execution
time of Python's list()

These graphs were made using LIMIT=128.

The only operation where list() is much faster for small lists is
getitem through the x[5] syntax.  There's a special case for that
operation deep inside the Python bytecode interpreter.  My extension
module can't compete with that. :-)  BList is just as fast as list()
for small lists via .__getitem__().

The source code for the type is available for compilation as an external module:
    http://stutzbachenterprises.com/blist/blist.tar.gz

I wrote it against Python 2.5.

The Radical Proposal
-------------------------------

Replace list() with the BList.

There are only a few use-cases (that I can think of) where Python's
list() regularly outperforms the BList.  These are:

1. A large LIFO stack, where there are many .append() and .pop(-1)
operations.  These are O(1) for a Python list, but O(log n) for the
BList().  (BList's are just as fast for small lists)

2. An large, unchanging list, where there are many getitem() calls,
and none of the inserts or deletes where the BList() shines.  Again,
these are O(1) for a Python list, but O(log n) for the BList(), and
the BList() is just as fast if the list is small.

Use case #1 is covered by collections.deque.  Use case #2 is covered by tuples.

When I first started learning Python, I often ended up inadvertently
writing O(n**2) algorithms by putting a O(n) list operation inside a
loop (which is a problem when n=100,000).  Since nothing in the
signature of list() informs the user when an operation is going to be
slow, it took me a while to learn how to write my code to work around
this constraint.  In fact, I initially assumed "list" meant "linked
list" so most of my assumptions about which operations would be fast
vs. slow were backwards.  I think it would be more Pythonic if the
user can trust that the implementation will always be fast (at least
O(log n)).

I recognize that the Python developer community is understandably very
attached to the current array-based list, so I expect this to get shot
down.  I hope this doesn't reflect badly on the more modest proposal
of including a new type in the collections module.  Also, please don't
kill the messenger. :-)

-- 
Daniel Stutzbach, Ph.D.             President, Stutzbach Enterprises LLC