[Python-ideas] Proposal for extending the collections module - bags / multisets, ordered sets / unique lists

Michael Lenzen m.lenzen at gmail.com
Sat Jul 18 03:54:15 CEST 2009


In a nutshell, I want to add 2 new classes (and respective frozen 
varients) to the collections module, namely a bag (multiset) class and a 
setlist (ordered set/unique list) class.  I know this has been floated 
before, but I think it merits more consideration.

Everything I'm writing here is available with better formatting here: 
http://code.google.com/p/python-data-structures/wiki/collections
The latest version of my implementation is here: 
http://python-data-structures.googlecode.com/svn/trunk/collections_.py

I look at the basic data structures of Python like this (ignoring 
mutability which I will mention later) :

             | Unordered | Ordered |
-----------------------------------
Non-Unique  |           |   list  |
-----------------------------------
Unique      |    set    |         |
-----------------------------------

Python is missing 2 basic collections.  First, an unordered non-unique 
collection, a bag aka multiset, the simplest collection there is. You 
can't even really call them a data structure because there's no 
structure to them (which is their beauty). Secondly it's missing an 
ordered unique collection, an ordered set or unique list, what I'm 
calling a setlist.

That is the scope of when I started out, of course it grew and grew.  
The last addition but the first I will list, because I think it is the 
coolest and provides the most functionality, is the new function I 
propose adding to collections.  The function is simply called 
'collection' and has the following signature:

collection(iterable=None, mutable=False, ordered=False, unique=False)

What it does is return a collection (instantiated from iterable if 
passed) with the passed properties.  This will only be possible with 
then new classes I propose, because they don't exist yet.

New ABCs (abstract base classes)
--------------------------------
After starting to study the ABCs built in to Python I came up with 2 
more ABCs that I think should be included:

Collection - implements Sized, Iterable and Container.  Sequence, Set 
and Mapping would then all inherit Collection instead of the three 
aforementioned classes. Built-in classes that would be subclasses of 
Collection are set, frozenset, list and tuple.  Collection would provide 
2 methods:
   * _from_iterable(cls, it) - This is a classmethod that I took from 
Set, for convenience of returning new instantiations
   * __getitem__(key, value) - This is an abstract method. Every 
Collection would have to make its elements accessible. Of course set and 
frozenset do not currently implement this method so they will have to be 
changed, more later.

Mutable - I think there should be one metaclass for all mutable types 
just like there is currently Iterable, Sized and Container. We could 
then check instanceof(obj, Mutable). MutableSequence, MutableSet and 
MutableMapping would inherit Mutable, or we might be able to just do 
away with them altogether.  There are 3 abstractmethods defined:
   * pop(self) - This is the simplest to implement, all mutable classes 
already do.
   * __setitem__(self, key, value) - This is already implemented for 
list and dict but would have to be defined for set, more later
   * __delitem__(self, key) Same as above.

Extending set to fit the ABCs
-----------------------------
I mentioned before that sets don't currently fit into my model. Here are 
my propositions for the three methods that would need to be implemented.
   * set.__setitem__(self, elem, value) - Set whether or not elem is in 
the set based on the boolean evaluation of value. s[elem] = 1 would call 
s.add(elem) and s[elem] = False would call s.remove(elem), appropriately 
raising a KeyError if elem not in s. The reason I chose this way is that 
the only thing you can set about an element of a set is whether or not 
it is present.
   * __getitem__(self, item) - Simply returns item in self. Just like a 
bag returns the multiplicity of an item when you get it, this returns 
the number of times item appears in self. This also corresponds to the 
definition of setitem.
   * __delitem__(self, item) - This is the simplest case, just make it 
an alias for set.remove(item)

I think this is the most controversial part of my proposition, but the 
more I think about it, the more I like it. []s aren't reserved for one 
single purpose in Python, lists restrict the index to an integer while 
dicts allow any Hashable object. list allows slicing inside the []s 
while it is meaningless for a dict.

Finally, the New Classes
------------------------
These are the 2 new Collections (4 if you count the frozen variants) to 
fill in the holes I mentioned in the beginning.

bags - These would be true multisets, AKA unordered tuples.  See the 
wikipedia page on Multisets for more info. The benefits of a bag vs a 
list are that order doesn't matter for equality testing and that 
searching for an item is constant time. The latter is the real convincer 
for me.  I think this should replace the Counter class in the 
collections module which doesn't provide anything more than 
defaultdict(int) does.

setlist - AKA unique list, ordered set, bijection int->hashable, and 
ordered collection of unique elements.  I find myself often creating a 
list of unique elements, then an inverse dict to look up the location 
quickly or test inclusion.  I chose the name setlist because it's not a 
set xor a list, it's both. Then it occurred to me that bands' setlists 
are a perfect example of this data structure, an ordered collection of 
unique elements, real bands that is, 
http://en.wikipedia.org/wiki/Vertigo_Tour#The_encores

-Michael Lenzen



More information about the Python-ideas mailing list