Proposal for extending the collections module - bags / multisets, ordered sets / unique lists

In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration. Everything I'm writing here is available with better formatting here: http://code.google.com/p/python-data-structures/wiki/collections The latest version of my implementation is here: http://python-data-structures.googlecode.com/svn/trunk/collections_.py I look at the basic data structures of Python like this (ignoring mutability which I will mention later) : | Unordered | Ordered | ----------------------------------- Non-Unique | | list | ----------------------------------- Unique | set | | ----------------------------------- Python is missing 2 basic collections. First, an unordered non-unique collection, a bag aka multiset, the simplest collection there is. You can't even really call them a data structure because there's no structure to them (which is their beauty). Secondly it's missing an ordered unique collection, an ordered set or unique list, what I'm calling a setlist. That is the scope of when I started out, of course it grew and grew. The last addition but the first I will list, because I think it is the coolest and provides the most functionality, is the new function I propose adding to collections. The function is simply called 'collection' and has the following signature: collection(iterable=None, mutable=False, ordered=False, unique=False) What it does is return a collection (instantiated from iterable if passed) with the passed properties. This will only be possible with then new classes I propose, because they don't exist yet. New ABCs (abstract base classes) -------------------------------- After starting to study the ABCs built in to Python I came up with 2 more ABCs that I think should be included: Collection - implements Sized, Iterable and Container. Sequence, Set and Mapping would then all inherit Collection instead of the three aforementioned classes. Built-in classes that would be subclasses of Collection are set, frozenset, list and tuple. Collection would provide 2 methods: * _from_iterable(cls, it) - This is a classmethod that I took from Set, for convenience of returning new instantiations * __getitem__(key, value) - This is an abstract method. Every Collection would have to make its elements accessible. Of course set and frozenset do not currently implement this method so they will have to be changed, more later. Mutable - I think there should be one metaclass for all mutable types just like there is currently Iterable, Sized and Container. We could then check instanceof(obj, Mutable). MutableSequence, MutableSet and MutableMapping would inherit Mutable, or we might be able to just do away with them altogether. There are 3 abstractmethods defined: * pop(self) - This is the simplest to implement, all mutable classes already do. * __setitem__(self, key, value) - This is already implemented for list and dict but would have to be defined for set, more later * __delitem__(self, key) Same as above. Extending set to fit the ABCs ----------------------------- I mentioned before that sets don't currently fit into my model. Here are my propositions for the three methods that would need to be implemented. * set.__setitem__(self, elem, value) - Set whether or not elem is in the set based on the boolean evaluation of value. s[elem] = 1 would call s.add(elem) and s[elem] = False would call s.remove(elem), appropriately raising a KeyError if elem not in s. The reason I chose this way is that the only thing you can set about an element of a set is whether or not it is present. * __getitem__(self, item) - Simply returns item in self. Just like a bag returns the multiplicity of an item when you get it, this returns the number of times item appears in self. This also corresponds to the definition of setitem. * __delitem__(self, item) - This is the simplest case, just make it an alias for set.remove(item) I think this is the most controversial part of my proposition, but the more I think about it, the more I like it. []s aren't reserved for one single purpose in Python, lists restrict the index to an integer while dicts allow any Hashable object. list allows slicing inside the []s while it is meaningless for a dict. Finally, the New Classes ------------------------ These are the 2 new Collections (4 if you count the frozen variants) to fill in the holes I mentioned in the beginning. bags - These would be true multisets, AKA unordered tuples. See the wikipedia page on Multisets for more info. The benefits of a bag vs a list are that order doesn't matter for equality testing and that searching for an item is constant time. The latter is the real convincer for me. I think this should replace the Counter class in the collections module which doesn't provide anything more than defaultdict(int) does. setlist - AKA unique list, ordered set, bijection int->hashable, and ordered collection of unique elements. I find myself often creating a list of unique elements, then an inverse dict to look up the location quickly or test inclusion. I chose the name setlist because it's not a set xor a list, it's both. Then it occurred to me that bands' setlists are a perfect example of this data structure, an ordered collection of unique elements, real bands that is, http://en.wikipedia.org/wiki/Vertigo_Tour#The_encores -Michael Lenzen

Excuse my ignorance, but it's been a while since I've taken an algorithms class: Would a bag have any Big-O performance advantage over just using a list but ignoring the fact that it's in an order? Non-big O but otherwise practical speed advantages? As for the ordered set, there I can see the uses much more clearly, and the proposal to included it has come up on this list before. At the same time though, I think some of those uses can also be realized using the new collections.Counter class. Then again, if there's interest, why not? -- Carl Johnson

A Counter class is NOT a multiset, it allows negative and zero elements. This makes as much sense to me as having negative elements in a Set. This is all besides the fact that Counter is nothing more than defaultdict(int) so it should be removed regardless of whether or not bags are added. -Michael Lenzen On 07/17/2009 09:36 PM, Carl Johnson wrote:
Excuse my ignorance, but it's been a while since I've taken an algorithms class: Would a bag have any Big-O performance advantage over just using a list but ignoring the fact that it's in an order? Non-big O but otherwise practical speed advantages?
As for the ordered set, there I can see the uses much more clearly, and the proposal to included it has come up on this list before. At the same time though, I think some of those uses can also be realized using the new collections.Counter class. Then again, if there's interest, why not?
-- Carl Johnson _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

Sorry, the previous post was in the wrong thread. Yes, checking for inclusion (elem in bag) and removing an elem would be O(1) time as opposed to O(n). I think the former is the real case for bags over lists. As for real world applications it makes a huge difference in a program I am running, I am repeatedly searching a collection of almost 1 million elements for an item and it takes a long time using a list. A list just iterates through all of the elements until it finds what it's looking for whereas a bag just hashes the element. -Michael Lenzen On 07/17/2009 09:36 PM, Carl Johnson wrote:
Excuse my ignorance, but it's been a while since I've taken an algorithms class: Would a bag have any Big-O performance advantage over just using a list but ignoring the fact that it's in an order? Non-big O but otherwise practical speed advantages?
As for the ordered set, there I can see the uses much more clearly, and the proposal to included it has come up on this list before. At the same time though, I think some of those uses can also be realized using the new collections.Counter class. Then again, if there's interest, why not?
-- Carl Johnson _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
A Counter class (aka bag/multiset) and an OrderedDict class were added to collections for 3.1. These classes will also be available in 2.7. Counter *is* a bag implementation, while OrderedDict can trivially be used as the basis for an OrderedSet implementation. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Sat, 18 Jul 2009 12:58:18 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
A Counter class (aka bag/multiset) and an OrderedDict class were added to collections for 3.1. These classes will also be available in 2.7.
Counter *is* a bag implementation, while OrderedDict can trivially be used as the basis for an OrderedSet implementation.
If that's good enough, why do we have set? More importantly, will the ones that you don't have to write yourself all have the same API? It would be nice to be able to change between the four cases as the requirements change without having to audit every use of them to turn the spelling of add_element_to_type_x into add_element_to_type_y, and so on. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

I posted this in the wrong thread before, sorry. A Counter class is NOT a multiset/bag, it allows negative and zero elements. This makes as much sense to me as having negative elements in a Set. This is all besides the fact that Counter is nothing more than defaultdict(int) so it should be removed regardless of whether or not bags are added. -Michael Lenzen On 07/17/2009 09:58 PM, Nick Coghlan wrote:
Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
A Counter class (aka bag/multiset) and an OrderedDict class were added to collections for 3.1. These classes will also be available in 2.7.
Counter *is* a bag implementation, while OrderedDict can trivially be used as the basis for an OrderedSet implementation.
Cheers, Nick.

On 07/17/2009 09:58 PM, Nick Coghlan wrote:
Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
A Counter class (aka bag/multiset) and an OrderedDict class were added to collections for 3.1. These classes will also be available in 2.7.
Counter *is* a bag implementation, while OrderedDict can trivially be used as the basis for an OrderedSet implementation. On Fri, Jul 17, 2009 at 9:57 PM, Michael Lenzen<m.lenzen@gmail.com> wrote: I posted this in the wrong thread before, sorry.
A Counter class is NOT a multiset/bag, it allows negative and zero elements. This makes as much sense to me as having negative elements in a Set. This is all besides the fact that Counter is nothing more than defaultdict(int) so it should be removed regardless of whether or not bags are added.
Truth be told, it's more than just defaultdict(int). It adds .elements() and .most_common(). Seems bag-like to me. - Unordered? Check. - Allows duplicates? Check. - O(1) containment test? Check. - Counts multiplicity of elements? Check. - Iterable? Check. The only non-bag thing about it is allowing 0 and negative multiplicities, which I agree is unintuitive; I don't like that "feature" either. +0.75 on adding a proper Bag. It could be subclassed from or borrow code from Counter. Cheers, Chris -- http://blog.rebertia.com

On Fri, Jul 17, 2009 at 11:18 PM, Chris Rebert<pyideas@rebertia.com> wrote:
On 07/17/2009 09:58 PM, Nick Coghlan wrote:
Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
A Counter class (aka bag/multiset) and an OrderedDict class were added to collections for 3.1. These classes will also be available in 2.7.
Counter *is* a bag implementation, while OrderedDict can trivially be used as the basis for an OrderedSet implementation. On Fri, Jul 17, 2009 at 9:57 PM, Michael Lenzen<m.lenzen@gmail.com> wrote: I posted this in the wrong thread before, sorry.
A Counter class is NOT a multiset/bag, it allows negative and zero elements. This makes as much sense to me as having negative elements in a Set. This is all besides the fact that Counter is nothing more than defaultdict(int) so it should be removed regardless of whether or not bags are added.
Truth be told, it's more than just defaultdict(int). It adds .elements() and .most_common().
Seems bag-like to me. - Unordered? Check. - Allows duplicates? Check. - O(1) containment test? Check. - Counts multiplicity of elements? Check. - Iterable? Check.
The only non-bag thing about it is allowing 0 and negative multiplicities, which I agree is unintuitive; I don't like that "feature" either.
Actually, from the docs, it also appears (I don't have 3.0 handy to test) to get len() wrong, using the dict definition of "number of unique items" rather than just "number of items" as would be more appropriate for a bag. In the event a Bag is not added, +1 for adding a method to Counter to return `sum(count if count > 0 else 0 for count in a_counter.values())` Cheers, Chris -- http://blog.rebertia.com

On 07/18/2009 01:30 AM, Chris Rebert wrote:
On Fri, Jul 17, 2009 at 11:18 PM, Chris Rebert<pyideas@rebertia.com> wrote:
Truth be told, it's more than just defaultdict(int). It adds .elements() and .most_common().
Seems bag-like to me. - Unordered? Check. - Allows duplicates? Check. - O(1) containment test? Check. - Counts multiplicity of elements? Check. - Iterable? Check.
The only non-bag thing about it is allowing 0 and negative multiplicities, which I agree is unintuitive; I don't like that "feature" either.
Actually, from the docs, it also appears (I don't have 3.0 handy to test) to get len() wrong, using the dict definition of "number of unique items" rather than just "number of items" as would be more appropriate for a bag.
In the event a Bag is not added, +1 for adding a method to Counter to return `sum(count if count> 0 else 0 for count in a_counter.values())`
Cheers, Chris
As well as getting len() wrong, it gets iteration wrong. It iterates over elements with counts of 0 and -1 as well as only iterating once over elements that appear multiple times. Yes you can iterate over .elements(), but this should be the default not a special case. As for adding most_common, it just calls heapq.nlargest(n, self.items(), key=_itemgetter(1)) which anyone can do, and my bag class does. My bag class behaves like a collection and provides a .unique_elements() method that returns the underlying set. You can .add(elem) and .delete(elem) just like you can with a set, or you can manually change their multiplicities like in Counter with bag[elem] = 5 or bag[elem] -= 2. If Counter is supposed to be a collection of elements, this makes no sense:
c = Counter() c['a'] += 1 c['a'] -= 1 'a' in c True
-Michael Lenzen

I drove a similar discussion a few months ago (April or May), nobody was opposing then, so I'd say go ahead and file a feature request. Of course, I'm personally strongly +1, especially for ordered set/unique list. Various implementations: http://code.activestate.com/recipes/576694/ http://www.peterbe.com/plog/uniqifiers-benchmark/ A list with no duplicate values is a common problem judging on the following: http://www.google.com/codesearch?q=lang%3Apython+list+unique http://www.google.com/codesearch?q=lang%3Apython+list+duplicates http://code.activestate.com/recipes/52560/ http://www.google.com/search?q=python+unique+list

I was thinking of submitting a PEP, but I wanted to get feedback first and finish up my reference implementation. -Michael Lenzen http://code.google.com/p/python-data-structures/wiki/collections On 07/18/2009 01:03 PM, Mart Sõmermaa wrote:
I drove a similar discussion a few months ago (April or May), nobody was opposing then, so I'd say go ahead and file a feature request.
Of course, I'm personally strongly +1, especially for ordered set/unique list.
Various implementations:
http://code.activestate.com/recipes/576694/ http://www.peterbe.com/plog/uniqifiers-benchmark/
A list with no duplicate values is a common problem judging on the following:
http://www.google.com/codesearch?q=lang%3Apython+list+unique http://www.google.com/codesearch?q=lang%3Apython+list+duplicates http://code.activestate.com/recipes/52560/ http://www.google.com/search?q=python+unique+list _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

On 07/18/2009 01:03 PM, Mart Sõmermaa wrote:
I drove a similar discussion a few months ago (April or May), nobody was opposing then, so I'd say go ahead and file a feature request.
Of course, I'm personally strongly +1, especially for ordered set/unique list.
Various implementations:
http://code.activestate.com/recipes/576694/ http://www.peterbe.com/plog/uniqifiers-benchmark/
A list with no duplicate values is a common problem judging on the following:
http://www.google.com/codesearch?q=lang%3Apython+list+unique http://www.google.com/codesearch?q=lang%3Apython+list+duplicates http://code.activestate.com/recipes/52560/ http://www.google.com/search?q=python+unique+list
On Sat, Jul 18, 2009 at 11:35 AM, Michael Lenzen<m.lenzen@gmail.com> wrote:
I was thinking of submitting a PEP, but I wanted to get feedback first and finish up my reference implementation.
A PEP is probably not be necessary just to add a class to the std lib. Counter was added without a PEP after all. You would need to convince the committers though. Cheers, Chris -- http://blog.rebertia.com

My full proposition includes 2 new classes, 2 new abstract base classes, modifying set and a new `collection` function to abstract away the new multitude of classes. The collection function is my favorite part, you just pass it parameters about the collection you want (mutable, ordered and/or unique) and it returns the collection that fits without you ever having to figure it out. -Michael Lenzen
On Sat, Jul 18, 2009 at 11:35 AM, Michael Lenzen<m.lenzen@gmail.com> wrote:
I was thinking of submitting a PEP, but I wanted to get feedback first and finish up my reference implementation.
A PEP is probably not be necessary just to add a class to the std lib. Counter was added without a PEP after all. You would need to convince the committers though.
Cheers, Chris

On Jul 18, 2009, at 4:07 PM, Michael Lenzen wrote:
My full proposition includes 2 new classes, 2 new abstract base classes, modifying set and a new `collection` function to abstract away the new multitude of classes.
The collection function is my favorite part, you just pass it parameters about the collection you want (mutable, ordered and/or unique) and it returns the collection that fits without you ever having to figure it out.
I was silent on this because I wanted someone that has a say in this stuff to stop it but as you really think this is important I have to point out that this function is probably unaceptable, it makes code that much harder to read (instead of set or list I have to see each parameter and form on my mind what they map to) and also GvR already said before that he doesn't like functions that have flags that make them do completely different things based on those. Multiset seems interesting, some of the ABCs you mentioned seems to be just over engineering (maybe they just need a better justification) but this function in my view is unacceptable. -- Leonardo Santagada santagada at gmail.com

On 07/18/2009 10:19 PM, Leonardo Santagada wrote:
On Jul 18, 2009, at 4:07 PM, Michael Lenzen wrote:
My full proposition includes 2 new classes, 2 new abstract base classes, modifying set and a new `collection` function to abstract away the new multitude of classes.
The collection function is my favorite part, you just pass it parameters about the collection you want (mutable, ordered and/or unique) and it returns the collection that fits without you ever having to figure it out.
I was silent on this because I wanted someone that has a say in this stuff to stop it but as you really think this is important I have to point out that this function is probably unaceptable, it makes code that much harder to read (instead of set or list I have to see each parameter and form on my mind what they map to) and also GvR already said before that he doesn't like functions that have flags that make them do completely different things based on those.
Multiset seems interesting, some of the ABCs you mentioned seems to be just over engineering (maybe they just need a better justification) but this function in my view is unacceptable.
I don't understand what's so bad about such a collection factory. As a programmer I don't care what class an object is. I just want an object that behaves a certain way. E.g. I want to say: I need a random accessible collection, or I need a collection with unique entries, or I need a collection with a fast (near to O(1)) way to append elements, or all of the above. What class this object will have I don't care. Where is the problem? -panzi

On Sun, 19 Jul 2009 06:45:46 am Mathias Panzenböck wrote:
I don't understand what's so bad about such a collection factory. As a programmer I don't care what class an object is. I just want an object that behaves a certain way. E.g. I want to say: I need a random accessible collection, or I need a collection with unique entries, or I need a collection with a fast (near to O(1)) way to append elements, or all of the above. What class this object will have I don't care.
Where is the problem?
The problem is that you *do* need to care what the class is, because different classes have different interfaces:
def selector(code): ... return {0: list, 1: tuple, 2: set}[code] ... k = selector(0) instance1 = k([1, 2, 3, 4]) k = selector(2) instance2 = k([1, 2, 3, 4])
So far so good. But now:
instance1.append(5) instance2.append(5) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'set' object has no attribute 'append'
Unless you can guarantee identical interfaces, you can't use the different collection classes as drop-in replacements for each other. And you can't have identical interfaces once you have mutable and immutable types, so the whole idea is still-born. -- Steven D'Aprano

On 07/19/2009 03:24 AM, Steven D'Aprano wrote:
On Sun, 19 Jul 2009 06:45:46 am Mathias Panzenböck wrote:
I don't understand what's so bad about such a collection factory. As a programmer I don't care what class an object is. I just want an object that behaves a certain way. E.g. I want to say: I need a random accessible collection, or I need a collection with unique entries, or I need a collection with a fast (near to O(1)) way to append elements, or all of the above. What class this object will have I don't care.
Where is the problem?
The problem is that you *do* need to care what the class is, because different classes have different interfaces:
def selector(code): ... return {0: list, 1: tuple, 2: set}[code] ... k = selector(0) instance1 = k([1, 2, 3, 4]) k = selector(2) instance2 = k([1, 2, 3, 4])
So far so good. But now:
instance1.append(5) instance2.append(5) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'set' object has no attribute 'append'
Unless you can guarantee identical interfaces, you can't use the different collection classes as drop-in replacements for each other. And you can't have identical interfaces once you have mutable and immutable types, so the whole idea is still-born.
True. Ok, I didn't realize that the possible returned classes where that different. However, I always thought it's odd that pythons list has append but the set class has the method add. Whats the reason for that anyway? What I thought of was a function that would return (in Java terms) instances of: ArrayList, LinkedList, HashSet, TreeSet, LinkedHashSet, ConcurrentSkipListSet, CopyOnWriteArraySet, CopyOnWriteArrayList, Vector etc. (In Java terms: All cast to the super type Collection.) So I don't have to study the class hierarchy and documentation on what class to use best (and make apropriate changes when there is a new, better fitting class introduced). The factory will choose the best fitting class for me. However, I don't think there are enough classes one could choose from in the python standard library for this to be appropriate. -panzi

On Sun, 19 Jul 2009 11:44:00 am Mathias Panzenböck wrote:
I always thought it's odd that pythons list has append but the set class has the method add. Whats the reason for that anyway?
You can't append an item to the end of the set, because sets don't have an end. They're unordered. -- Steven D'Aprano

On 07/19/2009 04:14 AM, Steven D'Aprano wrote:
On Sun, 19 Jul 2009 11:44:00 am Mathias Panzenböck wrote:
I always thought it's odd that pythons list has append but the set class has the method add. Whats the reason for that anyway?
You can't append an item to the end of the set, because sets don't have an end. They're unordered.
Yes, but you can add an element to a list. That the position where it is added is the end is under the interface of the super class just coincidence. -panzi

Mathias Panzenböck wrote:
I always thought it's odd that pythons list has append but the set class has the method add. Whats the reason for that anyway?
In English, "append" means "add at the end", which doesn't make sense for an unordered collection. On the other hand, "add" doesn't imply anything about ordering, so using it for lists would make the intent less clear. -- Greg

Dnia 19-07-2009 o 03:24:38 Steven D'Aprano <steve@pearwood.info> napisał(a):
On Sun, 19 Jul 2009 06:45:46 am Mathias Panzenböck wrote:
I don't understand what's so bad about such a collection factory. As a programmer I don't care what class an object is. I just want an object that behaves a certain way. E.g. I want to say: I need a random accessible collection, or I need a collection with unique entries, or I need a collection with a fast (near to O(1)) way to append elements, or all of the above. What class this object will have I don't care.
Where is the problem?
The problem is that you *do* need to care what the class is, because different classes have different interfaces:
It could be a factory similar to namedtuple: creating not objects of container type, but container types: OrdSet = collection(ordered=True, unique=True) numbers = OrdSet([1, 2, 3 ,4 ,5]) letters = OrdSet(['x', 'y', 'z']) names = OrdSet(['Ann', 'Berta', 'Claude', 'Danuta']) -- Jan Kaliszewski <zuo@chopin.edu.pl>

On 07/18/2009 03:19 PM, Leonardo Santagada wrote:
I was silent on this because I wanted someone that has a say in this stuff to stop it but as you really think this is important I have to point out that this function is probably unaceptable, it makes code that much harder to read (instead of set or list I have to see each parameter and form on my mind what they map to) and also GvR already said before that he doesn't like functions that have flags that make them do completely different things based on those.
Multiset seems interesting, some of the ABCs you mentioned seems to be just over engineering (maybe they just need a better justification) but this function in my view is unacceptable.
-- Leonardo Santagada santagada at gmail.com
Thanks, this is why I posted here before I proposed anything formally, I wasn't aware of the precedence. Maybe I didn't explain it clearly enough, or maybe I'm just wrong. The Collection ABC was mostly to unify operations of the different collections for the collection function, so that a generic collection returned by the collection function meant something, you wouldn't have to look back to see how it will perform. For example all collections would have to provide a count(elem) method, or all mutable collections could have to provide add and remove methods. I don't understand how this code becomes unreadable.
hand = collection(ordered=False, unique=False) for i in range(5): hand.add(deck.pop_card()) for card in cards: if hand.count(card) == 2: print('You have a pair.')
-Michael Lenzen

On Sun, 19 Jul 2009 06:54:39 am Michael Lenzen wrote:
I don't understand how this code becomes unreadable.
hand = collection(ordered=False, unique=False) for i in range(5): hand.add(deck.pop_card())
Do you expect that code to work if the caller had used "mutable=False" in the call to collection()?
for card in cards: if hand.count(card) == 2: print('You have a pair.')
Why would a class with unique=True have a count() method? -- Steven D'Aprano

On Sun, 19 Jul 2009 11:28:37 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
for card in cards: if hand.count(card) == 2: print('You have a pair.') Why would a class with unique=True have a count() method?
So that you can write code that doesn't care which of the various flavors of collection it gets passed, but will work on them anyway. It supports a property of classes in OO programming called "polymorphism". It's a good thing. That python's collections don't have it has always been a minor wart, and it gets bigger as we get more types of collections. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Jul 19, 2009, at 1:31 AM, Mike Meyer wrote:
On Sun, 19 Jul 2009 11:28:37 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
for card in cards: if hand.count(card) == 2: print('You have a pair.') Why would a class with unique=True have a count() method?
So that you can write code that doesn't care which of the various flavors of collection it gets passed, but will work on them anyway.
It supports a property of classes in OO programming called "polymorphism". It's a good thing. That python's collections don't have it has always been a minor wart, and it gets bigger as we get more types of collections.
For example ints don't support slicing and that is not an omission of polymorphism, there is no sense in having that on them. A dict or set having a count method would make no sense either. This has nothing to do with polymorphism. add on lists or append on sets could be discussed (maybe append on sets makes more sense as they already have pop), but adding count in dict or set would not be pragmatic. -- Leonardo Santagada santagada at gmail.com

On Sun, 19 Jul 2009 04:00:48 -0300 Leonardo Santagada <santagada@gmail.com> wrote:
On Sun, 19 Jul 2009 11:28:37 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
for card in cards: if hand.count(card) == 2: print('You have a pair.') Why would a class with unique=True have a count() method?
So that you can write code that doesn't care which of the various flavors of collection it gets passed, but will work on them anyway.
It supports a property of classes in OO programming called "polymorphism". It's a good thing. That python's collections don't have it has always been a minor wart, and it gets bigger as we get more types of collections. For example ints don't support slicing and that is not an omission of
On Jul 19, 2009, at 1:31 AM, Mike Meyer wrote: polymorphism, there is no sense in having that on them.
Given that there aren't any types related to int for which slicing makes sense, this is true.
A dict or set having a count method would make no sense either. This has nothing to do with polymorphism.
Actually, count on a dict makes *lots* of sense. I can think of a lot of uses for counting the number of occurrences of values in a dictionary, mostly revolving around using dictionaries to implement sparse arrays of various kinds. As for sets, the "count" method is obviously useful on multisets. And it takes on reasonable values for the subset of multisets with unique values. Yes, it's either 1 or 0, and yes, it's isomorphic to testing membership, but it makes sense. An algorithm that uses count on multisets should work just fine on sets - that's what polymorphism buys for you. Saying that "count" doesn't make sense on sets is a lot like saying "length" and "width" don't make sense on squares, since those are equal - which would mean code written to handle rectangles wouldn't work on squares, unless they were crammed into rectangle objects. Part of the problem is that Python's collections types are a collection (no pun intended) of things that have shown themselves to be useful in practice. This has resulted in types that have nearly identical sets of operations on them, but the operations have different names, so the types - which would be polymorphic if they'd been designed to work together - aren't polymorphic. I get bit by this fairly often. Most recently spending last friday afternoon auditing code to make sure I'd turned all the list methods into the appropriate set methods in a graph class's node walking method.
add on lists or append on sets could be discussed (maybe append on sets makes more sense as they already have pop), but adding count in dict or set would not be pragmatic.
Nah, it'd just be useful. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Sun, 19 Jul 2009 05:47:28 pm Mike Meyer wrote:
A dict or set having a count method would make no sense either. This has nothing to do with polymorphism.
Actually, count on a dict makes *lots* of sense. I can think of a lot of uses for counting the number of occurrences of values in a dictionary, mostly revolving around using dictionaries to implement sparse arrays of various kinds.
But dicts don't allow you to look up on values, you look up on keys. dict.count(value) might be useful -- I doubt that there are "a lot of uses", but there are probably a few -- but it's hardly a fundamental operation on a dict. I'd argue it would be better as a function count_value(mapping, value) rather than a method. That way, you can make it polymorphic on the first argument, without requiring that all mappings inherit from each other. Arguably you might have dict.count(key) as a dict method, but since that can only return 0 or 1, it's just another way of spelling key in dict which is faster, shorter and easier to read.
As for sets, the "count" method is obviously useful on multisets.
But not for sets, where it is just another way of spelling "element in set".
And it takes on reasonable values for the subset of multisets with unique values. Yes, it's either 1 or 0, and yes, it's isomorphic to testing membership, but it makes sense. An algorithm that uses count on multisets should work just fine on sets - that's what polymorphism buys for you.
That can't be true, because their behaviour is different and they therefore aren't interchangeable. If obj is either a set or a multiset, the state of obj after: obj.add(element) obj.add(element) obj.remove(element) will be different. An algorithm that expects obj to be empty will fail if obj is a multiset, and an algorithm that expects obj will not be empty will fail if it is a set. Either way, they aren't interchangeable.
Saying that "count" doesn't make sense on sets is a lot like saying "length" and "width" don't make sense on squares, since those are equal - which would mean code written to handle rectangles wouldn't work on squares, unless they were crammed into rectangle objects.
Funny you should mention that. Squares and rectangles is just another example of the Circle And Ellipse Problem: http://www.c2.com/cgi/wiki?CircleAndEllipseProblem You can't expect to use squares everywhere you use rectangles. Assuming mutable objects: rectangle.width = 2*rectangle.height assert rectangle.width == 2*rectangle.height will work, but: square.width = 2*square.height if problematic. There are two "reasonable" approaches to it: (1) keep the invariant width==height by setting both the width and the height to twice the existing height, which will then cause: assert square.width == 2*square.height to fail; or (2) treat the square as if it were a rectangle, and stretch the object as requested, but that will break the invariant that squares are, well, square. Either way, you break something. Note that the problem is slightly different if you use immutable objects. Now you can't stretch a square into a rectangular shape, but how do you create your objects? Both of: square(width=5, height=10) rectangle(width=5) will fail. (Although rectangle could make height optional, and default to the same value as width. However, that's also problematic: "In the face of ambiguity, refuse the temptation to guess".)
Part of the problem is that Python's collections types are a collection (no pun intended) of things that have shown themselves to be useful in practice. This has resulted in types that have nearly identical sets of operations on them, but the operations have different names, so the types - which would be polymorphic if they'd been designed to work together - aren't polymorphic.
They wouldn't be polymorphic, because they aren't polymorphic. The following is a good example of why they're not polymorphic, but some algorithms are agnostic to the choice of data structure:
I get bit by this fairly often. Most recently spending last friday afternoon auditing code to make sure I'd turned all the list methods into the appropriate set methods in a graph class's node walking method.
Lists are ordered, sets are not. You can *insert* or *append* to a list, but not to a set, because sets aren't ordered. True, one could artificially pretend that sets are polymorphic to lists by using misleading method names: set.add -> set.append and do-nothing methods like set.sort() and set.reverse(). But apart from satisfying people who have read too many OO books, what would be the point? -- Steven D'Aprano

On Sun, 19 Jul 2009 18:49:09 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, 19 Jul 2009 05:47:28 pm Mike Meyer wrote:
A dict or set having a count method would make no sense either. This has nothing to do with polymorphism.
Actually, count on a dict makes *lots* of sense. I can think of a lot of uses for counting the number of occurrences of values in a dictionary, mostly revolving around using dictionaries to implement sparse arrays of various kinds.
But dicts don't allow you to look up on values, you look up on keys.
So? I wasn't talking about looking up values, I was talking about counting them. Of course, lists don't do that very well either - you can only look up two values, or maybe it's one. And that's just another thing that makes changing an implementations from lists to dicts
dict.count(value) might be useful -- I doubt that there are "a lot of uses", but there are probably a few -- but it's hardly a fundamental operation on a dict. I'd argue it would be better as a function count_value(mapping, value) rather than a method. That way, you can make it polymorphic on the first argument, without requiring that all mappings inherit from each other.
That would be fine - except that count_values(list, value) is another way of spelling list.count(value).
Arguably you might have dict.count(key) as a dict method,
You might argue that, but I wasn't.
As for sets, the "count" method is obviously useful on multisets. But not for sets, where it is just another way of spelling "element in set".
True. And if you know you're working with sets, that's the preferred spelling.
And it takes on reasonable values for the subset of multisets with unique values. Yes, it's either 1 or 0, and yes, it's isomorphic to testing membership, but it makes sense. An algorithm that uses count on multisets should work just fine on sets - that's what polymorphism buys for you.
That can't be true, because their behaviour is different and they therefore aren't interchangeable. If obj is either a set or a multiset, the state of obj after:
obj.add(element) obj.add(element) obj.remove(element)
will be different.
That just means the two classes don't always behave the same, which is why we give them different names.
An algorithm that expects obj to be empty will fail if obj is a multiset, and an algorithm that expects obj will not be empty will fail if it is a set. Either way, they aren't interchangeable.
So? An algorithm that doesn't depend on that behavior will still work. Yes, multisets and sets are different types, so not all algorithms that work on one will work properly on the other. Making them polymorphic will make the set of algorithms that work on both larger, which helps with not repeating yourself.
Saying that "count" doesn't make sense on sets is a lot like saying "length" and "width" don't make sense on squares, since those are equal - which would mean code written to handle rectangles wouldn't work on squares, unless they were crammed into rectangle objects.
Funny you should mention that. Squares and rectangles is just another example of the Circle And Ellipse Problem:
http://www.c2.com/cgi/wiki?CircleAndEllipseProblem
You can't expect to use squares everywhere you use rectangles.
I would never argue that you can. If there were identical, we wouldn't have two names for them.
Assuming mutable objects:
That's an *excellent* place to start if you're going to show they're different kinds of objects! Ditto for sets vs. multisets. It's their *values* that determine which class they belong to, so changing their values is the easiest way to break them.
Note that the problem is slightly different if you use immutable objects. Now you can't stretch a square into a rectangular shape, but how do you create your objects? Both of:
But only slightly - you have to know what type of object you're creating if you're going to create them. This is a well-know - and fairly well-analyzed - problem. Created types need to be contravariant instead of covariant.
Part of the problem is that Python's collections types are a collection (no pun intended) of things that have shown themselves to be useful in practice. This has resulted in types that have nearly identical sets of operations on them, but the operations have different names, so the types - which would be polymorphic if they'd been designed to work together - aren't polymorphic. They wouldn't be polymorphic, because they aren't polymorphic. The following is a good example of why they're not polymorphic, but some algorithms are agnostic to the choice of data structure:
They aren't polymorphic because python has chosen names that make them non-polymorphic.
I get bit by this fairly often. Most recently spending last friday afternoon auditing code to make sure I'd turned all the list methods into the appropriate set methods in a graph class's node walking method. Lists are ordered, sets are not. You can *insert* or *append* to a list, but not to a set, because sets aren't ordered.
Of course I can insert and append to a set. Just because the set loses the ordering information doesn't mean the element wasn't added to the set (or not, depending).
True, one could artificially pretend that sets are polymorphic to lists by using misleading method names:
set.add -> set.append
and do-nothing methods like set.sort() and set.reverse(). But apart from satisfying people who have read too many OO books, what would be the point?
Saving time for people who are used to having a well-designed hierarchy of collections by letting them not waste time dealing with name differences over implementation details. Things like repeating myself by writing foo_for_sets and foo_for_dicts and foo_for_lists, or changing method names, or mapping between two representations so I can use code that chose one set of names, or dealing with the headaches of maintaining two copies of a structure. Python has been moving this direction for a long time - int/long unification, lists being acceptable most places that tuples are, adding iterator interfaces to the collection types - all are things that make using those types in a polymorphic manner easier. This is just another step in that direction. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Mike Meyer <mwm-keyword-python.b4bdba@mired.org> wrote:
On Sun, 19 Jul 2009 18:49:09 +1000 Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, 19 Jul 2009 05:47:28 pm Mike Meyer wrote:
A dict or set having a count method would make no sense either. This has nothing to do with polymorphism.
Actually, count on a dict makes *lots* of sense. I can think of a lot of uses for counting the number of occurrences of values in a dictionary, mostly revolving around using dictionaries to implement sparse arrays of various kinds.
But dicts don't allow you to look up on values, you look up on keys.
So? I wasn't talking about looking up values, I was talking about counting them. Of course, lists don't do that very well either - you can only look up two values, or maybe it's one. And that's just another thing that makes changing an implementations from lists to dicts
dict.count(value) might be useful -- I doubt that there are "a lot of uses", but there are probably a few -- but it's hardly a fundamental operation on a dict. I'd argue it would be better as a function count_value(mapping, value) rather than a method.
You seem to have forgot that now we have dict views. I think it should be a method -- but of dict values-view rather than of dict itself. -- Jan Kaliszewski <zuo@chopin.edu.pl>

Mike Meyer wrote:
Saving time for people who are used to having a well-designed hierarchy of collections by letting them not waste time dealing with name differences over implementation details. Things like repeating myself by writing foo_for_sets and foo_for_dicts and foo_for_lists, or changing method names, or mapping between two representations so I can use code that chose one set of names, or dealing with the headaches of maintaining two copies of a structure.
Blurring semantic distinctions (add != append) to make overgeneralisation easier doesn't sound like a good idea to me. Choose the data structure that makes the most sense for the current algorithm. If someone gives you a different data structure, convert it at the beginning and convert it back at the end (think of it as a variant of the decorate-sort-undecorate approach to sorting). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Jul 19, 2009, at 1:07 PM, Nick Coghlan wrote:
Blurring semantic distinctions (add != append) to make overgeneralisation easier doesn't sound like a good idea to me.
Well said. Raymond

2009/7/18 Chris Rebert <pyideas@rebertia.com>:
A PEP is probably not be necessary just to add a class to the std lib. Counter was added without a PEP after all. You would need to convince the committers though.
For something like this, you could release your code as a separate module. Then let it get some popular support before asking for it to be promoted to the standard library. (Of course the proposed set changes might not be possible this way, but they are the most likely aspect to get rejected, because of backward compatibility concerns, in any case). Paul.

On 07/18/2009 02:28 PM, Paul Moore wrote:
2009/7/18 Chris Rebert<pyideas@rebertia.com>:
A PEP is probably not be necessary just to add a class to the std lib. Counter was added without a PEP after all. You would need to convince the committers though.
For something like this, you could release your code as a separate module. Then let it get some popular support before asking for it to be promoted to the standard library. (Of course the proposed set changes might not be possible this way, but they are the most likely aspect to get rejected, because of backward compatibility concerns, in any case).
Paul.
My module (collections_) is a drop in replacement for collections, it imports everything from there, then adds it to __all__ so that you can import it from collections_. The only caveat is that you have to import collections_.set and collections_.frozenset to override the default implementations, but it works. -Michael Lenzen

Michael Lenzen wrote:
On 07/18/2009 01:30 AM, Chris Rebert wrote:
On Fri, Jul 17, 2009 at 11:18 PM, Chris Rebert<pyideas@rebertia.com> wrote:
Truth be told, it's more than just defaultdict(int). It adds .elements() and .most_common().
Seems bag-like to me. - Unordered? Check. - Allows duplicates? Check. - O(1) containment test? Check. - Counts multiplicity of elements? Check. - Iterable? Check.
The only non-bag thing about it is allowing 0 and negative multiplicities, which I agree is unintuitive; I don't like that "feature" either.
Actually, from the docs, it also appears (I don't have 3.0 handy to test) to get len() wrong, using the dict definition of "number of unique items" rather than just "number of items" as would be more appropriate for a bag.
In the event a Bag is not added, +1 for adding a method to Counter to return `sum(count if count> 0 else 0 for count in a_counter.values())`
Cheers, Chris
As well as getting len() wrong, it gets iteration wrong. It iterates over elements with counts of 0 and -1 as well as only iterating once over elements that appear multiple times. Yes you can iterate over .elements(), but this should be the default not a special case.
As for adding most_common, it just calls heapq.nlargest(n, self.items(), key=_itemgetter(1)) which anyone can do, and my bag class does.
My bag class behaves like a collection and provides a .unique_elements() method that returns the underlying set. You can .add(elem) and .delete(elem) just like you can with a set, or you can manually change their multiplicities like in Counter with bag[elem] = 5 or bag[elem] -= 2.
If Counter is supposed to be a collection of elements, this makes no sense:
c = Counter() c['a'] += 1 c['a'] -= 1 'a' in c True
I encourage you to put your questions/concerns regarding the new collections.Counter class into a separate email and send them to python-dev. It seems to me that it is possible some revisions could be made to the API. Whether or not that happens will depend on the precise use cases Raymond had in mind when he added it, but even if nothing changes such an email thread should provide some more insight into the rationale driving the API design choices. Although rather than calling the current API wrong from the outset, I'd suggest phrasing it as asking "why is the interface this way?". We don't know whether or not the API is actually wrong without knowing the objectives Raymond was setting out to achieve. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, Jul 17, 2009 at 6:54 PM, Michael Lenzen<m.lenzen@gmail.com> wrote: <snip>
New ABCs (abstract base classes) -------------------------------- <snip> Mutable - I think there should be one metaclass for all mutable types just like there is currently Iterable, Sized and Container. We could then check instanceof(obj, Mutable). MutableSequence, MutableSet and MutableMapping would inherit Mutable, or we might be able to just do away with them altogether. There are 3 abstractmethods defined: * pop(self) - This is the simplest to implement, all mutable classes already do. * __setitem__(self, key, value) - This is already implemented for list and dict but would have to be defined for set, more later * __delitem__(self, key) Same as above.
I think this would be better named MutableCollection personally, otherwise it's liable to be confused with the more general mutable vs. immutable types distinction. <snip>
Extending set to fit the ABCs ----------------------------- I mentioned before that sets don't currently fit into my model. Here are my propositions for the three methods that would need to be implemented. * set.__setitem__(self, elem, value) - Set whether or not elem is in the set based on the boolean evaluation of value. s[elem] = 1 would call s.add(elem) and s[elem] = False would call s.remove(elem), appropriately raising a KeyError if elem not in s. The reason I chose this way is that the only thing you can set about an element of a set is whether or not it is present.
The implicit bool(), which I assume is what you mean by "boolean evaluation", here concerns me, as it could mask bugs where something tries to treat a set like a mapping. I would recommend throwing TypeError if given a non-int and ValueError if given any integer besides 0 or 1, so as to more strongly enforce the set axioms. Cheers, Chris -- http://blog.rebertia.com

On 07/18/2009 03:32 PM, Chris Rebert wrote:
On Fri, Jul 17, 2009 at 6:54 PM, Michael Lenzen<m.lenzen@gmail.com> wrote: <snip>
New ABCs (abstract base classes) -------------------------------- <snip> Mutable - I think there should be one metaclass for all mutable types just like there is currently Iterable, Sized and Container. We could then check instanceof(obj, Mutable). MutableSequence, MutableSet and MutableMapping would inherit Mutable, or we might be able to just do away with them altogether. There are 3 abstractmethods defined: * pop(self) - This is the simplest to implement, all mutable classes already do. * __setitem__(self, key, value) - This is already implemented for list and dict but would have to be defined for set, more later * __delitem__(self, key) Same as above.
I think this would be better named MutableCollection personally, otherwise it's liable to be confused with the more general mutable vs. immutable types distinction.
<snip>
Extending set to fit the ABCs ----------------------------- I mentioned before that sets don't currently fit into my model. Here are my propositions for the three methods that would need to be implemented. * set.__setitem__(self, elem, value) - Set whether or not elem is in the set based on the boolean evaluation of value. s[elem] = 1 would call s.add(elem) and s[elem] = False would call s.remove(elem), appropriately raising a KeyError if elem not in s. The reason I chose this way is that the only thing you can set about an element of a set is whether or not it is present.
The implicit bool(), which I assume is what you mean by "boolean evaluation", here concerns me, as it could mask bugs where something tries to treat a set like a mapping. I would recommend throwing TypeError if given a non-int and ValueError if given any integer besides 0 or 1, so as to more strongly enforce the set axioms.
Cheers, Chris
Sounds good to me. I changed my implementation and the description on the wiki. I also changed bags to enforce the int restriction. -Lenzen

On Jul 17, 2009, at 6:54 PM, Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
The last addition but the first I will list, because I think it is
FWIW, my goal for the collections module is to keep it relatively small and not to include every variant someone can think of. We've already rejected red-black trees, pairing heaps, and blist (which is a list variant that allows fast insertions). Starting with 2.7 and 3.1, we do have a Counter object that has some multiset capabilities -- I want to see how that fares in the real world before entertaining a proposal for another multiset/bag variant. With respect to ordered sets, I started by posting an efficient recipe on ASPN so that we could see how it gets used. Though I'm a fan of the idea in general, I've found few use cases and have noted nearly zero uptake on the recipe (so the OrderedSet idea may just be YAGNI). The wording of your proposal indicates that it is motivated by a desire to complete the grid of possible variants (frozen/ unfrozen, itemized/multivalued, ordered/unordered, etc), but it would have been better to focus on use cases (existing needs that aren't being met by the current toolsets). After posting the pure Python recipe for ordered sets, my plan had been to wait to see how it was received. If there was an active interest (none so far), I've already worked out the steps needed to extend existing C code to make the built-in sets remember their insertion order (no need for a separate type, just make the regular sets remember their order) -- this can be done with very little impact on performance. So that leaves your proposed extensions to the abstract base classes. My issue with those is that collections ABCs suffer from the same problem as the itertools module -- the more of them you add, the harder it is to learn and remember which ones to use (it is possible to go wild with collections abstractions and then find that zero value has been added). When developing the ABCs, Guido rejected the idea of adding intermediate layers of abstraction (IIRC, he documented his rationale in the PEP). My own thought on the subject is that the collections ABC development needs to be somewhat restrained. They are new and have not yet been widely adopted. The community is just now gaining experience with them and those experiences are vital for making informed choices on how or whether to grow the collections ABCs. Also, the collections ABCs that we already have have not been fully exercised or debugged. The ones we have need to be shaken-out before additional efforts are undertaken to extend them (or to create intermediate layers of abstraction). the coolest
and provides the most functionality, is the new function I propose adding to collections. The function is simply called 'collection' and has the following signature:
collection(iterable=None, mutable=False, ordered=False, unique=False)
There are a couple of problems here. Unifying many collections under a single API has the unfortunate side-effect of freezing development on those collections (it becomes hard to change one without changing them all). Another issue is that the API doesn't seem comfortable to me -- when I want a set or list or dict, I just say so -- I don't want to go through a menu of all of the possible variants. Also, it is not like those variants are readily substitutable for one another -- if your code relies on hashability, then you don't really have the option of switching the mutability flag. To me, this factory function is like a comp sci exercise in hyper-generalization -- it doesn't make any real-world code any cleaner. Sorry for the sour notes. You're definitely showing insight into the relationship between various collection classes but I don't think the ideas represent good, use-case driven language development. The proposals are all too theoretically based; instead, the evolution of the collections module needs to follow a slower path, one that lets the existing tools mature. And tools like OrderedSet need to have some early life as a recipe to see how people use it and to learn from their experiences. Raymond

On Sat, Jul 18, 2009 at 8:45 PM, Raymond Hettinger <python@rcn.com> wrote:
On Jul 17, 2009, at 6:54 PM, Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
FWIW, my goal for the collections module is to keep it relatively small and not to include every variant someone can think of. We've already rejected red-black trees, pairing heaps, and blist (which is a list variant that allows fast insertions).
Michael Lenzen: If you're interested in maintaining a more extensive collections module on PyPi, I'd be happy to collaborate with you as the author of 2 of the 3 aforementioned rejected data structures. :-) In addition to pairing heaps (deprecated) and blist (on PyPi), I've also got a few other data structures that may or may not interest you: - HeapDict (on PyPi): a dictionary where .popitem() returns the item with the lowest value - LRU: a dictionary that maintains hard references to the most recently used n items and weak references to the rest - MultiValueDict: x[5] = 1, x[5] = 2, print x[5] => set([1,2]) - WeakRefSet: the set analog to a WeakRefDict -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

On 07/19/2009 09:22 AM, Daniel Stutzbach wrote:
Michael Lenzen:
If you're interested in maintaining a more extensive collections module on PyPi, I'd be happy to collaborate with you as the author of 2 of the 3 aforementioned rejected data structures. :-)
In addition to pairing heaps (deprecated) and blist (on PyPi), I've also got a few other data structures that may or may not interest you: - HeapDict (on PyPi): a dictionary where .popitem() returns the item with the lowest value - LRU: a dictionary that maintains hard references to the most recently used n items and weak references to the rest - MultiValueDict: x[5] = 1, x[5] = 2, print x[5] => set([1,2]) - WeakRefSet: the set analog to a WeakRefDict
-- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
I am definitely interested in maintaining a more extensive collections module regardless of whether or not my suggestions make it into the Python library. I must admit though, that I am not a proponent of pushing all of the data structures into the standard library. I don't have any experience creating a package, although it doesn't seem too tough, so as much help as you'd like to give would be great. We'd also have to come up with a better name that 'collections_'. If you want to join the project, just shoot me an email. -Michael Lenzen http://code.google.com/p/python-data-structures/

On 07/18/2009 08:45 PM, Raymond Hettinger wrote: First of all, thanks for the input. I was hoping you would read and comment on this.
Starting with 2.7 and 3.1, we do have a Counter object that has some multiset capabilities -- I want to see how that fares in the real world before entertaining a proposal for another multiset/bag variant.
I'm not really a fan of the Counter class, I think that it has some fundamental flaws. The biggest problem that I see is that it doesn't behave like a collection with multiple elements, when you iterate over it you only get the unique elements and len(Counter) returns the number of unique elements instead of the number of items in the multiset. Then there's the fact that non-positive counts are allowed:
c = Counter() c['a'] += 1 c['a'] -= 1 'a' in c True
This isn't a collection, it's something else - a counter. I'm not saying it's not useful, but it's not a multiset. Why I think it should be removed is that `defaultdict(int)` now provides exactly the same functionality.
With respect to ordered sets, I started by posting an efficient recipe on ASPN so that we could see how it gets used. Though I'm a fan of the idea in general, I've found few use cases and have noted nearly zero uptake on the recipe (so the OrderedSet idea may just be YAGNI). The wording of your proposal indicates that it is motivated by a desire to complete the grid of possible variants (frozen/unfrozen, itemized/multivalued, ordered/unordered, etc), but it would have been better to focus on use cases (existing needs that aren't being met by the current toolsets).
After posting the pure Python recipe for ordered sets, my plan had been to wait to see how it was received. If there was an active interest (none so far), I've already worked out the steps needed to extend existing C code to make the built-in sets remember their insertion order (no need for a separate type, just make the regular sets remember their order) -- this can be done with very little impact on performance.
Your implementation of ordered sets also doesn't work for me. I want to be able to access elements by index, take slices of the list, just be able to use it as a list in general. I think it might be a good addition to the current set class, it's basically an improved set.
There are a couple of problems here. Unifying many collections under a single API has the unfortunate side-effect of freezing development on those collections (it becomes hard to change one without changing them all). Another issue is that the API doesn't seem comfortable to me -- when I want a set or list or dict, I just say so -- I don't want to go through a menu of all of the possible variants. Also, it is not like those variants are readily substitutable for one another -- if your code relies on hashability, then you don't really have the option of switching the mutability flag. To me, this factory function is like a comp sci exercise in hyper-generalization -- it doesn't make any real-world code any cleaner.
I'm not as gung ho about the ABCs as I was before. But, the way I see it is that the collection factory/ABC would simplify some situations, especially for new programmers. They wouldn't have to learn the difference between bags, sets, lists and setlists right away. They would just have an easy way to create a collection that provided a basic API, maybe just add, remove, contains and count methods. All the classes are still available by name to those that want them, and that is how I would probably use them.
Sorry for the sour notes. You're definitely showing insight into the relationship between various collection classes but I don't think the ideas represent good, use-case driven language development. The proposals are all too theoretically based; instead, the evolution of the collections module needs to follow a slower path, one that lets the existing tools mature. And tools like OrderedSet need to have some early life as a recipe to see how people use it and to learn from their experiences.
I agree that I need use-cases and will go through my old code and post some in the near future. I also think that use cases will follow if the option is provided, that it is the responsibility of the language developers to push innovation and a certain paradigm. I am firmly convinced that there need to be bag and setlist classes included at least in the collections module, if not the standard library. Python is supposed to be *batteries included* and I see these as glaring omissions. If my arguments were too theoretical it's because I'm a mathematician at heart and am more apt to be convinced by a theoretical argument as opposed to a utilitarian one. -Michael Lenzen

On 07/17/2009 08:54 PM, Michael Lenzen wrote:
In a nutshell, I want to add 2 new classes (and respective frozen varients) to the collections module, namely a bag (multiset) class and a setlist (ordered set/unique list) class. I know this has been floated before, but I think it merits more consideration.
So I have pared down my proposition to just adding bag and setlist classes, without the additional ABCs, modifications to set and collection factory. I know I still haven't posted use cases, but I'm hoping supporters of this will post some too. I haven't given up on the idea of a unified Collections interface and a collection factory, but baby steps are probably better. My implementation is here: http://python-data-structures.googlecode.com/svn/trunk/collections_.py It is a drop in replacement for collections, so
import collections_ as collections or from collections_ import bag, setlist
If I should post this to another more appropriate list, just let me know where. I plan on creating a package and submitting it to pypi , but I would like other people to test it first. -Michael Lenzen
participants (14)
-
Carl Johnson
-
Chris Rebert
-
Daniel Stutzbach
-
Greg Ewing
-
Jan Kaliszewski
-
Leonardo Santagada
-
Mart Sõmermaa
-
Mathias Panzenböck
-
Michael Lenzen
-
Mike Meyer
-
Nick Coghlan
-
Paul Moore
-
Raymond Hettinger
-
Steven D'Aprano