collections.sortedset proposal

Hi, I want to propose to include SortedSet data structure into collections module. SortedSet (name borrowed from Redis) is a basically a mapping of (unique) keys to scores, that allows fast slicing by ordinal number and by score. There are plenty of use cases for the sorted sets: * Leaderboard for a game * Priority queue (that supports task deletion) * Timer list (e.g. can be used for tulip, supports deletion too) * Caches with TTL-based, LFU or LRU eviction (including `functools.lru_cache`) * Search databases with relevance scores * Statistics (many use cases) * Replacement for `collections.Counter` with faster `most_common()` I have first draft of pure python implementation: https://github.com/tailhook/sortedsets http://pypi.python.org/pypi/sortedsets/1.0 The implementation is closely modeled on Redis. Internally it consists of a dict for mapping between keys and scores, and a skiplist for scores. So most operations are done with O(log n) time. The actual performance is probably very slow for pure-python implementation, but can be fixed by C code later. The asymptotic performance seems to be OK. So my questions are: 1. Do you think SortedSets are eligible for inclusion to stdlib? 2. Do I need a PEP? 3. Any comments on the implementation? P.S.: Sorted sets in redis are not the same thing as sorted sets in blist. So maybe a better name? -- Paul

On 12/25/2012 5:24 PM, Paul Colomiets wrote:
Since a set, in general, is not a mapping, I do not understand what you mean. If you mean a mapping from sorted position to item, then I would call it a sortedlist.
There are plenty of use cases for the sorted sets:
* Leaderboard for a game
This looks like an auto-sorted list.
* Priority queue (that supports task deletion)
This looks like something else.
* Timer list (e.g. can be used for tulip, supports deletion too) * Caches with TTL-based, LFU or LRU eviction (including `functools.lru_cache`)
These look like sorted lists.
* Search databases with relevance scores * Statistics (many use cases)
These are rather vague.
* Replacement for `collections.Counter` with faster `most_common()`
This looks like something else.
The standard answer is to list on or submit to pypi and get community approval and adoption. Then a pep with a commitment to maintenance even while others interfere with your 'baby'. Long-time core committers sometimes get to cut the process short, but even Guido is starting his propose async module/package with a pep and publicly available code for 3.3. -- Terry Jan Reedy

Hi, On Wed, Dec 26, 2012 at 9:58 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Ok. My description is vague. Here is one from Redis documentation: Redis Sorted Sets are, similarly to Redis Sets, non repeating collections of Strings. The difference is that every member of a Sorted Set is associated with score, that is used in order to take the sorted set ordered, from the smallest to the greatest score. While members are unique, scores may be repeated. http://redis.io/topics/data-types I was just so silly to suppose that everybody knows Redis data types.
Yep. The crucial property is fast insertion and updates.
* Priority queue (that supports task deletion)
This looks like something else.
Priority queue is basically an auto-sorted list too. No?
Yup. But we can't call the data structure SortedList, because elements must be unique.
Yes. Included just to give some overview.
* Replacement for `collections.Counter` with faster `most_common()`
This looks like something else.
Why? If you have a list sorted by counter values, you can have `most_common()` by slicing.
It's on the PyPI now. I know the standard answer :) So you don't understand what SortedSets are and what would be good name for data structure, or do you think it's useless? The crucial point of adoption, is that most of the time people don't want to add additional dependency for simple tasks like priority queue, even if it's faster or more featureful. And I think that SortedSets in Redis have proved their usefulness as a data structure. -- Paul

On Wed, Dec 26, 2012 at 7:59 PM, Paul Colomiets <paul@colomiets.name> wrote:
Perhaps you mean a heap queue? The standard library doesn't have a separate type for that, it just has some functions for treating a list as a heap: http://docs.python.org/2/library/heapq.html Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi Nick, On Wed, Dec 26, 2012 at 12:12 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem with heap queue (as implemented in python) as priority queue or list of timers is that it does not support deletion of the tasks (at least not in efficient manner). For other use cases, e.g. for a leader board heapq doesn't allow efficient slicing. Or do you mean "heap queue" is a nice name for the data structure that redis calls "sorted set"? -- Paul

On Wed, Dec 26, 2012 at 8:59 PM, Paul Colomiets <paul@colomiets.name> wrote:
I mean if what you want is a heap queue with a more efficient heappop() implementation (due to a different underlying data structure), then it's probably clearer to call it that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

For the record, there is another implementation of skiplists on PyPI: http://pypi.python.org/pypi/skiplist/0.1.0 2012/12/26 Paul Colomiets <paul@colomiets.name>
-- Michele Lacchia

On 12/25/2012 5:24 PM, Paul Colomiets wrote:
Since a set, in general, is not a mapping, I do not understand what you mean. If you mean a mapping from sorted position to item, then I would call it a sortedlist.
There are plenty of use cases for the sorted sets:
* Leaderboard for a game
This looks like an auto-sorted list.
* Priority queue (that supports task deletion)
This looks like something else.
* Timer list (e.g. can be used for tulip, supports deletion too) * Caches with TTL-based, LFU or LRU eviction (including `functools.lru_cache`)
These look like sorted lists.
* Search databases with relevance scores * Statistics (many use cases)
These are rather vague.
* Replacement for `collections.Counter` with faster `most_common()`
This looks like something else.
The standard answer is to list on or submit to pypi and get community approval and adoption. Then a pep with a commitment to maintenance even while others interfere with your 'baby'. Long-time core committers sometimes get to cut the process short, but even Guido is starting his propose async module/package with a pep and publicly available code for 3.3. -- Terry Jan Reedy

Hi, On Wed, Dec 26, 2012 at 9:58 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Ok. My description is vague. Here is one from Redis documentation: Redis Sorted Sets are, similarly to Redis Sets, non repeating collections of Strings. The difference is that every member of a Sorted Set is associated with score, that is used in order to take the sorted set ordered, from the smallest to the greatest score. While members are unique, scores may be repeated. http://redis.io/topics/data-types I was just so silly to suppose that everybody knows Redis data types.
Yep. The crucial property is fast insertion and updates.
* Priority queue (that supports task deletion)
This looks like something else.
Priority queue is basically an auto-sorted list too. No?
Yup. But we can't call the data structure SortedList, because elements must be unique.
Yes. Included just to give some overview.
* Replacement for `collections.Counter` with faster `most_common()`
This looks like something else.
Why? If you have a list sorted by counter values, you can have `most_common()` by slicing.
It's on the PyPI now. I know the standard answer :) So you don't understand what SortedSets are and what would be good name for data structure, or do you think it's useless? The crucial point of adoption, is that most of the time people don't want to add additional dependency for simple tasks like priority queue, even if it's faster or more featureful. And I think that SortedSets in Redis have proved their usefulness as a data structure. -- Paul

On Wed, Dec 26, 2012 at 7:59 PM, Paul Colomiets <paul@colomiets.name> wrote:
Perhaps you mean a heap queue? The standard library doesn't have a separate type for that, it just has some functions for treating a list as a heap: http://docs.python.org/2/library/heapq.html Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi Nick, On Wed, Dec 26, 2012 at 12:12 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem with heap queue (as implemented in python) as priority queue or list of timers is that it does not support deletion of the tasks (at least not in efficient manner). For other use cases, e.g. for a leader board heapq doesn't allow efficient slicing. Or do you mean "heap queue" is a nice name for the data structure that redis calls "sorted set"? -- Paul

On Wed, Dec 26, 2012 at 8:59 PM, Paul Colomiets <paul@colomiets.name> wrote:
I mean if what you want is a heap queue with a more efficient heappop() implementation (due to a different underlying data structure), then it's probably clearer to call it that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

For the record, there is another implementation of skiplists on PyPI: http://pypi.python.org/pypi/skiplist/0.1.0 2012/12/26 Paul Colomiets <paul@colomiets.name>
-- Michele Lacchia
participants (5)
-
Michele Lacchia
-
Nick Coghlan
-
Paul Colomiets
-
Serhiy Storchaka
-
Terry Reedy