Optional key to `bisect`'s functions?

The ``bisect`` stuff is pretty neat, although probably underused (especially the insorts), but their usefulness is limited by the requirement that the lists directly contain sortable items, as opposed to ``sorted`` or ``list.sort``. It's possible to "use" them by copy/pasting the (Python) functions into the project/library code and adding either a custom key directly or a key function, but while this can still yield an order-of-magnitude speed gain over post-sorting sequences, it's cumbersome and it loses the advantage of _bisect's accelerators. Therefore, I believe it would be pretty neat to add an optional ``key=`` keyword (only?) argument, with the same semantics as in ``sorted``. It would make ``bisect`` much easier to use especially in stead of append + sorted combinations. The key should work for both insertion functions and bisection search ones. Thoughts?

Hi, 2012/2/8 Masklinn <masklinn@masklinn.net>
This was proposed several times on the issue tracker (search for "bisect key"), and these proposals have always been rejected: http://bugs.python.org/issue4356 http://bugs.python.org/issue1451588 http://bugs.python.org/issue3374 The last one summarizes the reasons of the rejection. The documentation (http://docs.python.org/library/bisect.html, "see also") contains a link to a "SortedCollection" recipe. I haven't looked at the SortedCollection class in detail, but you could try to have it included in the stdlib... -- Amaury Forgeot d'Arc

Hmm... I disagree with Raymond's rejection of the proposed feature. I have come across use cases for this functionality multiple time in real life. Basically Raymond says "bisect can call the key() function many times, which leads to bad design". His alternative, to use a list of (key, value) tuples, is often a bit clumsy when passing the sorted list to another function (e.g. for printing); having to transform the list using e.g. [v for (k, v) in a] feels clumsy and suboptimal. So I'm not sure that refusing the key= option always leads to the best design (in the sense of the most readable code). Adding key= is particularly attractive since the current invariant is something like "if a == sorted(a) before the operation, then a == sorted(a) after the operation". Adding a key= option would simply change that to sorted(a, key=key) on both counts. Also note that "many times" is actually O(log N) per insertion, which isn't so bad. The main use case for bisect() is to manage a list that sees updates *and* iterations -- otherwise building the list unsorted and sorting it at the end would make more sense. The key= option provides a balance between the cost/elegance for updates and for iterations. --Guido On Wed, Feb 8, 2012 at 2:18 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com>wrote:
-- --Guido van Rossum (python.org/~guido)

On 2/8/2012 7:25 PM, Guido van Rossum wrote:
An alternative to the n x 2 array is two n-arrays or a 2 x n array. Then there is no problem using either the keys or the values array. To use insort_right or insort_left for this, they would have to return the insertion position instead of None. Right now one must use bisect_right or _left and then .insert into both arrays instead of just the vals array.
For *large enough* lists, the O(n*n) cost of insertions will dominate the O(n*logN) key() calls, so reducing the latter to O(n) key calls will not matter. -- Terry Jan Reedy

On 2012-02-09, at 01:25 , Guido van Rossum wrote:
Yes, this is the kind of things which prompted my original email. It is even clumsier when there are many (smaller) lists to manipulate and insert into in turn, and requires two verbose and potentially expensive phases of decoration and undecoration. Using two separate lists has similar (though simpler) issues, especially when producing API-related structures as the "helper" collection must be cleaned up during an undecoration phase. And as Terry notes, this gets very clumsy as each candidate insertion now requires three statements (a call to bisect_right to get the insertion index followed by an insertion into the helper list and an other one for the actual list).

On 9 February 2012 00:25, Guido van Rossum <guido@python.org> wrote:
Also, in Python 3 one can't assume that values will be comparable so the (key, value) tuple trick won't work: comparing the tuples may well throw a TypeError. Here's a simple example below. The class 'Person' has no natural order, but we may want to keep a list of people sorted by iq:
-- Arnaud

On Feb 8, 2012, at 4:25 PM, Guido van Rossum wrote:
Also note that "many times" is actually O(log N) per insertion, which isn't so bad. The main use case for bisect() is to manage a list that sees updates *and* iterations -- otherwise building the list unsorted and sorting it at the end would make more sense. The key= option provides a balance between the cost/elegance for updates and for iterations.
Would you be open to introducing a SortedList class to encapsulate the data so that key functions get applied no more than once per record and the sort order is maintained as new items are inserted? ISTM, the whole problem with bisect is that the underlying list is naked, leaving no way to easily correlate the sort keys with the corresponding sorted records. Raymond

On Thu, Feb 9, 2012 at 9:27 AM, Raymond Hettinger < raymond.hettinger@gmail.com> wrote:
Hm. A good implementation of such a thing would probably require a B-tree implementation (or some other tree). That sounds like a good data type to have in the collections module, but doesn't really address the desire to use bisect on a list. Also it further erodes my desire not to bother the programmer with subtle decisions about the choice of data type: the most basic types (string, number, list, dict) are easy enough to distinguish, and further choices between subclasses or alternatives are usually a distraction.
That same "problem" would exist for sorted() and list.sort(), and the solution there (parameterize it with a key= option) easily generalizes to bisect (and to heapq, for that matter). The more fundamental "conflict" here seems to be between algorithms and classes. list.sort(), bisect and heapq focus on the algorithm. In some sense they reflect the state of the world before object-oriented programming was invented. Sometimes it is useful to encapsulate these in classes. Other times, the encapsulation doesn't add to the clarity of the program. One more thing: bisect.py doesn't only apply to insertions. It is also useful to find a "nearest" elements in a pre-sorted list. Probably that list was sorted using list.sort(), possibly using key=... -- --Guido van Rossum (python.org/~guido)

On Feb 9, 2012, at 9:48 AM, Guido van Rossum wrote:
The more fundamental "conflict" here seems to be between algorithms and classes. list.sort(), bisect and heapq focus on the algorithm.
Bisect in particular had way too much focus on the algorithm. The API is awkward and error-prone for many common use cases. I've tried to remedy that through documenting how to implement the common use cases: http://docs.python.org/py3k/library/bisect.html#searching-sorted-lists The issue is that the current API focuses on "insertion points" rather than on finding values. Unfortunately, this API is very old, so the only way to fix it is to introduce a new class. If we introduced class around a sorted sequence, then we could make an reasonable API that corresponds to what people usually want to do with sorted sequences. Of course, that still leaves the issue with an O(n) insort. As Daniel pointed-out, a list is not the correct underlying data structure if you want to do periodic insertions and deletions. Raymond

On Thu, Feb 9, 2012 at 12:34 PM, Raymond Hettinger < raymond.hettinger@gmail.com> wrote:
Maybe you're overanalyzing the problem? It seems what you want would require a PEP and/or a reference implementation that is thoroughly tested as a 3rd party package before it warrants inclusion into the stdlib. In the mean time adding a key= option that echoes the API offered by list.sort() and sorted() is a no-brainer. -- --Guido van Rossum (python.org/~guido)

On Wed, Feb 8, 2012 at 4:25 PM, Guido van Rossum <guido@python.org> wrote:
Maintaining a sorted list using Python's list type is a trap. The bisect is O(log n), but insertion and deletion are still O(n). A SortedList class that provides O(log n) insertions is useful from time to time. There are several existing implementations available (I wrote one of them, on top of my blist type), each with their pros and cons. -- Daniel Stutzbach

On Feb 9, 2012, at 9:50 AM, Daniel Stutzbach wrote:
Maintaining a sorted list using Python's list type is a trap. The bisect is O(log n), but insertion and deletion are still O(n).
A SortedList class that provides O(log n) insertions is useful from time to time. There are several existing implementations available (I wrote one of them, on top of my blist type), each with their pros and cons.
I concur. People who want to maintain sorted collections (periodically adding and deleting items) are far better-off using your blist, a binary tree, or an in-memory sqlite database. Otherwise, we will have baited them into a non-scalable O(n) solution. Unfortunately, when people see the word "bisect", they will presume they've got O(log n) code. We've documented the issue with bisect.insort(), but the power of suggestion is very strong. Raymond

On 2/9/2012 3:16 PM, Raymond Hettinger wrote:
The omitted constants are such that the log n term dominates for 'small' n. list.sort internally uses binary insert sort for n up to 64. It only switches to mergesort for runs of at least 64.
Are your blist leaves lists (or arrays) of some maximum size?
Using insort on a list of a millions items is definitely not a good idea, but I can see how someone not so aware of scaling issues might be tempted, especially with no stdlib alternative. One could almost be tempted to issue a warning if 'hi' is 'too large'. -- Terry Jan Reedy

On 2012-02-09, at 03:42 , Terry Reedy wrote:
http://bugs.python.org/issue4356 Suggests a ``key=`` argument behaving as with ``sorted`` and ``list.sort``: collection values are decorated with the key before comparisons. This is exactly my original email. http://bugs.python.org/issue1451588 Suggests a ``cmp=`` argument (proposal precedes Python 3 and ``key=`` taking over) to use instead of the built-in comparison operator. http://bugs.python.org/issue3374 Suggests all of ``cmp=`` (this again being opposed in ``cmp=`` having been dropped from Python 3), ``key=`` and ``reverse=``. In summary, all three suggest following the existing API of ``list.sort`` and ``sorted``, and at least implementing its ``key=`` argument (I am taking issue1451588 as doing so, since it suggests the mechanism and argument which preceded ``key``)

Hi, 2012/2/8 Masklinn <masklinn@masklinn.net>
This was proposed several times on the issue tracker (search for "bisect key"), and these proposals have always been rejected: http://bugs.python.org/issue4356 http://bugs.python.org/issue1451588 http://bugs.python.org/issue3374 The last one summarizes the reasons of the rejection. The documentation (http://docs.python.org/library/bisect.html, "see also") contains a link to a "SortedCollection" recipe. I haven't looked at the SortedCollection class in detail, but you could try to have it included in the stdlib... -- Amaury Forgeot d'Arc

Hmm... I disagree with Raymond's rejection of the proposed feature. I have come across use cases for this functionality multiple time in real life. Basically Raymond says "bisect can call the key() function many times, which leads to bad design". His alternative, to use a list of (key, value) tuples, is often a bit clumsy when passing the sorted list to another function (e.g. for printing); having to transform the list using e.g. [v for (k, v) in a] feels clumsy and suboptimal. So I'm not sure that refusing the key= option always leads to the best design (in the sense of the most readable code). Adding key= is particularly attractive since the current invariant is something like "if a == sorted(a) before the operation, then a == sorted(a) after the operation". Adding a key= option would simply change that to sorted(a, key=key) on both counts. Also note that "many times" is actually O(log N) per insertion, which isn't so bad. The main use case for bisect() is to manage a list that sees updates *and* iterations -- otherwise building the list unsorted and sorting it at the end would make more sense. The key= option provides a balance between the cost/elegance for updates and for iterations. --Guido On Wed, Feb 8, 2012 at 2:18 PM, Amaury Forgeot d'Arc <amauryfa@gmail.com>wrote:
-- --Guido van Rossum (python.org/~guido)

On 2/8/2012 7:25 PM, Guido van Rossum wrote:
An alternative to the n x 2 array is two n-arrays or a 2 x n array. Then there is no problem using either the keys or the values array. To use insort_right or insort_left for this, they would have to return the insertion position instead of None. Right now one must use bisect_right or _left and then .insert into both arrays instead of just the vals array.
For *large enough* lists, the O(n*n) cost of insertions will dominate the O(n*logN) key() calls, so reducing the latter to O(n) key calls will not matter. -- Terry Jan Reedy

On 2012-02-09, at 01:25 , Guido van Rossum wrote:
Yes, this is the kind of things which prompted my original email. It is even clumsier when there are many (smaller) lists to manipulate and insert into in turn, and requires two verbose and potentially expensive phases of decoration and undecoration. Using two separate lists has similar (though simpler) issues, especially when producing API-related structures as the "helper" collection must be cleaned up during an undecoration phase. And as Terry notes, this gets very clumsy as each candidate insertion now requires three statements (a call to bisect_right to get the insertion index followed by an insertion into the helper list and an other one for the actual list).

On 9 February 2012 00:25, Guido van Rossum <guido@python.org> wrote:
Also, in Python 3 one can't assume that values will be comparable so the (key, value) tuple trick won't work: comparing the tuples may well throw a TypeError. Here's a simple example below. The class 'Person' has no natural order, but we may want to keep a list of people sorted by iq:
-- Arnaud

On Feb 8, 2012, at 4:25 PM, Guido van Rossum wrote:
Also note that "many times" is actually O(log N) per insertion, which isn't so bad. The main use case for bisect() is to manage a list that sees updates *and* iterations -- otherwise building the list unsorted and sorting it at the end would make more sense. The key= option provides a balance between the cost/elegance for updates and for iterations.
Would you be open to introducing a SortedList class to encapsulate the data so that key functions get applied no more than once per record and the sort order is maintained as new items are inserted? ISTM, the whole problem with bisect is that the underlying list is naked, leaving no way to easily correlate the sort keys with the corresponding sorted records. Raymond

On Thu, Feb 9, 2012 at 9:27 AM, Raymond Hettinger < raymond.hettinger@gmail.com> wrote:
Hm. A good implementation of such a thing would probably require a B-tree implementation (or some other tree). That sounds like a good data type to have in the collections module, but doesn't really address the desire to use bisect on a list. Also it further erodes my desire not to bother the programmer with subtle decisions about the choice of data type: the most basic types (string, number, list, dict) are easy enough to distinguish, and further choices between subclasses or alternatives are usually a distraction.
That same "problem" would exist for sorted() and list.sort(), and the solution there (parameterize it with a key= option) easily generalizes to bisect (and to heapq, for that matter). The more fundamental "conflict" here seems to be between algorithms and classes. list.sort(), bisect and heapq focus on the algorithm. In some sense they reflect the state of the world before object-oriented programming was invented. Sometimes it is useful to encapsulate these in classes. Other times, the encapsulation doesn't add to the clarity of the program. One more thing: bisect.py doesn't only apply to insertions. It is also useful to find a "nearest" elements in a pre-sorted list. Probably that list was sorted using list.sort(), possibly using key=... -- --Guido van Rossum (python.org/~guido)

On Feb 9, 2012, at 9:48 AM, Guido van Rossum wrote:
The more fundamental "conflict" here seems to be between algorithms and classes. list.sort(), bisect and heapq focus on the algorithm.
Bisect in particular had way too much focus on the algorithm. The API is awkward and error-prone for many common use cases. I've tried to remedy that through documenting how to implement the common use cases: http://docs.python.org/py3k/library/bisect.html#searching-sorted-lists The issue is that the current API focuses on "insertion points" rather than on finding values. Unfortunately, this API is very old, so the only way to fix it is to introduce a new class. If we introduced class around a sorted sequence, then we could make an reasonable API that corresponds to what people usually want to do with sorted sequences. Of course, that still leaves the issue with an O(n) insort. As Daniel pointed-out, a list is not the correct underlying data structure if you want to do periodic insertions and deletions. Raymond

On Thu, Feb 9, 2012 at 12:34 PM, Raymond Hettinger < raymond.hettinger@gmail.com> wrote:
Maybe you're overanalyzing the problem? It seems what you want would require a PEP and/or a reference implementation that is thoroughly tested as a 3rd party package before it warrants inclusion into the stdlib. In the mean time adding a key= option that echoes the API offered by list.sort() and sorted() is a no-brainer. -- --Guido van Rossum (python.org/~guido)

On Wed, Feb 8, 2012 at 4:25 PM, Guido van Rossum <guido@python.org> wrote:
Maintaining a sorted list using Python's list type is a trap. The bisect is O(log n), but insertion and deletion are still O(n). A SortedList class that provides O(log n) insertions is useful from time to time. There are several existing implementations available (I wrote one of them, on top of my blist type), each with their pros and cons. -- Daniel Stutzbach

On Feb 9, 2012, at 9:50 AM, Daniel Stutzbach wrote:
Maintaining a sorted list using Python's list type is a trap. The bisect is O(log n), but insertion and deletion are still O(n).
A SortedList class that provides O(log n) insertions is useful from time to time. There are several existing implementations available (I wrote one of them, on top of my blist type), each with their pros and cons.
I concur. People who want to maintain sorted collections (periodically adding and deleting items) are far better-off using your blist, a binary tree, or an in-memory sqlite database. Otherwise, we will have baited them into a non-scalable O(n) solution. Unfortunately, when people see the word "bisect", they will presume they've got O(log n) code. We've documented the issue with bisect.insort(), but the power of suggestion is very strong. Raymond

On 2/9/2012 3:16 PM, Raymond Hettinger wrote:
The omitted constants are such that the log n term dominates for 'small' n. list.sort internally uses binary insert sort for n up to 64. It only switches to mergesort for runs of at least 64.
Are your blist leaves lists (or arrays) of some maximum size?
Using insort on a list of a millions items is definitely not a good idea, but I can see how someone not so aware of scaling issues might be tempted, especially with no stdlib alternative. One could almost be tempted to issue a warning if 'hi' is 'too large'. -- Terry Jan Reedy

On 2012-02-09, at 03:42 , Terry Reedy wrote:
http://bugs.python.org/issue4356 Suggests a ``key=`` argument behaving as with ``sorted`` and ``list.sort``: collection values are decorated with the key before comparisons. This is exactly my original email. http://bugs.python.org/issue1451588 Suggests a ``cmp=`` argument (proposal precedes Python 3 and ``key=`` taking over) to use instead of the built-in comparison operator. http://bugs.python.org/issue3374 Suggests all of ``cmp=`` (this again being opposed in ``cmp=`` having been dropped from Python 3), ``key=`` and ``reverse=``. In summary, all three suggest following the existing API of ``list.sort`` and ``sorted``, and at least implementing its ``key=`` argument (I am taking issue1451588 as doing so, since it suggests the mechanism and argument which preceded ``key``)
participants (7)
-
Amaury Forgeot d'Arc
-
Arnaud Delobelle
-
Daniel Stutzbach
-
Guido van Rossum
-
Masklinn
-
Raymond Hettinger
-
Terry Reedy