Exploiting type-homogeneity in list.sort() (again!)

(Summary of results: my patch at https://bugs.python.org/issue28685 makes list.sort() 30-50% faster in common cases, and at most 1.5% slower in the uncommon worst case.)

Hello all,

You may remember seeing some messages on here about optimizing list.sort() by exploiting type-homogeneity: since comparing apples and oranges is uncommon (though possible, e.g. comparing a float to an int), it pays off to check whether the list is type-homogeneous (as well as homogeneous with respect to some other properties), and, if it is, to replace calls to PyObject_RichCompareBool with calls to ob_type->tp_richcompare (or, in common special cases, to optimized compare functions). The entire patch is less than 250 lines of code, most of which is fairly boilerplate (e.g. a lot of assertions in #ifdef Py_DEBUG blocks).

I originally wrote that patch back in November. I've learned a lot since then, both about CPython and about mailing list etiquette :). Previous discussion about this can be found at https://mail.python.org/pipermail/python-dev/2016-October/146648.html and https://mail.python.org/pipermail/python-ideas/2016-October/042807.html.

Anyway, I recently redid the benchmarks much more rigorously (in preparation for presenting this project at my school's science fair), achieving a standard deviation of less than 0.5% of the mean for all measurements. The exact benchmark script used can be found at https://github.com/embg/python-fastsort-benchmark (it just sorts random lists of [type], or lists of tuples of [type]. While listsort.txt talks about benchmarking different kinds of structured lists, rather than just random lists, the results here would hold in those cases just as well, because this optimization makes individual comparisons cheaper instead of reducing the number of comparisons based on structure).

I also made a poster describing the optimization, including a pretty graph displaying the benchmark data: https://github.com/embg/python-fastsort-benchmark/blob/master/poster.pdf. For those who would rather read the results here (though it is a *really* pretty graph):

***
Percent improvement for sorting random lists of [type] (1 - patched/unpatched):
float: 48%
bounded int (magnitude smaller than 2^32): 48.4%
latin string (all characters in [0,255]): 32.7%
general int (reasonably uncommon?): 17.2%
general string (reasonably uncommon?): 9.2%
tuples of float: 63.2%
tuples of bounded int: 64.8%
tuples of latin string: 55.8%
tuples of general int: 50.3%
tuples of general string: 44.1%
tuples of heterogeneous: 41.5%
heterogeneous (lots of float with an int at the end; worst-case): -1.5%
***

Essentially, it's a gamble where the payoff is 20-30 times greater than the cost, and the odds of losing are very small. Sorting is perhaps not a bottleneck in most applications, but considering how much work has gone into Python's sort (Timsort, etc.; half of listobject.c is sort code), I think it's interesting that list.sort() can be made essentially twice as fast by a relatively simple optimization.

I would also add that Python dictionaries already implement this kind of optimization: they start out optimizing based on the assumption that they'll only be seeing string keys, checking to make sure that assumption holds as they go. If they see a non-string key, they permanently switch over to the general implementation.
So it's really the same idea, except that here it doesn't matter as much which type we're dealing with. That's important, because lists are commonly used with lots of different types, whereas dictionaries overwhelmingly use string keys, especially internally. (Correct me if I'm wrong in any of the above.)

I got a lot of great feedback/input on this patch as I was writing it, but after submitting it, I didn't hear much from anybody. (The reason I took so long to post was that I wanted to wait until I had the chance to do the benchmarks *right*.)

What do you all think?

Thanks,
Elliot
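For a sense of the control flow, here is a rough Python-level sketch of the pre-sort check (the actual patch is C code in Objects/listobject.c; the function name and structure below are only illustrative, not taken from the patch):

    def choose_compare(keys):
        # One O(n) pass over the keys, bailing out at the first type mismatch.
        if not keys:
            return lambda a, b: a < b
        first_type = type(keys[0])
        for k in keys[1:]:
            if type(k) is not first_type:
                # Heterogeneous list: keep the fully general compare
                # (the analogue of PyObject_RichCompareBool).
                return lambda a, b: a < b
        if first_type is float:
            # Homogeneous floats: a specialized compare that skips per-call
            # safety checks is valid for the whole sort.
            return float.__lt__
        # Homogeneous list of some other type: dispatch once to the type's own
        # comparison (the analogue of calling ob_type->tp_richcompare directly).
        return first_type.__lt__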

On Sat, Mar 4, 2017 at 11:19 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
Thanks for the hard work, this looks very promising.
This point seems wrong on its face. The main point of Timsort is to avoid comparisons in mostly-sorted lists. It makes a lot of difference whether the initial data is random or mostly-sorted.

I presume the type homogeneity check takes a fixed O(N) time to perform. In the random case, the sort will take O(N log N) comparisons. As a multiplier, naive comparisons are clearly more expensive than the type check. But whether that type-checking overhead is worthwhile obviously depends on the number of comparisons, which might be O(N) if the data is sorted. Real-world data tends to be mostly-sorted. So it would be useful for your benchmarks to include:

A) Performance on completely sorted data
   i) Of homogeneous type
   ii) Of mixed type
B) Performance on mostly-sorted data
   i) Of homogeneous type
   ii) Of mixed type

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 10:51 AM David Mertz <mertz@gnosis.cx> wrote:
Thanks for the hard work, this looks very promising.
Thank you!
You are entirely correct, as the benchmarks below demonstrate. I used the benchmark lists from Objects/listsort.txt, which are:

\sort: descending data
/sort: ascending data
3sort: ascending, then 3 random exchanges
+sort: ascending, then 10 random at the end
%sort: ascending, then randomly replace 1% of elements w/ random values
~sort: many duplicates
=sort: all equal

My results are below (the script can be found at https://github.com/embg/python-fastsort-benchmark/blob/master/bench-structur... ):

Homogeneous ([int]):
\sort: 54.6%
/sort: 56.5%
3sort: 53.5%
+sort: 55.3%
%sort: 52.4%
~sort: 48.0%
=sort: 45.2%

Heterogeneous ([int]*n + [0.0]):
\sort: -17.2%
/sort: -19.8%
3sort: -18.0%
+sort: -18.8%
%sort: -10.0%
~sort: -2.1%
=sort: -21.3%

As you can see, because there's a lot less non-comparison overhead in the structured lists, the impact of the optimization is much greater, both in performance gain and in worst-case cost. However, I would argue that these data do not invalidate the utility of my patch: the probability of encountering a type-heterogeneous list is certainly less than 5% in practice. So the expected savings, even for structured lists, is something like (5%)(-20%) + (95%)(50%) = 46.5%. And, of course, not *all* the lists one encounters in practice are structured; certainly not *this* structured. So, overall, I would say the numbers above are extremely encouraging.

Thanks for pointing out the need for this benchmark, though!
Elliot
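For readers who want to reproduce these list shapes without digging through the script, here is a rough Python reconstruction from the one-line descriptions above (the exact construction used by the benchmark script and by listsort.txt may differ):

    import random

    def make_case(name, n=2**15):
        data = [random.random() for _ in range(n)]
        asc = sorted(data)
        if name == '\\sort':                    # descending data
            return sorted(data, reverse=True)
        if name == '/sort':                     # ascending data
            return asc
        if name == '3sort':                     # ascending, then 3 random exchanges
            for _ in range(3):
                i, j = random.randrange(n), random.randrange(n)
                asc[i], asc[j] = asc[j], asc[i]
            return asc
        if name == '+sort':                     # ascending, then 10 random at the end
            return asc[:-10] + [random.random() for _ in range(10)]
        if name == '%sort':                     # ascending, 1% replaced with random values
            for _ in range(n // 100):
                asc[random.randrange(n)] = random.random()
            return asc
        if name == '~sort':                     # many duplicates
            return [random.choice(asc[:4]) for _ in range(n)]
        if name == '=sort':                     # all equal
            return [42.0] * n
        raise ValueError(name)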

Good job. I'll read your work later.
Please note that dict is very, very special. Since Python is a dynamic language, name resolution is done at runtime, not compile time, and most name resolutions involve looking up a string key in a dict. So please don't use string-key dict specialization to rationalize list-sort specialization.

Each specialization should be considered carefully, balancing maintenance cost against benefit. I feel two (int, string) or maybe three (int, string, tuple) special cases may be worthwhile if the code is clean and the benefit looks good. But more complex cases can be done in third-party libraries, like numpy. In a third-party library you can use more powerful optimizations like NaN-boxing; in Python's builtin list this is hard, because we guarantee that object identity does not change.

Regards,

On Sun, Mar 5, 2017 at 6:30 PM INADA Naoki <songofacandy@gmail.com> wrote:
Good job. I'll read your work later.
Thanks!

So please don't use string-key dict specialization to rationalize list-sort specialization.
I'm not quite doing that: I agree that dict is a special case because of internal use. I'm just saying that I'm not the first person to try to exploit type homogeneity in data structures in CPython. However, my justification is independent of dict specialization. Namely, it rests on the following two considerations:

(1) Lists are idiomatically type-homogeneous. To quote the Python documentation, "Lists are mutable, and *their elements are usually homogeneous* and are accessed by iterating over the list".

(2) It's not very common to have to "compare apples and oranges". While it's technically possible to define comparison between any two types you like, and in practice one sometimes compares e.g. ints and floats, it's pretty safe to assume the lists you're sorting are going to be type-homogeneous 95% or 99% of the time.

I feel two (int, string) or maybe three (int, string, tuple) special cases may be worthwhile if the code is clean and the benefit looks good.
I agree -- I only optimized for int, string, float, and tuple. However, my code also optimizes sorting for *any* type-homogeneous list, by simply replacing PyObject_RichCompareBool with ob_type->tp_richcompare, saving the method lookup time. This is what gives the speedups for non-latin strings and bigints in the benchmarks I shared in my original post: I don't have a special-case compare function for them, but by setting the compare function pointer equal to ob_type->tp_richcompare, I can at least save a little bit of method lookup time.

In terms of clean code, the special-case compare functions include assertions in #ifdef Py_DEBUG blocks that list all the requirements necessary to make them safe. The pre-sort check then verifies the assumptions for whichever optimized compare is going to be used. So all that's necessary to verify correctness is to convince oneself that (1) the assertions are sufficient and (2) the pre-sort check will never use a compare function for which it cannot prove the assertions. These two points are pretty easy to check:

(1) Just plug the assertions into the original code, e.g. unicode_compare in Objects/unicodeobject.c. Then you have a bunch of if(1) and if(0) blocks, and if you delete the if(0) blocks you'll get exactly what I have in the patch.
(2) The pre-sort check code is very simple and well commented.

I would add further that this patch only actually adds one line of code to listsort: assign_compare_function(lo.keys, saved_ob_size). The rest of the patch is just function definitions, and replacing PyObject_RichCompareBool with a function pointer (in the macro for ISLT(X,Y)).

Anyway, thanks in advance for your time!
Elliot

On Mon, Mar 6, 2017 at 12:53 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I would be rather curious to know how frequently a list consists of "numbers", but a mix of ints and floats. From the point of view of a list's purpose, they're all numbers (point 1 satisfied), and they're all comparable (point 2 satisfied), but from the POV of your patch, it's heterogeneous and suffers a performance penalty. Does it happen a lot in real-world code? (Apologies if this has already been mentioned; I've been skimming the thread, not reading it in detail.) ChrisA

On Sun, Mar 5, 2017 at 7:50 PM Chris Angelico <rosuav@gmail.com> wrote:
This is of course undecidable to verify statically, so we can't just crawl PyPI... however, I would argue that using mixed float-int lists is dangerous, and is more dangerous in Python 3 than in Python 2. So hopefully this is not very common. However, even if 10% (surely a vast overestimate) of sort calls are to mixed int-float lists, my patch would still yield a significant savings on average.

(Apologies if this has already been mentioned; I've been skimming the thread, not reading it in detail.)
It hasn't been mentioned in this thread :)

On Mon, Mar 6, 2017 at 2:03 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I agree that it's dangerous, but it is still common for programmers and Python alike to treat 10 as functionally identical to 10.0 - although as to being more dangerous in Py3, that's much of a muchness (for instance, the single-slash division operator in Py2 can potentially truncate, but in Py3 it's always going to give you a float). But, fair point. I very much doubt it's as high as 10%, so yeah, that would be advantageous. Also, the performance hit is so small, and even that is in the very worst case (a homogeneous list with one different type at the end). I like changes that make stuff run faster. ChrisA

In my experience and/or estimation... well, two things:

A. Mixtures of floats/ints seem a lot more common than the 10% being thrown around. I don't even consider that bad practice currently (it might become bad practice if Elliot's patch is used, since keeping elements float-only would allow much faster sorting).

B. I think a very large percentage of lists are heterogeneous. But most of the time when they are, it's not because there are several different numeric types but rather because a list collects some sort of custom objects. Maybe those even all share some ancestor, but the point is they are user/library-defined classes that won't have a fast comparison like ints or floats do.

B (part 2): I haven't actually read his patch, but I'm assuming that Elliot's approach can start scanning the list, and as soon as it finds a complex/custom object at index 0, ignore the rest of the scan. So for that case, there is very little harm in his linear scan (over one item).

On Sun, Mar 5, 2017 at 7:09 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:20 PM David Mertz <mertz@gnosis.cx> wrote:
As long as the ob_type for all the objects is the same, my patch will sort significantly faster, as it replaces PyObject_RichCompareBool with ob_type->tp_richcompare. It doesn't matter if your list is builtin-types or not, as long as the types are all the same, you get a speedup (though it's greater for built-in types). Also, I actually wrote a crawler to see how many PyPI packages implement a custom compare function (like you would have to for user-defined types). The answer is less than 6%. Code: https://github.com/embg/python-fast-listsort/tree/master/crawler
Yup, the pre-sort check breaks out of the loop as soon as it sees heterogeneity. So unless your float is at the end of the whole list of ints (as in my worst-case benchmark), the cost is basically 0.

I see the benchmarks, and while I assume the asymptotic complexity is the same, is there a longer "start-up" time your optimizations need? Do you have benchmarks where you sort 10, 100...10**6 items to show that beyond the types you're sorting, you're not amortizing any increased overhead out to oblivion? On Sun, Mar 5, 2017 at 9:24 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:

On Sun, Mar 5, 2017 at 8:29 PM Nick Timkovich <prometheus235@gmail.com> wrote:
This is addressed in my post to the bug tracker with a perf benchmark (link in the first email in this thread). In short: the pre-sort check is really cheap, even for tiny lists, though it gets cheaper (percentage-wise) as the list gets bigger. You could also trivially modify the benchmark script I linked to in my original post to check for this.
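For what it's worth, one quick way to eyeball this yourself is to run the same timing loop under an unpatched and a patched build and compare (the sizes and repeat counts below are arbitrary, just a sketch):

    import random
    import timeit

    for n in (10, 100, 1000, 10**4, 10**5, 10**6):
        data = [random.random() for _ in range(n)]
        reps = max(1, 10**6 // n)
        t = timeit.timeit(lambda: sorted(data), number=reps)
        print("n={:>8}: {:.2f} us per sort".format(n, t / reps * 1e6))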

On 2017-03-06 03:09, Chris Angelico wrote:
Although it's true that both programmers and Python might treat 10 as functionally identical to 10.0, in practice the numbers that are being added to the list probably come from some code that returns integers /or/ floats, rather than a mixture.

On Sun, Mar 5, 2017 at 9:12 PM MRAB <python@mrabarnett.plus.com> wrote:
Yes, exactly. So we can see how the homogeneity assumption is a reasonable one to make; unless you're defining custom compares (uncommon), I don't see where you would ever be sorting a heterogeneous list except for the int/float case.

On Sun, Mar 05, 2017 at 07:19:43AM +0000, Elliot Gorokhovsky wrote:
I sometimes need to know if a list is homogeneous, but unfortunately checking large lists for a common type in pure Python is quite slow.

Here is a radical thought... why don't lists track their common type themselves? There are only a few methods which can add items:

- append
- extend
- insert
- __setitem__

Suppose we gave lists a read-only attribute, __type_hint__, which returns None for heterogeneous lists and the type for homogeneous lists. Adding an item to the list does as follows:

- if the list is empty, adding an item sets __type_hint__ to type(item);
- if the list is not empty, adding an item tests whether type(item) is identical to (not a subclass of) __type_hint__, and if not, sets __type_hint__ to None;
- removing an item doesn't change the __type_hint__ unless the list becomes empty, in which case it is reset to None;
- if the internal allocated space of the list shrinks, that triggers a recalculation of the __type_hint__ if it is currently None. (There's no need to recalculate the hint if it is not None.)

Optional: doing a list.sort() could also recalculate the hint.

The effect will be:

- if __type_hint__ is a type object, then you can be sure that the list is homogeneous;
- if __type_hint__ is None, then it might still be homogeneous, but it isn't safe to assume so.

Not only could sorting take advantage of the type hint without needing to do a separate O(N) scan of the list, but so could other code. I know I would be interested in using this. I have a fair amount of code that has to track the type of any items seen in a list, and swap to a "type agnostic but slow" version if the list is not homogeneous. I could probably replace that with some variation of:

    if thelist.__type_hint__ is None:
        process_slow(thelist)
    else:
        process_fast(thelist)

At the very least, I'd be interested in experimenting with this.

Thoughts?

-- Steve
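For anyone who wants to play with these semantics before touching C, here is a toy pure-Python mock-up of the rules above using a list subclass (the class name is made up, the attribute is writable here rather than read-only, and only append/pop are shown; extend/insert/__setitem__ would follow the same pattern as append):

    class HintedList(list):
        def __init__(self, iterable=()):
            super().__init__(iterable)
            self._recalculate_hint()

        def _recalculate_hint(self):
            if not self:
                self.__type_hint__ = None
                return
            first = type(self[0])
            self.__type_hint__ = first if all(type(x) is first for x in self) else None

        def append(self, item):
            if not self:
                self.__type_hint__ = type(item)          # first item sets the hint
            elif self.__type_hint__ is not type(item):   # identity check, not subclass
                self.__type_hint__ = None
            super().append(item)

        def pop(self, index=-1):
            value = super().pop(index)
            if not self:
                self.__type_hint__ = None                # removal only resets on empty
            return value

So HintedList([1, 2, 3]).__type_hint__ is int, and appending "x" flips it to None, as described.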

This is exactly what I would have done if I didn't think the changes would be too large-scale to ever be accepted (at least if proposed by someone without any experience). This is also what my friend suggested when I explained my project to him. In fact, this is exactly what dictionaries do: at each insert, they check whether a string is being inserted. If yes, they stay in a "string" state; if no, they switch to your None state.

Again, the only reason I did it differently is because I wanted to be realistic about what changes I could get accepted. My patch does everything inside the list sort function, and doesn't modify anything outside of that. So it's presumably much easier to review. I mean, I submitted my patch in November, and it still hasn't been accepted :) so presumably something as large-scale as this would have very little chance.

One more thing: changing mutating methods to check type would make them non-trivially slower; if you aren't resizing the list, appending should really just be as simple as storing a pointer and incrementing a counter. If we have to check type, that turns two simple things into three simple things, which could have significant performance implications. Most applications of lists *don't* care about type homogeneity (i.e. iterating through the list or whatever), so one would have to justify imposing this cost on all list use just to optimize for the special cases where you *do* care. I'm not sure how one would go about checking whether that trade-off is a good one or not: what is the distribution of list use across all Python code? It's not something you can really check statically (it's undecidable, if you're being pedantic). With my patch, it's easy: if you aren't sorting, you don't have to pay anything. If you are sorting, you either win or lose, based on whether or not you're sorting type-homogeneous data (which you basically always are).

If anything, I think my patch could be a starting point for later optimization along these lines, depending on whether the problem raised in the previous paragraph can be addressed.

What do you think?

On Sun, Mar 5, 2017 at 6:45 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Here is a radical thought... why don't lists track their common type themselves? There's only a few methods which can add items:
I had exactly the same thought. Lists would need to grow a new attribute, of course. I'm not sure how that would affect the object layout and word boundaries. But there might be free space for another attribute slot.

The real question is whether doing this is a win. On each append/mutation operation we would need to do a comparison to the __type_hint__ (assuming Steven's spelling of the attribute). That's not free. Balancing that, however, when we actually *did* a sort, it would be O(1) to tell if it was homogeneous (and also the actual type if yes) rather than O(N).

This question isn't really subject to microbenchmarks, though. If we added __type_hint__ as None/type-object and added those comparisons to it on .insert()/.append()/etc., then we would be slower by some increment while all we were doing was adding things. There could only be a win when the list is sorted (but a big win for that case). In real-world code, how often are lists sorted?

Also, I'm not really confident that Elliot's estimate of 95% of lists being homogeneous holds, but the speedup he proposes would seem to win even if that percentage is a lot lower than 95%. If only 10% of lists in real-world code ever get `my_list.sort()` called on them, Steven's idea is probably not good. If 50% of lists do, it probably is. But then, it depends just how *often* lists that get sorted are re-sorted too.

Yours, David...

P.S. I think that given that we can .append() then delete items to make a heterogeneous list become homogeneous again, Elliot's idea is somewhat orthogonal to Steven's. That is, even if we had .__type_hint__ on lists, it might be a "false None" and it could still be worth doing Elliot's linear scan anyway. On the other hand, the None-ness on non-empty lists might be a good enough predictor of heterogeneity in real-world code that the linear scan would almost always be a waste. I do not know without benchmarks against real codebases.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:09 PM David Mertz <mertz@gnosis.cx> wrote:
Exactly.
I would imagine that fewer than even 10% of lists in real world code ever get sorted. I mean, just crawl PyPI and look for `.sort()` or `sorted()`; you'll find it's not that common. I know, because I was hoping to be able to demonstrate non-trivial improvements on the benchmark suites, but they just don't sort enough for it to show up. You only see application-level speedups if you're doing a *lot* of sorting, like in DEAP Pareto selection (DEAP is a popular Python evolutionary algorithms library; you sometimes have to sort individuals in the population by fitness, and the population is huge). And the cost of doing the pre-sort check isn't that bad, anyway... I think that making *every* append slower just for sorting/etc would be a net loss. Anyway, like I said earlier, my patch could be a first step towards a more broad optimization like this.

On Sun, Mar 5, 2017 at 7:16 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
I think I must sort things a lot more often than most people :-). But also, it's not just the number of list objects that are sorted, but how often it's done. I could have a 10,000 line program with only one call to `my_list.sort()` in it... but that one line is something that is called inside an inner loop. This feels like it really needs profiling not just static analysis. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:23 PM David Mertz <mertz@gnosis.cx> wrote:
Totally -- but what I mean is if you look at the performance benchmark suite, for example, most of the benchmarks do not have the string "sort" in their source code (IIRC). I think that most applications don't spend most of their time sorting. That's not to say sorting isn't important -- I wouldn't have written my patch if I didn't think it is. I'm just saying it probably isn't a good idea to penalize *all* list use just to benefit the minority of list use that involves sorting.

On 06/03/17 03:08, David Mertz wrote:
I don't think anyone has mentioned this yet, but FWIW I think the 'type hint' may need to be tri-state: heterogeneous (NULL), homogeneous (the pointer to the type structure) and also "unknown" (a sentinel value - the address of a static char or something).

Otherwise, a delete operation on the list would need to scan the list to work out if it had changed from heterogeneous to homogeneous (unless it was acceptable that once heterogeneous, a list is always considered heterogeneous - i.e., delete always sets the hint to NULL). Instead, delete would change a NULL hint to the sentinel (leaving a valid type hint as it is) and then, prior to sorting - as the hint is being checked anyway - if it's the sentinel value, perform the pre-scan that the existing patch is doing to restore the knowledge of just what type of list it is.

I'd prefer the sort optimization to be based on what my list contains NOW, not on what it may have contained some time in the past, so I'm not a fan of the "once heterogeneous, always considered heterogeneous" behaviour if it's cheap enough to avoid it.

E.

On Tue, Mar 7, 2017 at 1:47 PM Erik <python@lucidity.plus.com> wrote:
Sure. Dictionaries actually don't implement this, though: as soon as they see a non-string key, they permanently switch to a heterogeneous state (IIRC).

I think the bigger problem, though, is that most list use does *not* involve sorting, so it would be a shame to impose the non-trivial overhead of type-checking on *all* list use. With dictionaries, you have to type-check inserts anyway (for equality testing), so it's less of a problem. But the fundamental list operations *don't* require type-checking currently, so why add it in? In practice, the pre-sort check is *very* cheap, and its cost is only imposed on sort usage.

Anyway, my patch could always be a precursor to a more general optimization along these lines. I'm almost finished fixing the problem Tim identified earlier in this thread; after that, it'll be ready for review!

Hi Elliot, On 07/03/17 21:10, Elliot Gorokhovsky wrote:
I'd be interested to know if this approach had been considered and rejected for dicts - but I think dicts are a bit of a special case anyway. Because they are historically a fundamental building block of the language (for name lookups etc) they are probably more sensitive to small changes than other objects.
Yes, I understand that issue - I just thought I'd mention something that hadn't been pointed out yet _IF_ the idea of a type hint were to be considered (that's the sub-thread I'm replying to). If you're not doing that, then fine - I just wanted to put down things that occurred to me so they were documented (if only for rejection).

So, while I'm at it ;), here are some other things I noticed scanning the list object source (again, only if a type hint was considered):

* What is the type hint of an empty list? (This probably depends on how naturally the code for all of the type hint checking deals with NULL vs "unknown".)
* listextend() - this should do the right thing with the type hint when extending one list with another.
* Several other methods ('contains', 'remove', 'count', 'index') also use PyObject_RichCompareBool(). They could also presumably benefit from the same optimisation (perhaps it's not all about sort() - perhaps this gives a little more weight to the idea).
Nice - good job. E.

On Tue, Mar 7, 2017 at 4:36 PM, Erik <python@lucidity.plus.com> wrote:
Good point about list.extend().

I don't think __type_hint__ could help with .__contains__() or .count() or .remove(). E.g.:

In [8]: from fractions import Fraction as F
In [9]: lst = [1.0, 2.0, 1+0j, F(1,1)]
In [10]: 1 in lst
Out[10]: True
In [11]: lst.count(1)
Out[11]: 3
In [12]: lst.index(1)
Out[12]: 0

The list has absolutely nothing of the right type. Yet it contains an item, counts things that are equal, and finds a position for an equal item.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Hi David, On 07/03/17 22:39, David Mertz wrote:
Sure, but if the needle doesn't have the same type as the (homogeneous) haystack, then the rich comparison would still need to be done as a fallback (and would produce the result you indicate). But if the needle and the homogeneous haystack have the _same_ type, then a more optimised version of the operation can be done. Regards, E.

On Tue, Mar 7, 2017 at 6:27 PM, Erik <python@lucidity.plus.com> wrote:
In [22]: class Eq(int):
    ...:     def __eq__(self, other):
    ...:         return True
    ...:

In [23]: four, five, six = Eq(4), Eq(5), Eq(6)
In [24]: lst = [four, five, six]
In [25]: lst.count(Eq(7))
Out[25]: 3

How would this work (other than saying "don't do that, it's perverse")?

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Tue, Mar 7, 2017 at 4:53 PM David Mertz <mertz@gnosis.cx> wrote:
There would be two needless checks in the equality testing. First, PyObject_RichCompareBool would check whether other's type is a subclass of self's type, in which case other->tp_richcompare would be used, provided it is non-null. Otherwise, we would check whether self->tp_richcompare is non-null, the check would pass, and we would call self.__eq__. See the flow chart on my poster (linked to in the first email on this thread).
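Roughly, in Python terms (a simplified sketch of that dispatch; the real logic lives in do_richcompare() in Objects/object.c, and this ignores details such as the identity shortcut in PyObject_RichCompareBool):

    def rich_compare_eq(v, w):
        # 1. If w's type is a proper subclass of v's type, try w's __eq__ first.
        if type(w) is not type(v) and issubclass(type(w), type(v)):
            result = w.__eq__(v)
            if result is not NotImplemented:
                return result
        # 2. Otherwise (or if that returned NotImplemented), try v's __eq__.
        result = v.__eq__(w)
        if result is not NotImplemented:
            return result
        # 3. Last resort for ==: fall back to identity.
        return v is w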

On Tue, Mar 07, 2017 at 09:10:00PM +0000, Elliot Gorokhovsky wrote:
I think the bigger problem, though, is that most list use does *not* involve sorting,
There are other uses for a type hint apart from sorting. There may be optimized versions of other functions (sum, math.fsum, ...) and list methods (e.g. count, remove) - anything that has to walk the list comparing items. All very hypothetical at the moment, I admit...
so it would be a shame to impose the non-trivial overhead of type-checking on *all* list use.
You might be right, but my guess is that the overhead isn't quite as big as you may think, at least for some operations. The type-check itself is just a pointer compare, not an issubclass() operation. The types are either identical, or the list is heterogeneous.

True, `mylist[i] = x` should be a quick assignment into an array of pointers, but there's a bounds check to ensure i is within the correct range. Other methods are even more expensive: `mylist[i:j] = [a...z]` has to move list items around, possibly allocate more memory, and update the list length; compared to that, the scan of a...z is probably minimal. `mylist.append(x)` is *usually* a simple assignment into the array (plus updating the list length), but every now and again it triggers a resize, which is costly.

I don't think we can predict whether this is a net gain or loss just from first principles.
Anyway, my patch could always be a precursor to a more general optimization along these lines.
Indeed! Even if nothing else comes of this than your patch, thank you! -- Steve

On Tue, Mar 07, 2017 at 08:46:45PM +0000, Erik wrote:
I thought about that and rejected it as an unnecessary complication. Heterogeneous and unknown might as well be the same state: either way, you cannot use the homogeneous-type optimization.

Part of the complexity here is that I'd like this flag to be available to Python code, not just a hidden internal state of the list. But my instinct for this could be wrong -- if anyone is interested to do some experiments on this, it might be that a three-state flag works out better in practice than two states.
In a later email, you corrected this: a delete operation need not touch the type hint (except when the last item is deleted, at which point it resets to None/unknown).

With a three-state flag, you can make a three-way decision:

If the flag is Unknown:
- do an O(N) scan to determine whether the list is homogeneous or heterogeneous, then choose the optimized or unoptimized routine;
if the flag is Heterogeneous:
- there is no point doing a scan, so always choose the unoptimized routine;
if the flag is a type:
- the list is definitely homogeneous, so depending on the type, you may be able to choose an optimized routine.

Compared to that, a two-state flag misses some opportunities to run the optimized routine:

- the list contains [1, 2, 3, 4, "foo"], so the hint is set to None;
- delete the last item;
- the list is now homogeneous but the hint is still None.

But it also avoids bothering with an O(N) scan in some situations where the list really is heterogeneous. So there's both an opportunity cost and a benefit. There may be other opportunities to do the scan, such as when the underlying pre-allocated array resizes, so even with the two-state flag, "unknown" need not stay unknown forever.
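In code, that three-way decision might look something like this (the flag values and routine names are purely illustrative):

    UNKNOWN = object()   # hypothetical sentinel for "hint not currently known"

    def unoptimized_sort(lst):
        lst.sort()

    def optimized_sort_for(tp):
        # A real implementation would return a type-specialized routine here.
        return unoptimized_sort

    def sort_with_hint(lst, hint):
        if hint is UNKNOWN:
            # Only the unknown state pays for the O(N) scan.
            first = type(lst[0]) if lst else None
            hint = first if lst and all(type(x) is first for x in lst) else None
        if hint is None:
            unoptimized_sort(lst)             # known heterogeneous: no scan, no gamble
        else:
            optimized_sort_for(hint)(lst)     # known homogeneous of type `hint`
        return hint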
I'd prefer the sort optimization to be based on what my list contains NOW, not on what it may have contained some time in the past,
Remember, we're talking about opportunities for applying an optimization here, nothing more. You're not giving up anything: at worst, the ordinary, unoptimized routine will run and you're no worse off than you are today.
It is not just a matter of the cost of tracking three states versus two. It is a matter of the complexity of the interface. I suppose this could be reported to Python code as None, False or the type. Although, I don't know about you, but I know I'll never be able to remember whether None means Unknown and False means Heterogeneous, or the other way around.

Ultimately, this is all very pie-in-the-sky unless somebody tests just how expensive this is and whether the benefit is worthwhile. That's not going to be me: I'm not able to hack on the list C code.

-- Steve

On 08/03/17 00:18, Steven D'Aprano wrote:
Knowing it's definitely one of two positive states and not knowing which of those two states it is is not the same thing when it comes to what one can and can't optimize cheaply :) It sort of depends on how cheaply one can track the states though ...
Part of the complexity here is that I'd like this flag to be available to Python code, not just a hidden internal state of the list.
Out of interest, for what purpose? Generally, I thought Python code should not need to worry about low-level optimisations such as this (which are C-Python specific AIUI). A list.is_heterogeneous() method could be implemented if it was necessary, but how would that be used?
O(N) is the worst case. Most of the anecdotal evidence in this thread so far seems to suggest that heterogeneous lists are not common. That may or may not be true. Empirically, for me, it is true. Who knows? (And therein lies the question.)
You are giving up a little bit - the extra overhead of checking all of this (which is the unknown factor we're all skirting around ATM) has a cost. So converting a previously-heterogeneous list to a homogeneous list via a delete or whatever has a benefit if the optimisations can then be applied to that list many times in the future (i.e., once it becomes recognised as homogeneous again, it benefits from optimised paths in the interpreter). And of course, all that depends on your use case. It might work out better for one application than another. As you quite rightly point out, it needs someone to measure the alternatives and work out if _overall_ it has a positive impact ...
I didn't think any of this stuff would come back to Python code (I thought we were talking about C-Python specific implementation only). How is this useful to Python code?
Ultimately, this is all very pie-in-the-sky unless somebody tests just how expensive this is and whether the benefit is worthwhile.
I agree. As I said before, I'm just pointing out things I noticed while looking at the current C code which could be picked up on if someone wants to try implementing and benchmarking any of this. It sort of feels like an argument, but I hope we're just violently agreeing on a generally shared goal ;) Regards, E.

On Wed, Mar 08, 2017 at 01:20:19AM +0000, Erik wrote:
I mentioned earlier that I have code which has to track the type of list items, and swaps to a different algorithm when the types are not all the same. In practice, I just run the "mixed types" algorithm regardless, because the cost of doing a scan of the items in pure Python is too expensive relative to the time saved. Moving some of the work into the C infrastructure might change that. I'm not completely sure that this would, in fact, be useful to me, but I'd like the opportunity to experiment. I could try using a list subclass, but again, the cost of doing the type-checking in Python instead of C is (I think) prohibitive. Nevertheless, I understand that the burden of proving the usefulness of this is on me. (Or anyone else that wants to argue the case.)
A list.is_heterogeneous() method could be implemented if it was necessary, but how would that be used?
I would prefer to get the list item's type:

    if mylist.__type_hint__ is float:
        # run optimized float version
        ...
    elif mylist.__type_hint__ is int:
        ...
    else:
        # unoptimized version
        ...
It is also the best and average case. Given a homogeneous list of N items, for some arbitrary N between 0 and ∞, you have to look at all N items before knowing that they're all the same type. So for the homogeneous case, the best, worst and average are identically O(N).

Given a heterogeneous list of N items where the first difference is found at index K, K can range from 1 through N-1. (By definition, it cannot be at index 0: you can only detect a difference in types after checking *two* items.) The worst case is that you don't find the difference until the last item, which is O(N). The best case is that you find the difference in position 1 and bail out early. On average, you will find the first difference halfway through the list, which makes it O(N/2), but that's just O(N). (Constant factors don't matter.)

If you want to call that O(1) for the best heterogeneous case, I won't argue, except to say that's rather against the spirit of Big Oh analysis in my opinion. I think it's more realistic to say it's O(N) across all combinations of best/worst/average and homogeneous/heterogeneous. But of course if you bail out early, the constant multiplier may differ.
That will strongly depend on where the data is coming from, but when it comes to sorting, randomly mixed types will usually fail, since comparisons between different types are generally illegal:

py> [1, "a"].sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() < int()

-- Steve

On 08/03/17 11:07, Steven D'Aprano wrote:
Hmmm. Yes, I guess if the expensive version requires a lot of isinstance() messing or similar for each element then it could be better to have optimized versions for homogeneous lists of ints or strings etc.
If you know the list is homogeneous then the item's type is "type(mylist[0])". Also, having it be a function call gives an obvious place to put the transition from "unknown" to known state if the tri-state hint approach was taken. Otherwise, that would have to be hooked into the attribute access somehow. That's for someone who wants to try implementing it to decide and propose though :) E.

Can you assume that the list is of type(list[0]) and use that type's optimised sort? But in the optimised sort code, check that the types are as required. If you hit an element that is not of the required type then fall back to the unoptimised sort.

So part of the list is sorted using optimised code and the sort is completed with the unoptimised code.

Is that semantically clean?

Barry

On Wed, Mar 8, 2017 at 2:14 PM Barry <barry@barrys-emacs.org> wrote:
Well, how would you tell if you've hit an element that is not of the required type? You'd have to check during every compare, right? And that's precisely what we're trying to avoid!

The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance, and in the very likely case that our list is homogeneous, switch to an optimized special-case compare function.

Even when we only do O(n) compares, my patch is still much faster (see benchmarks higher up in this thread). Why? Well, if you're doing the type checks during the compares, you're doing them across different function calls, with other stuff interspersed in between. So pipeline/branch prediction/caching is less able to optimize away the overhead of the safety checks (I don't know how CPUs work, but one of those is relevant here). With the pre-sort check, it's all in a single iteration through the list, and we're taking the same exact branch every time; it's much faster. Think iterating over a matrix row-wise versus iterating column-wise (not exactly relevant here since that's about cache, but that's the idea. Again, I don't really know how CPUs work).

So in short: no, we can't just check as we go, because that's what we're already doing. But we can optimize significantly by checking in advance. I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically.

Best
Elliot

On Wed, Mar 8, 2017 at 2:58 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance,
I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically.
hmm -- I know folks like to say that "the constants don't matter", but it seems they do in this case:

without pre-checking: O(nlogn)
with pre-checking, if homogeneous: O(n) + O(nlogn)

so the pre-checking only adds if you ignore the constants. But I'm assuming (particularly with locality and branch prediction and all that included) the constant to type-check is much smaller than the constant to compare two unknown types, so:

TC*n + KC*n*logn  vs  UC*n*logn

where:
TC -- constant to type-check
KC -- constant to compare a known type
UC -- constant to compare an unknown type

So if UC > TC/logn + KC, then this optimization makes sense. If UC > KC and UC >>> TC, then this all works out. But if n is HUGE, it may end up being slower (maybe more so with cache locality???). Which is why you need to profile all this carefully.

So far Elliot's experiments seem to show it works out well. Which doesn't surprise me for built-ins like float and int that have a native compare, but apparently also for more complex types? How about custom classes?

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
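Plugging made-up numbers into that inequality (the TC/KC/UC values below are purely illustrative, not measurements of anything):

    import math

    n = 10**6
    TC, KC, UC = 1.0, 2.0, 5.0          # hypothetical per-operation costs
    logn = math.log2(n)

    with_precheck = TC * n + KC * n * logn
    without_precheck = UC * n * logn
    print(with_precheck < without_precheck)   # True exactly when UC > TC/logn + KC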

It seems the trick for optimisation is to compare the type pointer of an object to see if it's the same as a type supported by the chosen optimised sort. It was not clear to me that you need to scan the list at the start to make sure it's homogeneous. Given that the type check is so cheap, will it slow the sort if you do the pointer check in the compare code? I am not suggesting you run the full-fat rich compare on each compare.
The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance, and in the very likely case that our list is homogeneous, switch to an optimized special-case compare function.
So you do O(nlogn)*2 pointer compares with my suggestion it seems? Which is smaller the O(n) pointer checks?
Even when we only do O(n) compares, my patch is still much faster (see benchmarks higher up in this thread). Why? Well, if you're doing the type checks during the compares, you're doing them across different function calls, with other stuff interspersed in between. So pipeline/branch prediction/caching is less able to optimize away the overhead of the safety checks (I don't know how CPUs work, but one of those is relevant here). With the pre-sort check, it's all in a single iteration through the list, and we're taking the same exact branch every time; it's much faster. Think iterating over a matrix row-wise versus iterating column-wise (not exactly relevant here since that's about cache, but that's the idea. Again, I don't really know how CPUs work).
Provided the code is small I think both versions of the algorithm will benefit from cache and branch prediction.
So in short: no, we can't just check as we go, because that's what we're already doing. But we can optimize significantly by checking in advance. I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically.
I'm not clear this is true. But I have not read the code.

Barry

On Thu, Mar 9, 2017 at 3:04 AM Barry Scott <barry@barrys-emacs.org> wrote:
So you do O(nlogn)*2 pointer compares with my suggestion it seems? Which is smaller the O(n) pointer checks?
Not sure what you mean here... pretty sure your inequality is backwards?

Look -- the point is, we already *do* the pointer checks during every compare. That's the current implementation. I believe my benchmarks are sufficient to prove that my patch is much faster than the current implementation. Which makes sense, because we do one type check per object, as opposed to something like log n per object, and we do them all at once, which I do believe is faster than doing them across calls.

Now, you're not entirely wrong -- the current implementation doesn't do the type checks as efficiently as it could. Specifically, it does them once in PyObject_RichCompareBool, and then *again* in ob_type->tp_richcompare (which it has to for safety: ob_type->tp_richcompare doesn't know the arguments have already been type-checked!). You could totally define optimized compares that do the type checks more efficiently, and you would see a benefit, just not nearly as extreme as the benefit I demonstrate.

Thanks again for your feedback!
Elliot

On 9 March 2017 at 21:04, Barry Scott <barry@barrys-emacs.org> wrote:
Isn't there already always a scan of the iterable to build the keys array for sorting (even if no key keyword param is specified)? In which case adding the homogeneity check there seems like it shouldn't add much overhead at all (I have to say that I was surprised by 10+% reductions in speed in some of the heterogeneous TimSort tests for this reason).

And could specific richcompares be refactored so there was a "we really know what the types are, no need to check" version available to sort() (with the type-checking version available for general use/unoptimised sorting)?

Tim Delaney

[Tim Delaney <timothy.c.delaney@gmail.com>]
Isn't there already always a scan of the iterable to build the keys array for sorting (even if no key keyword param is specified)?
No - `.sort()` is a list method, and as such has nothing to do with arbitrary iterables, just lists (perhaps you're thinking of the `sorted()` function?). If no `key=` argument is passed, the list guts itself is used (as-is) as the vector of keys.
Those are worst cases where the current sort does very few compares (like just N-1 for a list of length N). Because they do "amazingly" few compares, they're already "amazingly" fast. And they also do little data movement (e.g., none at all for /sort & =sort, and N//2 pointer swaps for \sort). Because of that, any new O(N) overhead would make them significantly slower - unless the new overhead pays off by allowing a larger time saving than it costs (as it does when the list is same-type).

There is a "natural" place to insert "same type?" checks: the outer loop of the sort marches over the vector once, left to right, alternately identifying the next natural run, then possibly extending it and/or merging it into previous runs. The checks could be put there instead, but the code would be ugly and more invasive - I wouldn't even bother trying it.
They're not _just_ type-aware. For example, the special function for ints is specialized to integers that happen to fit in one (internal) "digit", and the special function for strings is specialized to those that happen to be stored in PyUnicode_1BYTE_KIND format. Adding such stuff to the public C API would be ... well, for a start, tedious ;-)

On Thu, Mar 9, 2017 at 2:04 AM, Barry Scott <barry@barrys-emacs.org> wrote:
I think you have a point here. IIUC, the current code makes no assumptions about type homogeneity. So it must do something generic at each compare. Perhaps that generic thing (of course, Elliot knows what it is) does do a pointer compare, and then something smart and fast for some built-ins. But there are still a few steps:

- are these both the same type?
- what type are they?
- how do I compare them?
- do the compare

These may well all be fast, but it's still a few steps, and they have to be done O(n*log(n)) times.

IIUC, Elliot's patch does something like:

First, go through the whole list and do:
- what is the type of the first item? (only once)
- how do I compare that type? (only once)
- is everything else in the list the same type? ( O(n) )

Then the actual sort:
- do the compare ( O(n*log(n)) )

So there is one operation done n times, and one done n*log(n) times, rather than 4 done n*log(n) times. If the constants are about the same, that saves 3*n*log(n) - n operations, or -- if n is non-trivial -- about 3*n*log(n) operations, so we go from 4*n*log(n) to n*log(n): about a 4 times speedup, in theory. With all the other complications of computer performance, only profiling will tell you! (Also, it's not 4 times on the whole sort -- just the compare part -- you still need to shuffle the values around, which presumably takes at least as long as each of the above operations.)

(Not sure about all four of those steps; it may be only three -- but still a 3 times speedup.)

Now Barry's idea: assume that the list is homogeneous, and crap out if it turns out it's not.

Start off:
- what is the type of the first item? (only once)
- how do I compare that type? (only once)

Do the sort:
- is the next item the correct type? ( O(n*log(n)) )
- do the compare ( O(n*log(n)) )

So we now have 2 operations to be run O(n*log(n)) times -- not as good as only one, but still better than the 3 or 4 of the fully general sort. And this would have less impact on the non-homogeneous case. Though Elliot points out that this would be much harder to branch-predict, etc., so maybe not. Maybe worth trying out and profiling??

NOTE: I really have no idea what really goes on in that code -- so I may have this all wrong...

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Sun, Mar 5, 2017 at 6:45 PM, Steven D'Aprano <steve@pearwood.info> wrote:
For what it's worth, I suggested this a LONG time ago -- well before Python-ideas existed... I thought a "homogeneous sequence" could be really useful for all sorts of optimizations. (At the time I was writing C extensions, and often converting a lot of lists to numpy arrays -- which ARE homogeneous sequences.)

Anyway, it was roundly rejected by Guido and others; no one had any interest in the idea. But maybe now that there is a compelling use case for the built-in object, the time is right??

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Mar 6, 2017 at 4:45 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I can also imagine other places where knowing those one or two bits of information about homogeneity might potentially allow speedups: converting lists or tuples to numpy arrays, min, max, sum etc. If this extra one or two bits of information were tracked, and the overhead of doing that was very small, then operations over the collections could benefit regardless of the complexity of the algorithm, so also O(n) operations. Too bad that one common thing done with lists – iterating – does not have obvious benefits from type homogeneity in CPython. —Koos
-- + Koos Zevenhoven + http://twitter.com/k7hoven +

[Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com>]
(Summary of results: my patch at https://bugs.python.org/issue28685 makes list.sort() 30-50% faster in common cases, and at most 1.5% slower in the uncommon worst case.)
Would someone please move the patch along? I expect it's my fault it's languished so long, since I'm probably the natural person to review it, but I've been buried under other stuff. But the patch doesn't change anything about the sorting algorithm itself - even shallow knowledge of how timsort works is irrelevant. It's just plugging in a different bottom-level object comparison _function_ when that appears valuable.

I've said from the start that it's obvious (to me ;-) ) that it's an excellent tradeoff. At worst it adds one simple (pre)pass over the list doing C-level pointer equality comparisons. That's cheap. The worst-case damage is obviously small, the best-case gain is obviously large, and the best cases are almost certainly far more common than the worst cases in most code.

In reviewing my own code, after it was fiddled to work under Python 3 there are no mixed-type lists that are ever sorted. There are lists with complex objects, but whenever those are sorted there's a `key=` argument that reduces the problem to sorting ints or tuples of builtin scalar types. I don't care about anyone else's code ;-)

One subtle thing to look at: thread safety. IIRC, the patch plugged the comparison function into a file global. That's obviously hosed if multiple threads sort different kinds of lists simultaneously.

On Sun, Mar 5, 2017 at 9:25 PM Tim Peters <tim.peters@gmail.com> wrote:
Thank you so much for the support! Yes to all of those things!
I'm adding that quote to the next version of my poster :)
Wow, that is a *very* good point. I never think about those kinds of things, being a n00b, so thanks for catching that... I'll have to go in and fix that. I'm not sure how, though, because the ISLT macro gets called in a bunch of different functions. The only way I can think of to fix it would be to pass the function pointer as an argument to *all* the functions that use ISLT, which would be pretty messy. What do you think would be the easiest fix? Thanks! Elliot

On Sun, Mar 5, 2017 at 8:33 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
Could we make the file global a table of comparison functions and have each thread reference a position in the table? It would be fine if multiple threads happened to use the position of e.g. the int comparison, just so long as each chose the right slot in the table. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. On Sun, Mar 5, 2017 at 8:33 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:

On Sun, Mar 5, 2017 at 10:25 PM David Mertz <mertz@gnosis.cx> wrote:
I don't think that would solve anything. The problem is that the pre-sort check is carried out in the main sort function, but then lots of smaller functions need to know which compare to use. Since the scope isn't shared, then, how can we inform the smaller functions of which compare to use? I solved this problem by just putting the compare function pointer in global scope. Obviously, however, this is not thread-safe. So the question is, how can the smaller functions be made aware of the results of the pre-sort check? Your proposal would merely replace the problem of communicating the function pointer with the problem of communicating the appropriate index into the function table. The smaller functions wouldn't know which index they're supposed to use unless you passed it in. However, I believe the following *would* work: Suppose it were possible to access an identifier unique to each thread, such as that returned by gettid() on Linux. Then maintain a global table of (unique_thread_id, compare_func) pairs, and simply have the ISLT macro index into the table! A bit of a performance hit, sure, but hopefully not that bad? Certainly better than passing it in every time... the problem is, how can we get a unique identifier for the thread in a platform-independent way? Any ideas?
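[Editor's illustration: a minimal Python sketch of the per-thread table idea described above. All names are hypothetical, and threading.get_ident() stands in for whatever thread id the C code would use (gettid() on Linux was the suggestion).]

    import threading

    _compare_funcs = {}   # thread id -> compare function chosen by the pre-sort check

    def set_compare(func):
        _compare_funcs[threading.get_ident()] = func

    def islt(x, y):
        # stands in for the ISLT macro: one extra table lookup on every compare
        return _compare_funcs[threading.get_ident()](x, y)

A thread would call set_compare(...) from the pre-sort check, and then every compare pays the lookup -- which is exactly the performance hit being weighed here.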

On Sun, Mar 5, 2017 at 10:45 PM Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
the problem is, how can we get a unique identifier for the thread in a platform-independent way? Any ideas?
Oh, I could probably just copy over code from threading.get_ident()... not sure if the key-value table is a good solution, though.

Another solution: check if there is more than one thread; if there is, then disable optimization. Is sorting in multithreaded programs common enough to warrant adding the complexity to deal with it? On Sun, Mar 5, 2017 at 10:52 PM Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:

On Sun, Mar 5, 2017 at 11:21 PM Jelle Zijlstra <jelle.zijlstra@gmail.com> wrote:
Right, of course. So, clearly, the only safe solution is to just keep everything in local scope and pass the compare function pointer into every function that calls ISLT or IFLT. Too bad. I'll rewrite the patch to implement this and open a new issue on the bug tracker (and close my current issue).

On Mon, Mar 6, 2017 at 3:28 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I think there is another safe solution: give up unsafe_object_compare. The compare functions for long, float, and unicode must not call list.sort(), and must not release the GIL. So all you need is to back up the old compare function before the sort and restore it after the sort.

On Sun, Mar 5, 2017 at 11:39 PM INADA Naoki <songofacandy@gmail.com> wrote:
That would definitely work, but I think it's too much of a sacrifice for these reasons:
(1) You would have to give up on optimizing tuples (because tuple compares call PyObject_RichCompareBool, which could itself call arbitrary code). That's a real shame, because sorting tuples is common (e.g., in DEAP, you sort tuples of floats representing the fitnesses of individuals). The alternative would be to only optimize tuples of long/string/float. However, I think the code complexity of *that* would be greater than the, admittedly messy, solution of passing in the compare everywhere it's needed.
(2) There are lots of types, e.g. bytes, that would fall through the cracks.
(3) Correct me if I'm wrong, but this would make us depend on the GIL, right? And we don't want to depend on that if we don't have to, right?
It seems to me that the only solution here is to go in and add a compare function pointer to each function call. Extremely unfortunate, but necessary.

[Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com>]
Not a solution. Not even close ;-) Even if it made good sense, there's nothing to stop a custom __lt__ method from creating new threads _during_ a sort. The best approach is indeed to pass the function pointer to every location that needs it. Note that a MergeState struct is already created per sort invocation; it isn't a file global for much the same reason. However, it's not just threads that are a potential problem. Suppose a custom __lt__ method, invoked during a sort, does a new sort of its own. That's in the same thread, but may well want a different specialized comparison function. Solve _that_, and "the thread problem" will almost certainly solve itself by magic too. But solving "the thread problem" doesn't necessarily solve "the same-thread reentrancy problem". That's why the MergeState struct is a function local ("auto" in silly C terminology). Since it lives in fresh stack space for each invocation of `listsort()`, it solves both the thread and reentrancy problems: every invocation of `listsort()` (regardless of whether from different threads or from the same thread) gets its own MergeState space. You may or may not get simpler code by storing the function pointer as a new member of the MergeState struct. But however that's spelled, it does need to be passed to each function that needs it.
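[Editor's illustration: a toy Python sketch of the "pass it to each function that needs it" shape Tim describes. The real change would be to the C helpers (or a new MergeState member); the function and parameter names below are made up.]

    # Every helper that needs ISLT takes the chosen compare as a parameter
    # instead of reading it from a global.
    def insertion_sort(keys, islt):
        for i in range(1, len(keys)):
            x, j = keys[i], i
            while j > 0 and islt(x, keys[j - 1]):
                keys[j] = keys[j - 1]   # shift larger items right
                j -= 1
            keys[j] = x

    data = [3.0, 1.0, 2.0]
    insertion_sort(data, islt=lambda a, b: a < b)   # compare chosen by the caller

Since the compare travels with the invocation (like MergeState does), nested sorts and sorts in other threads each get their own, which is the point Tim is making.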

[Chris Angelico <rosuav@gmail.com>]
CPython is full of defensive code protecting against malicious crap. That's why it rarely crashes ;-)

    def __lt__(self, other):
        return self.size < other.size

Looks harmless? Can't tell! For all we know, there are proxy objects, and other.__getattr__ invokes some elaborate library to open a socket in a new thread to fetch the value of `size` over a network.

On Mon, Mar 6, 2017 at 5:52 PM, Tim Peters <tim.peters@gmail.com> wrote:
Exactly. It's always fun to discover some nice tidy exception that cleanly copes with a ridiculous situation.

    def gen():
        yield next(g)
    g = gen()
    next(g)

Fortunately in this case, the solution isn't to say "SystemError: cannot create threads while sorting", but even if that were the case, I can't imagine that much production code would be stopped by it. ChrisA

On Sun, Mar 5, 2017 at 11:31 PM Tim Peters <tim.peters@gmail.com> wrote:
Right. It's a real shame, because the patch as it stands right now is extremely surgical, but there's clearly no way around it. There are some functions that don't take in MergeState (e.g. gallop_left) but that call ISLT/IFLT, so I think I'll just add a compare function pointer parameter to all the function calls. I mean, the diff will be a lot hairier, which might make the patch more difficult to review, but when it comes down to it the code complexity won't really increase, so overall I don't think this is the end of the world. Thanks so much for pointing this out! P.S. Is it OK if I close my current issue on the bug tracker and open a new issue, where I'll post the revised patch? The writing on my current issue uses the old, less-rigorous benchmarks, and I feel it would be less confusing if I just made a new issue and posted the correct benchmarks/discussion at the top. The current issue doesn't have many comments, so not much would be lost by just linking to it from the new issue. If this violates bug tracker etiquette, however, :) I'll just post the revised patch on my current issue.

On 3/6/2017 1:56 AM, Elliot Gorokhovsky wrote:
That would be normal. Issues often get revised patches, sometimes with more severe changes than this. Or people post competing patches. Do reference this thread, and quote Tim's approval in principle, if he did not post on the tracker. -- Terry Jan Reedy

Just submitted a PR implementing this: https://github.com/python/cpython/pull/582 -- just need someone to review it now :) Thanks for all your feedback, everyone! On Sun, Mar 5, 2017 at 12:19 AM Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:

Hi. I may be way off-base here, but having scanned the patch I'm not sure I agree that it's the right way forward. What seems to be happening is that the homogeneity of the list is determined somehow (whether tracked with a hint or scanned just-in-time) and then a specific comparison function for a known subset of built-in types is selected if appropriate. I had assumed that there would be an "apples-to-apples" comparison function in the type structure and that the patch was simply tracking the list's homogeneity in order to enter a (generic) alternative loop to call that function over PyObject_RichCompare(). Why is that not the case? When a new C-level type is introduced (either a built-in or an extension module), why does the list object's code need to know about it in order to perform this optimisation? Why is there not a "tp_apple2apple" slot in the type structure which higher level functions (including the RichCompare() stuff - the first thing that function does is check the type of the objects anyway) can call if it determines that the two objects have the same type? Such a slot would also speed up "contains", "count", etc (for all classes) with no extra work, and no overhead of tracking or scanning the sequence's homogeneity. E.

There is an "apple to apple" compare function. It's unsafe_object_compare. If you look at the pre-sort check, you will find that if the list is homogeneous but not of float, int, string, or tuple, it sets compare_funcs.key_richcompare = ob_type->tp_richcompare and sets compare_funcs.key_compare = unsafe_object_compare. The latter is a wrapper for the former, bypassing the type checks in PyObject_RichCompareBool. Examples of benchmarks that use this functionality include the non-latin string and unbounded int benchmarks. In the table at the bottom of my patch description, it's described as follows: Compare function for general homogeneous lists; just a wrapper for ob_type->tp_richcompare, which is stored by the pre-sort check at compare_funcs.key_richcompare. This yields modest optimization (neighbourhood of 10%), but we generally hope we can do better. Further, in the code, the comments describe it as follows: /* Homogeneous compare: safe for any two compareable objects of the same type. * (compare_funcs.key_richcompare is set to ob_type->tp_richcompare in the * pre-sort check.) */ Does that answer your question? On Thu, Mar 9, 2017 at 6:18 PM Erik <python@lucidity.plus.com> wrote:

On Sat, Mar 4, 2017 at 11:19 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
Thanks for the hard work, this looks very promising.
This point seems wrong on its face. The main point of Timsort is to avoid comparisons in mostly-sorted lists. It makes a lot of difference whether the initial data is random or mostly-sorted. I presume the type homogeneity check takes a fixed O(N) time to perform. In the random case, the sort will take O(N * log N) comparisons. As a multiplier, naive comparisons are clearly more expensive than a type check. But whether that type-checking overhead is worthwhile obviously depends on the number of comparisons, which might be O(N) if the data is sorted. Real-world data tends to be mostly-sorted. So it would be useful for your benchmarks to include:
A) Performance on completely sorted data
   i) Of homogeneous type
   ii) Of mixed type
B) Performance on mostly-sorted data
   i) Of homogeneous type
   ii) Of mixed type
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 10:51 AM David Mertz <mertz@gnosis.cx> wrote:
Thanks for the hard work, this looks very promising.
Thank you!
You are entirely correct, as the benchmarks below demonstrate. I used the benchmark lists from Objects/listsort.txt, which are:
\sort: descending data
/sort: ascending data
3sort: ascending, then 3 random exchanges
+sort: ascending, then 10 random at the end
%sort: ascending, then randomly replace 1% of elements w/ random values
~sort: many duplicates
=sort: all equal
My results are below (the script can be found at https://github.com/embg/python-fastsort-benchmark/blob/master/bench-structur... ):
Homogeneous ([int]):
\sort: 54.6%
/sort: 56.5%
3sort: 53.5%
+sort: 55.3%
%sort: 52.4%
~sort: 48.0%
=sort: 45.2%
Heterogeneous ([int]*n + [0.0]):
\sort: -17.2%
/sort: -19.8%
3sort: -18.0%
+sort: -18.8%
%sort: -10.0%
~sort: -2.1%
=sort: -21.3%
As you can see, because there's a lot less non-comparison overhead in the structured lists, the impact of the optimization is much greater, both in performance gain and in worst-case cost. However, I would argue that these data do not invalidate the utility of my patch: the probability of encountering a type-heterogeneous list is certainly less than 5% in practice. So the expected savings, even for structured lists, is something like (5%)(-20%) + (95%)(50%) = 46.5%. And, of course, not *all* the lists one encounters in practice are structured; certainly not *this* structured. So, overall, I would say the numbers above are extremely encouraging. Thanks for pointing out the need for this benchmark, though! Elliot
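[Editor's note: the expected-savings figure above is just a weighted average; a one-line Python version, with the heterogeneity probability left as a knob (5% is only the guess used in the text):]

    def expected_saving(p_hetero, hetero_cost=-0.20, homo_gain=0.50):
        return p_hetero * hetero_cost + (1 - p_hetero) * homo_gain

    print(round(expected_saving(0.05), 3))   # 0.465, i.e. roughly 46.5%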

Good job. I'll read your work later.
Please notice that dict is very, very special. Since Python is a dynamic language, name resolution is done at runtime, not compile time, and most name resolutions look up a string key in a dict. So please don't use string-key dict specialization to rationalize list-sort specialization. Each specialization should be considered carefully, weighing the maintenance cost against the benefit. I feel 2 (int, string) or 3 (int, string, tuple) special case may be worth enough if code is clean and benefit seems good. But more complex cases can be done in third-party libraries, like numpy. In a third-party library you can use more powerful optimizations like NaN-boxing. In the Python builtin list this is hard, because we guarantee that object identity is not changed. Regards,

On Sun, Mar 5, 2017 at 6:30 PM INADA Naoki <songofacandy@gmail.com> wrote:
Good job. I'll read your work later.
Thanks! So please don't use string-key dict specialization to rationalize
list-sort specialization.
I'm not quite doing that: I agree that dict is a special case because of internal use. I'm just saying that I'm not the first person to try to exploit type homogeneity in data structures in CPython. However, my justification is independent of dict specialization. Namely, it's the following two considerations:
(1) Lists are idiomatically type-homogeneous. To quote the Python documentation, "Lists are mutable, and *their elements are usually homogeneous* and are accessed by iterating over the list".
(2) It's not very common to have to "compare apples and oranges". While it's technically possible to define comparison between any two types you like, and one sometimes compares e.g. ints and floats, in practice it's pretty safe to assume the lists you're sorting are going to be type-homogeneous 95% or 99% of the time.
I feel 2 (int, string) or 3 (int, string, tuple) special case may be worth
enough if code is clean and benefit seems good.
I agree -- I only optimized for int, string, float, and tuple. However, my code optimizes sorting for *any* type-homogeneous list as well, by simply replacing PyObject_RichCompareBool with ob_type->tp_richcompare, saving the method lookup time. This is what gives the speedups for non-latin strings and bigints in the benchmarks I shared in my original post: I don't have a special-case compare function for them, but by setting the compare function pointer equal to ob_type->tp_richcompare, I can at least save a little bit of method lookup time. In terms of clean code, the special-case compare functions include assertions in #ifdef Py_DEBUG blocks that list all the requirements necessary to make them safe. The pre-sort check then verifies the assumptions for whichever optimized compare is going to be used. So all that's necessary to verify correctness is to convince oneself that (1) the assertions are sufficient and (2) the pre-sort check will never use a compare function for which it cannot prove the assertions. These two points are pretty easy to check:
(1) Just plug the assertions into the original code, e.g. unicode_compare in Objects/unicodeobject.c. Then you have a bunch of if(1) and if(0) blocks, and if you delete the if(0) blocks you'll get exactly what I have in the patch.
(2) The pre-sort check code is very simple, and well commented.
I would add further that this patch only actually adds one line of code to listsort: assign_compare_function(lo.keys, saved_ob_size). The rest of the patch is just function definitions, and replacing PyObject_RichCompareBool with a function pointer (in the macro for ISLT(X,Y)). Anyway, thanks in advance for your time! Elliot
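[Editor's illustration: a pure-Python model of the pre-sort check described above. The real check lives in C inside listsort(); every name below is illustrative rather than taken from the patch.]

    import operator

    def choose_compare(keys):
        """Return a specialized less-than, or None meaning 'use the general path'."""
        if not keys:
            return None
        t = type(keys[0])
        for k in keys[1:]:
            if type(k) is not t:        # heterogeneous: give up immediately
                return None
        # Rough analogues of the special-case compares; a homogeneous list of
        # any other type would still get the "call tp_richcompare directly"
        # treatment in the real patch.
        specialized = {int: operator.lt, float: operator.lt, str: operator.lt}
        return specialized.get(t)

The early return on the first type mismatch is why a heterogeneous list usually costs almost nothing extra; only the "one float at the very end of a list of ints" case pays for the whole scan.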

On Mon, Mar 6, 2017 at 12:53 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I would be rather curious to know how frequently a list consists of "numbers", but a mix of ints and floats. From the point of view of a list's purpose, they're all numbers (point 1 satisfied), and they're all comparable (point 2 satisfied), but from the POV of your patch, it's heterogeneous and suffers a performance penalty. Does it happen a lot in real-world code? (Apologies if this has already been mentioned; I've been skimming the thread, not reading it in detail.) ChrisA

On Sun, Mar 5, 2017 at 7:50 PM Chris Angelico <rosuav@gmail.com> wrote:
This is of course undecidable to verify statically, so we can't just crawl PyPI... however, I would argue that using mixed float-int lists is dangerous, and is more dangerous in Python 3 than in Python 2. So hopefully this is not very common. However, even if 10% (surely a vast overestimate) of sort calls are to mixed int-float lists, my patch would still yield a significant savings on average. (Apologies if this has already been mentioned; I've been skimming the
thread, not reading it in detail.)
It hasn't been mentioned in this thread :)

On Mon, Mar 6, 2017 at 2:03 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I agree that it's dangerous, but it is still common for programmers and Python alike to treat 10 as functionally identical to 10.0 - although as to being more dangerous in Py3, that's much of a muchness (for instance, the single-slash division operator in Py2 can potentially truncate, but in Py3 it's always going to give you a float). But, fair point. I very much doubt it's as high as 10%, so yeah, that would be advantageous. Also, the performance hit is so small, and even that is in the very worst case (a homogeneous list with one different type at the end). I like changes that make stuff run faster. ChrisA

In my experience and/or estimation... well, two things:
A. Mixtures of floats/ints seem a lot more common than the 10% being thrown around. I don't even consider that bad practice currently (it might become bad practice if Elliot's patch is used, since keeping elements float-only would allow much faster sorting).
B. I think a very large percentage of lists are heterogeneous. But most of the time when they are, it's not because there are several different numeric types but rather because a list collects some sort of custom objects. Maybe those even all share some ancestor, but the point is they are user/library-defined classes that won't have a fast comparison like ints or floats do.
B (part 2): I haven't actually read his patch, but I'm assuming that Elliot's approach can start scanning the list, and as soon as it finds a complex/custom object at index 0, ignore the rest of the scan. So for that case, there is very little harm in his linear scan (over one item).
On Sun, Mar 5, 2017 at 7:09 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:20 PM David Mertz <mertz@gnosis.cx> wrote:
As long as the ob_type for all the objects is the same, my patch will sort significantly faster, as it replaces PyObject_RichCompareBool with ob_type->tp_richcompare. It doesn't matter if your list is builtin-types or not, as long as the types are all the same, you get a speedup (though it's greater for built-in types). Also, I actually wrote a crawler to see how many PyPI packages implement a custom compare function (like you would have to for user-defined types). The answer is less than 6%. Code: https://github.com/embg/python-fast-listsort/tree/master/crawler
Yup, the pre-sort check breaks out of the loop as soon as it sees heterogeneity. So unless your float is at the end of the whole list of ints (as in my worst-case benchmark), the cost is basically 0.

I see the benchmarks, and while I assume the asymptotic complexity is the same, is there a longer "start-up" time your optimizations need? Do you have benchmarks where you sort 10, 100...10**6 items to show that beyond the types you're sorting, you're not amortizing any increased overhead out to oblivion? On Sun, Mar 5, 2017 at 9:24 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:

On Sun, Mar 5, 2017 at 8:29 PM Nick Timkovich <prometheus235@gmail.com> wrote:
This is addressed in my post to the bug tracker with a perf benchmark (link in the first email in this thread). In short: the pre-sort check is really cheap, even for tiny lists. Though it gets cheaper (percent-wise) as you get bigger. You could also trivially modify the benchmark script I linked to in my original post to check for this.

On 2017-03-06 03:09, Chris Angelico wrote:
Although it's true that both programmers and Python might treat 10 as functionally identical to 10.0, in practice the numbers that are being added to the list probably come from some code that returns integers /or/ floats, rather than a mixture.

On Sun, Mar 5, 2017 at 9:12 PM MRAB <python@mrabarnett.plus.com> wrote:
Yes, exactly. So we can see how the homogeneity assumption is a reasonable one to make; unless you're defining custom compares (uncommon), I don't see where you would ever be sorting a heterogeneous list except for the int/float case.

On Sun, Mar 05, 2017 at 07:19:43AM +0000, Elliot Gorokhovsky wrote:
I sometimes need to know if a list is homogeneous, but unfortunately checking large lists for a common type in pure Python is quite slow. Here is a radical thought... why don't lists track their common type themselves? There's only a few methods which can add items:
- append
- extend
- insert
- __setitem__
Suppose we gave lists a read-only attribute, __type_hint__, which returns None for heterogeneous lists and the type for homogeneous lists. Adding an item to the list does as follows:
- if the list is empty, adding an item sets __type_hint__ to type(item);
- if the list is not empty, adding an item tests whether type(item) is identical to (not a subclass of) __type_hint__, and if not, sets __type_hint__ to None;
- removing an item doesn't change the __type_hint__ unless the list becomes empty, in which case it is reset to None;
- if the internal allocated space of the list shrinks, that triggers a recalculation of the __type_hint__ if it is currently None. (There's no need to recalculate the hint if it is not None.)
Optional: doing a list.sort() could also recalculate the hint. The effect will be:
- if __type_hint__ is a type object, then you can be sure that the list is homogeneous;
- if the __type_hint__ is None, then it might still be homogeneous, but it isn't safe to assume so.
Not only could sorting take advantage of the type hint without needing to do a separate O(N) scan of the list, but so could other code. I know I would be interested in using this. I have a fair amount of code that has to track the type of any items seen in a list, and swap to a "type agnostic but slow" version if the list is not homogeneous. I could probably replace that with some variation of:

    if thelist.__type_hint__ is None:
        process_slow(thelist)
    else:
        process_fast(thelist)

At the very least, I'd be interested in experimenting with this. Thoughts? -- Steve
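[Editor's illustration: the bookkeeping above can be modelled with a pure-Python subclass. The real proposal would live in listobject.c; the class and method names here are made up purely to show the rules in action.]

    class HintedList(list):
        def __init__(self, iterable=()):
            super().__init__(iterable)
            self.__type_hint__ = self._scan()

        def _scan(self):
            # None for empty or heterogeneous, the common type otherwise
            if not self:
                return None
            first = type(self[0])
            return first if all(type(x) is first for x in self) else None

        def _note(self, item):
            if len(self) == 0:
                self.__type_hint__ = type(item)
            elif type(item) is not self.__type_hint__:   # identity, not subclass
                self.__type_hint__ = None

        def append(self, item):      # extend/insert/__setitem__ would do the same
            self._note(item)
            super().append(item)

Delete operations and the resize-triggered rescan are left out to keep the sketch short.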

This is exactly what I would have done if I didn't think the changes would be too large-scale to ever be accepted (at least if proposed by someone without any experience). This is also what my friend suggested when I explained my project to him. In fact, this is exactly what dictionaries do: at each insert, they check whether a string is being inserted. If yes, they stay in a "string" state; if no, they switch to your None state. Again, the only reason I did it differently is that I wanted to be realistic about what changes I could get accepted. My patch does everything inside the list sort function, and doesn't modify anything outside of that. So it's presumably much easier to review. I mean, I submitted my patch in November, and it still hasn't been accepted :) so presumably something as large-scale as this would have very little chance. One more thing: changing mutating methods to check type would make them non-trivially slower; if you aren't resizing the list, appending should really just be as simple as storing a pointer and incrementing a counter. If we have to check type, that turns two simple things into three simple things, which could have significant performance implications. Most applications of lists *don't* care about type homogeneity (i.e. iterating through the list or whatever), so one would have to justify imposing this cost on all list use just to optimize for the special cases where you *do* care. I'm not sure how one would go about checking if that trade-off is a good one or not: what is the distribution of list use across all Python code? It's not something you can really check statically (it's undecidable if you're being pedantic). With my patch, it's easy: if you aren't sorting, you don't have to pay anything. If you are sorting, you either win or lose, based on whether you're sorting type-homogeneous data or not (which you basically always are). If anything, I think my patch could be a starting point for later optimization along these lines, depending on whether the problem raised in the previous paragraph can be addressed. What do you think?

On Sun, Mar 5, 2017 at 6:45 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Here is a radical thought... why don't lists track their common type themselves? There's only a few methods which can add items:
I had exactly the same thought. Lists would need to grow a new attribute, of course. I'm not sure how that would affect the object layout and word boundaries. But there might be free space for another attribute slot. The real question is whether doing this is a win. On each append/mutation operation we would need to do a comparison to the __type_hint__ (assuming Steven's spelling of the attribute). That's not free. Balancing that, however, when we actually *did* a sort, it would be O(1) to tell if it was homogeneous (and also the actual type if yes) rather than O(N). This question isn't really subject to microbenchmarks though. If we added __type_hint__ as None/type-object and added those comparisons to it on .insert()/.append()/etc, then we would be slower by some increment while all we were doing was adding things. There could only be a win when the list is sorted (but a big win for that case). In real world code, how often are lists sorted? Also I'm not really confident that Elliot's estimates of 95% of lists being homogeneous holds, but the speedup he proposes would seem to win even if that percentage is a lot lower than 95%. If only 10% of lists in real world code ever get `my_list.sort()` called on them, Steven's idea is probably not good. If 50% of lists do, it probably is. But then, it depends just how *often* lists that get sorted are re-sorted too. Yours, David... P.S. I think that given that we can .append() then delete items to make a heterogenous list become homogeneous again, Elliot's idea is somewhat orthogonal to Steven's. That is, even if we had .__type_hint__ on lists, it might be a "false None" and it could still be worth doing Elliot's linear scan anyway. On the other hand, the None-ness on non-empty lists might be a good enough predictor of heterogeneity in real world code that the linear scan would almost always be a waste. I do not know without benchmarks against real codebases. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:09 PM David Mertz <mertz@gnosis.cx> wrote:
Exactly.
I would imagine that fewer than even 10% of lists in real world code ever get sorted. I mean, just crawl PyPI and look for `.sort()` or `sorted()`; you'll find it's not that common. I know, because I was hoping to be able to demonstrate non-trivial improvements on the benchmark suites, but they just don't sort enough for it to show up. You only see application-level speedups if you're doing a *lot* of sorting, like in DEAP Pareto selection (DEAP is a popular Python evolutionary algorithms library; you sometimes have to sort individuals in the population by fitness, and the population is huge). And the cost of doing the pre-sort check isn't that bad, anyway... I think that making *every* append slower just for sorting/etc would be a net loss. Anyway, like I said earlier, my patch could be a first step towards a more broad optimization like this.

On Sun, Mar 5, 2017 at 7:16 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
I think I must sort things a lot more often than most people :-). But also, it's not just the number of list objects that are sorted, but how often it's done. I could have a 10,000 line program with only one call to `my_list.sort()` in it... but that one line is something that is called inside an inner loop. This feels like it really needs profiling not just static analysis. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 8:23 PM David Mertz <mertz@gnosis.cx> wrote:
Totally -- but what I mean is if you look at the performance benchmark suite, for example, most of the benchmarks do not have the string "sort" in their source code (IIRC). I think that most applications don't spend most of their time sorting. That's not to say sorting isn't important -- I wouldn't have written my patch if I didn't think it is. I'm just saying it probably isn't a good idea to penalize *all* list use just to benefit the minority of list use that involves sorting.

On 06/03/17 03:08, David Mertz wrote:
I don't think anyone has mentioned this yet, but FWIW I think the 'type hint' may need to be tri-state: heterogeneous (NULL), homogeneous (the pointer to the type structure) and also "unknown" (a sentinel value - the address of a static char or something). Otherwise, a delete operation on the list would need to scan the list to work out if it had changed from heterogeneous to homogeneous (unless it was acceptable that once heterogeneous, a list is always considered heterogeneous - i.e., delete always sets the hint to NULL). Instead, delete would change a NULL hint to the sentinel (leaving a valid type hint as it is) and then prior to sorting - as the hint is being checked anyway - if it's the sentinel value, perform the pre-scan that the existing patch is doing to restore the knowledge of just what type of list it is. I'd prefer the sort optimization to be based on what my list contains NOW, not on what it may have contained some time in the past, so I'm not a fan of the "once heterogeneous, always considered heterogeneous" behaviour if it's cheap enough to avoid it. E.

On Tue, Mar 7, 2017 at 1:47 PM Erik <python@lucidity.plus.com> wrote:
Sure. Dictionaries actually don't implement this, though: as soon as they see a non-string key, they permanently switch to a heterogeneous state (IIRC). I think the bigger problem, though, is that most list use does *not* involve sorting, so it would be a shame to impose the non-trivial overhead of type-checking on *all* list use. With dictionaries, you have to type-check inserts anyway (for equality testing), so it's less of a problem. But the fundamental list operations *don't* require type-checking currently, so why add it in? In practice, the pre-sort check is *very* cheap, and its cost is only imposed on sort usage. Anyway, my patch could always be a precursor to a more general optimization along these lines. I'm almost finished fixing the problem Tim identified earlier in this thread; after that, it'll be ready for review!

Hi Elliot, On 07/03/17 21:10, Elliot Gorokhovsky wrote:
I'd be interested to know if this approach had been considered and rejected for dicts - but I think dicts are a bit of a special case anyway. Because they are historically a fundamental building block of the language (for name lookups etc) they are probably more sensitive to small changes than other objects.
Yes, I understand that issue - I just thought I'd mention something that hadn't been pointed out yet _IF_ the idea of a type hint were to be considered (that's the sub-thread I'm replying to). If you're not doing that, then fine - I just wanted to put down things that occurred to me so they were documented (if only for rejection). So, while I'm at it ;), here are some other things I noticed scanning the list object source (again, only if a type hint was considered): * What is the type hint of an empty list? (this probably depends on how naturally the code for all of the type hint checking deals with NULL vs "unknown"). * listextend() - this should do the right thing with the type hint when extending one list with another. * Several other methods ('contains', 'remove', 'count', 'index') also use PyObject_RichCompareBool(). They could also presumably benefit from the same optimisation (perhaps it's not all about sort() - perhaps this gives a little more weight to the idea).
Nice -- good job. E.
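[Editor's illustration: for the listextend() point above, the hint of the result can be derived from the two hints without rescanning either list. A minimal sketch, assuming Steven's proposed convention (a type object, or None for heterogeneous/unknown); the function name is hypothetical.]

    def merged_hint(hint_a, hint_b):
        if hint_a is None or hint_b is None:
            return None          # either side heterogeneous or unknown
        return hint_a if hint_a is hint_b else None

    # Note: an empty right-hand list also reports None in this scheme, so the
    # merge is conservative -- extending by [] would needlessly drop the hint.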

On Tue, Mar 7, 2017 at 4:36 PM, Erik <python@lucidity.plus.com> wrote:
Good point about list.extend(). I don't think __type_hint__ could help with .__contains__() or .count() or .remove(). E.g.:

In [8]: from fractions import Fraction as F

In [9]: lst = [1.0, 2.0, 1+0j, F(1,1)]

In [10]: 1 in lst
Out[10]: True

In [11]: lst.count(1)
Out[11]: 3

In [12]: lst.index(1)
Out[12]: 0

The list has absolutely nothing of the right type. Yet it contains an item, counts things that are equal, finds a position for an equal item. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Hi David, On 07/03/17 22:39, David Mertz wrote:
Sure, but if the needle doesn't have the same type as the (homogeneous) haystack, then the rich comparison would still need to be done as a fallback (and would produce the result you indicate). But if the needle and the homogeneous haystack have the _same_ type, then a more optimised version of the operation can be done. Regards, E.

On Tue, Mar 7, 2017 at 6:27 PM, Erik <python@lucidity.plus.com> wrote:
In [22]: class Eq(int):
    ...:     def __eq__(self, other):
    ...:         return True

In [23]: four, five, six = Eq(4), Eq(5), Eq(6)

In [24]: lst = [four, five, six]

In [25]: lst.count(Eq(7))
Out[25]: 3

How would this work (other than saying "don't do that it's perverse")? -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Tue, Mar 7, 2017 at 4:53 PM David Mertz <mertz@gnosis.cx> wrote:
There would be two needless checks in the equality testing. First, PyObject_RichCompareBool would see if other is a subclass of self, in which case other->tp_richcompare would be used iff. it is non-null. Otherwise, we would check if self->tp_richcompare is non-null, the check would pass, and we would call self.__eq__. See the flow chart on my poster (linked to in the first email on this thread).

On Tue, Mar 07, 2017 at 09:10:00PM +0000, Elliot Gorokhovsky wrote:
I think the bigger problem, though, is that most list use does *not* involve sorting,
There are other uses for a type hint apart from sorting. There may be optimized versions of other functions (sum, math.fsum, ...) and list methods (e.g. count, remove) -- anything that has to walk the list comparing items. All very hypothetical at the moment, I admit it...
so it would be a shame to impose the non-trivial overhead of type-checking on *all* list use.
You might be right, but my guess is that the overhead isn't quite as big as you may think, at least for some operations. The type-check itself is just a pointer compare, not an issubclass() operation. The types are either identical, or the list is heterogeneous. True, `mylist[i] = x` should be a quick assignment into an array of pointers, but there's a bounds check to ensure i is within the correct range. Other methods are even more expensive: `mylist[i:j] = [a...z]` has to move list items around, possibly allocate more memory, and update the list length; compared to that, the scan of a...z is probably minimal. `mylist.append(x)` is *usually* a simple assignment into the array (plus updating the list length) but every now and again it triggers a resize, which is costly. I don't think we can predict whether this is a net gain or loss just from first principles.
Anyway, my patch could always be a precursor to a more general optimization along these lines.
Indeed! Even if nothing else comes of this than your patch, thank you! -- Steve

On Tue, Mar 07, 2017 at 08:46:45PM +0000, Erik wrote:
I thought about that and rejected it as an unnecessary complication. Heterogeneous and unknown might as well be the same state: either way, you cannot use the homogeneous-type optimization. Part of the complexity here is that I'd like this flag to be available to Python code, not just a hidden internal state of the list. But my instinct for this could be wrong -- if anyone is interested to do some experiments on this, it might be that a three-state flag works out better in practice than two states.
In a later email, you corrected this: a delete operation need not touch the type hint (except when the last item is deleted, at which point it resets to None/unknown). With a three-state flag, you can make a three-way decision:
If the flag is Unknown:
- do an O(N) scan to determine whether the list is homogeneous or heterogeneous, then choose the optimized or unoptimized routine;
if the flag is Heterogeneous:
- there is no point doing a scan, so always choose the unoptimized routine;
if the flag is a type:
- the list is definitely homogeneous, so depending on the type, you may be able to choose an optimized routine.
Compared to that, a two-state flag misses some opportunities to run the optimized routine:
- list contains [1, 2, 3, 4, "foo"] so the hint is set to None;
- delete the last item;
- list is now homogeneous but the hint is still None.
But it also avoids bothering with an O(N) scan in some situations where the list really is heterogeneous. So there's both an opportunity cost and a benefit. There may be other opportunities to do the scan, such as when the underlying pre-allocated array resizes, so even with the two-state flag, "unknown" need not stay unknown forever.
I'd prefer the sort optimization to be based on what my list contains NOW, not on what it may have contained some time in the past,
Remember, we're talking about opportunities for applying an optimization here, nothing more. You're not giving up anything: at worst, the ordinary, unoptimized routine will run and you're no worse off than you are today.
It is not just a matter of the cost of tracking three states versus two. It is a matter of the complexity of the interface. I suppose this could be reported to Python code as None, False or the type. Although, I don't know about you, but I know I'll never be able to remember whether None means Unknown and False means Heterogeneous, or the other way around. Ultimately, this is all very pie-in-the-sky unless somebody tests just how expensive this is and whether the benefit is worthwhile. That's not going to be me: I'm not able to hack on the list C code. -- Steve
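[Editor's illustration: a tiny Python sketch of the three-way decision described a few messages up, with a module-level sentinel standing in for the "unknown" state. Every name here is hypothetical.]

    UNKNOWN = object()

    def scan_for_common_type(items):
        if not items:
            return None
        first = type(items[0])
        return first if all(type(x) is first for x in items) else None

    def pick_sort_path(hint, items):
        if hint is UNKNOWN:
            hint = scan_for_common_type(items)    # O(N), done only when needed
        if hint is None:
            return "unoptimized"                  # known heterogeneous
        return "optimized for " + hint.__name__   # known homogeneous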

On 08/03/17 00:18, Steven D'Aprano wrote:
Knowing it's definitely one of two positive states and not knowing which of those two states it is is not the same thing when it comes to what one can and can't optimize cheaply :) It sort of depends on how cheaply one can track the states though ...
Part of the complexity here is that I'd like this flag to be available to Python code, not just a hidden internal state of the list.
Out of interest, for what purpose? Generally, I thought Python code should not need to worry about low-level optimisations such as this (which are C-Python specific AIUI). A list.is_heterogeneous() method could be implemented if it was necessary, but how would that be used?
O(N) is worst case. Most of the anecdotal evidence in this thread so far seems to suggest that heterogeneous lists are not common. May or may not be true. Empirically, for me, it is true. Who knows? (and there is the question).
You are a little bit - the extra overhead of checking all of this (which is the unknown factor we're all skirting around ATM) costs. So converting a previously-heterogeneous list to a homogeneous list via a delete or whatever has a benefit if the optimisations can then be applied to that list many times in the future (i.e., once it becomes recognised as homogeneous again, it benefits from optimised paths in the interpreter). And of course, all that depends on your use case. It might work out better for one application over another. As you quite rightly point out, it needs someone to measure the alternatives and work out if _overall_ it has a positive impact ...
I didn't think any of this stuff would come back to Python code (I thought we were talking about C-Python specific implementation only). How is this useful to Python code?
Ultimately, this is all very pie-in-the-sky unless somebody tests just how expensive this is and whether the benefit is worthwhile.
I agree. As I said before, I'm just pointing out things I noticed while looking at the current C code which could be picked up on if someone wants to try implementing and benchmarking any of this. It sort of feels like an argument, but I hope we're just violently agreeing on a generally shared goal ;) Regards, E.

On Wed, Mar 08, 2017 at 01:20:19AM +0000, Erik wrote:
I mentioned earlier that I have code which has to track the type of list items, and swaps to a different algorithm when the types are not all the same. In practice, I just run the "mixed types" algorithm regardless, because the cost of doing a scan of the items in pure Python is too expensive relative to the time saved. Moving some of the work into the C infrastructure might change that. I'm not completely sure that this would, in fact, be useful to me, but I'd like the opportunity to experiment. I could try using a list subclass, but again, the cost of doing the type-checking in Python instead of C is (I think) prohibitive. Nevertheless, I understand that the burden of proving the usefulness of this is on me. (Or anyone else that wants to argue the case.)
A list.is_heterogeneous() method could be implemented if it was necessary, but how would that be used?
I would prefer to get the list item's type:

    if mylist.__type_hint__ is float:
        # run optimized float version
        ...
    elif mylist.__type_hint__ is int:
        ...
    else:
        # unoptimized version
        ...
It is also the best and average case. Given a homogeneous list of N items, for some arbitrary N between 0 and ∞, you have to look at all N items before knowing that they're all the same type. So for the homogeneous case, the best, worst and average are identically O(N). Given a heterogeneous list of N items where the first difference is found at index K, K can range from 1 through N-1. (By definition, it cannot be at index 0: you can only detect a difference in types after checking *two* items.) The worst case is that you don't find the difference until the last item, which is O(N). The best case is that you find the difference in position 1 and bail out early. On average, you will find the first difference halfway through the list, which makes it O(N/2) but that's just O(N). (Constant factors don't matter.) If you want to call that O(1) for the best heterogeneous case, I won't argue except to say that's rather against the spirit of Big Oh analysis in my opinion. I think it's more realistic to say it's O(N) across all combinations of best/worst/average and homogeneous/heterogeneous. But of course if you bail out early, the constant multiplier may differ.
That will strongly depend on where the data is coming from, but when it comes to sorting, randomly mixed types will usually fail since comparisons between different types are generally illegal:

py> [1, "a"].sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() < int()

-- Steve

On 08/03/17 11:07, Steven D'Aprano wrote:
Hmmm. Yes, I guess if the expensive version requires a lot of isinstance() messing or similar for each element then it could be better to have optimized versions for homogeneous lists of ints or strings etc.
If you know the list is homogeneous then the item's type is "type(mylist[0])". Also, having it be a function call gives an obvious place to put the transition from "unknown" to known state if the tri-state hint approach was taken. Otherwise, that would have to be hooked into the attribute access somehow. That's for someone who wants to try implementing it to decide and propose though :) E.

Can you assume that the list is of type(list[0]) and use that type's optimised sort? But in the optimised sort code, check that the types are as required. If you hit an element that is not of the required type then fall back to the unoptimised sort. So part of the list is sorted using optimised code and the sort is completed with the unoptimised code. Is that semantically clean? Barry

On Wed, Mar 8, 2017 at 2:14 PM Barry <barry@barrys-emacs.org> wrote:
Well, how would you tell if you've hit an element that is not of the required type? You'd have to check during every compare, right? And that's precisely what we're trying to avoid! The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance, and in the very likely case that our list is homogeneous, switch to an optimized special-case compare function. Even when we only do O(n) compares, my patch is still much faster (see benchmarks higher up in this thread). Why? Well, if you're doing the type checks during the compares, you're doing them across different function calls, with other stuff interspersed in between. So pipeline/branch prediction/caching is less able to optimize away the overhead of the safety checks (I don't know how CPUs work, but one of those is relevant here). With the pre-sort check, it's all in a single iteration through the list, and we're taking the same exact branch every time; it's much faster. Think iterating over a matrix row-wise versus iterating column-wise (not exactly relevant here since that's about cache, but that's the idea. Again, I don't really know how CPUs work). So in short: no, we can't just check as we go, because that's what we're already doing. But we can optimize significantly by checking in advance. I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically. Best Elliot

On Wed, Mar 8, 2017 at 2:58 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance,
I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically.
hmm -- I know folks like to say that "the constants don't matter", but it seems they do in this case:
without pre-checking: O(nlogn)
with pre-checking, if homogeneous: O(n) + O(nlogn)
so, if you ignore the constants, pre-checking can only add work. But I'm assuming (particularly with locality and branch prediction and all that included) the constant to type-check is much smaller than the constant to compare two unknown types, so:
TC*n + KC*n*logn  vs  UC*n*logn
where:
TC -- constant to type check
KC -- constant for a known-type compare
UC -- constant for an unknown-type compare
So if UC > TC/logn + KC, then this optimization makes sense. If UC > KC and UC >>> TC, then this all works out. But if n is HUGE, it may end up being slower (maybe more so with cache locality???). Which is why you need to profile all this carefully. So far Elliot's experiments seem to show it works out well. Which doesn't surprise me for built-ins like float and int that have a native compare, but apparently also for more complex types? How about custom classes? -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
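[Editor's illustration: plugging made-up constants into the inequality above, just to see the shape of the trade-off. None of these numbers are measured.]

    import math

    TC, KC, UC = 1.0, 2.0, 6.0   # type check, known-type compare, unknown-type compare
    for n in (10, 10**3, 10**6):
        prechecked = TC * n + KC * n * math.log2(n)
        generic = UC * n * math.log2(n)
        print(n, prechecked < generic, round(generic / prechecked, 2))

With these particular constants the pre-checked path wins at every n, and the ratio approaches UC/KC as n grows; with UC close to KC and TC not negligible, the comparison could go the other way, which is why profiling matters.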

It seems the trick for optimisation is to compare the type pointer of an object to see if it's the same as a type supported by the chosen optimised sort. It was not clear to me that you need to scan the list at the start to make sure it's homogeneous. Given that the type check is so cheap, will it slow the sort if you do the pointer check in the compare code? I am not suggesting you run the full-fat rich compare on each compare.
The whole point of my patch is that we do O(nlogn) compares, but only have O(n) elements, so it's much cheaper to do all the type checks in advance, and in the very likely case that our list is homogeneous, switch to an optimized special-case compare function.
So you do O(nlogn)*2 pointer compares with my suggestion, it seems? Which is smaller than the O(n) pointer checks?
Even when we only do O(n) compares, my patch is still much faster (see benchmarks higher up in this thread). Why? Well, if you're doing the type checks during the compares, you're doing them across different function calls, with other stuff interspersed in between. So pipeline/branch prediction/caching is less able to optimize away the overhead of the safety checks (I don't know how CPUs work, but one of those is relevant here). With the pre-sort check, it's all in a single iteration through the list, and we're taking the same exact branch every time; it's much faster. Think iterating over a matrix row-wise versus iterating column-wise (not exactly relevant here since that's about cache, but that's the idea. Again, I don't really know how CPUs work).
Provided the code is small I think both versions of the algorithm will benefit from cache and branch prediction.
So in short: no, we can't just check as we go, because that's what we're already doing. But we can optimize significantly by checking in advance. I mean, practically speaking, the advance check is basically free. The compare-time checks, in sum, are not, both asymptotically and practically.
I'm not clear this is true. But I have not read the code. Barry

On Thu, Mar 9, 2017 at 3:04 AM Barry Scott <barry@barrys-emacs.org> wrote:
So you do O(nlogn)*2 pointer compares with my suggestion, it seems? Which is smaller than the O(n) pointer checks?
Not sure what you mean here... pretty sure your inequality is backwards? Look -- the point is, we already *do* the pointer checks during every compare. That's the current implementation. I believe my benchmarks are sufficient to prove that my patch is much faster than the current implementation. Which makes sense, because we do one type check per object, as opposed to something like log n per object, and we do them all at once, which I do believe is faster than doing them across calls. Now, you're not entirely wrong -- the current implementation doesn't do the type checks as efficiently as it could. Specifically, it does them once in PyObject_RichCompareBool, and then *again* in ob_type->tp_richcompare (which it has to for safety: ob_type->tp_richcompare doesn't know the arguments have already been type-checked!) You could totally define optimized compares that do the type-checks more efficiently, and you would see a benefit, just not nearly as extreme as the benefit I demonstrate. Thanks again for your feedback! Elliot

On 9 March 2017 at 21:04, Barry Scott <barry@barrys-emacs.org> wrote:
Isn't there already always a scan of the iterable to build the keys array for sorting (even if no key keyword param is specified)? In which case adding the homogeneity check there seems like it shouldn't add much overhead at all (I have to say I was surprised by the 10+% slowdowns in some of the heterogeneous TimSort tests for this reason). And could the specific richcompares be refactored so there was a "we really know what the types are, no need to check" version available to sort() (with the type-checking version available for general use/unoptimised sorting)? Tim Delaney

[Tim Delaney <timothy.c.delaney@gmail.com>]
Isn't there already always a scan of the iterable to build the keys array for sorting (even if no key keyword param is specified)?
No - `.sort()` is a list method, and as such has nothing to do with arbitrary iterables, just lists (perhaps you're thinking of the `sorted()` function?). If no `key=` argument is passed, the list guts itself is used (as-is) as the vector of keys.
Those are worst cases where the current sort does very few compares (like just N-1 for a list of length N). Because they do "amazingly" few compares, they're already "amazingly" fast. And they also do little data movement (e.g., none at all for /sort & =sort, and N//2 pointer swaps for \sort). Because of that, any new O(N) overhead would make them significantly slower - unless the new overhead pays off by allowing a larger time saving than it costs (as it does when the list is same-type).

There is a "natural" place to insert "same type?" checks: the outer loop of the sort marches over the vector once, left to right, alternately identifying the next natural run, then possibly extending it and/or merging it into previous runs. The checks could be put there instead, but the code would be ugly and more invasive - I wouldn't even bother trying it.
They're not _just_ type-aware. For example, the special function for ints is specialized to integers that happen to fit in one (internal) "digit", and the special function for strings is specialized to those that happen to be stored in PyUnicode_1BYTE_KIND format. Adding such stuff to the public C API would be ... well, for a start, tedious ;-)

On Thu, Mar 9, 2017 at 2:04 AM, Barry Scott <barry@barrys-emacs.org> wrote:
I think you have a point here. IIUC, the current code makes no assumptions about type homogeneity, so it must do something generic at each compare. Perhaps that generic thing (of course, Elliot knows what it is) does do a pointer compare, and then something smart and fast for some built-ins, but there are still a few steps:

- are these both the same type?
- what type are they?
- how do I compare them?
- do the compare

These may well all be fast, but it's still a few steps, and they have to be done O(n*log(n)) times.

IIUC, Elliot's patch does something like this. First go through the whole list and do:

- what is the type of the first item? (only once)
- how do I compare that type? (only once)
- is everything else in the list the same type? ( O(n) )

Then the actual sort:

- do the compare ( O(n*log(n)) )

So there is one operation done n times, and one done n*log(n) times, rather than 4 done n*log(n) times. If the constants are about the same, that saves 3*n*log(n) - n operations, or -- if n is non-trivial -- about 3*n*log(n) operations: we go from 4*n*log(n) to n*log(n), about a 4-times speedup -- in theory. With all the other complications of computer performance, only profiling will tell you! (Also, it's not 4 times on the whole sort -- just the compare part -- you still need to shuffle the values around, which presumably takes at least as long as each of the above operations.) (Not sure about all four of those steps; it may be only three -- but still a 3-times speedup.)

Now Barry's idea: assume the list is homogeneous, and crap out if it turns out it's not.

Start off:

- what is the type of the first item? (only once)
- how do I compare that type? (only once)

Do the sort:

- is the next item the correct type? ( O(n*log(n)) )
- do the compare ( O(n*log(n)) )

So we now have 2 operations run O(n*log(n)) times -- not as good as only one, but still better than the 3 or 4 of the fully general sort. And this would have less impact on the non-homogeneous case. Though Elliot points out that this would be much harder to branch-predict, etc... so maybe not. Maybe worth trying out and profiling??

NOTE: I really have no idea what really goes on in that code -- so I may have this all wrong...

-CHB

Christopher Barker, Ph.D.
Oceanographer, NOAA/NOS/OR&R
Chris.Barker@noaa.gov
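For anyone who wants to measure the constants themselves, a minimal timing sketch (illustrative only; Elliot's real benchmark script is in the repository linked at the top of the thread, and you would run the same script under patched and unpatched interpreters to compare):

    import random, timeit

    def bench(make_list, n=10_000, repeats=200):
        data = make_list(n)
        # time only the sort of a fresh copy on each repetition
        return min(timeit.repeat(lambda: sorted(data), number=1, repeat=repeats))

    homogeneous = lambda n: [random.random() for _ in range(n)]
    # worst case from the thread: lots of floats with a single int at the end
    heterogeneous = lambda n: [random.random() for _ in range(n - 1)] + [1]

    print("floats:      ", bench(homogeneous))
    print("floats + int:", bench(heterogeneous))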

On Sun, Mar 5, 2017 at 6:45 PM, Steven D'Aprano <steve@pearwood.info> wrote:
For what it's worth, I suggested this a LONG time ago -- well before python-ideas existed.... I thought a "homogeneous sequence" could be really useful for all sorts of optimizations. (At the time I was writing C extensions, and often converting a lot of lists to numpy arrays -- which ARE homogeneous sequences.) Anyway -- it was roundly rejected by Guido and others; no one had any interest in the idea. But maybe now that there is a compelling use-case for the built-in object the time is right??

-CHB

Christopher Barker, Ph.D.
Oceanographer, NOAA/NOS/OR&R
Chris.Barker@noaa.gov

On Mon, Mar 6, 2017 at 4:45 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I can also imagine other places where knowing those one or two bits of information about homogeneity might potentially allow speedups: converting lists or tuples to numpy arrays, min, max, sum etc. If this extra one or two bits of information were tracked, and the overhead of doing that was very small, then operations over the collections could benefit regardless of the complexity of the algorithm, so also O(n) operations. Too bad that one common thing done with lists – iterating – does not have obvious benefits from type homogeneity in CPython. —Koos
-- + Koos Zevenhoven + http://twitter.com/k7hoven +

[Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com>]
(Summary of results: my patch at https://bugs.python.org/issue28685 makes list.sort() 30-50% [...]
Would someone please move the patch along? I expect it's my fault it's languished so long, since I'm probably the natural person to review it, but I've been buried under other stuff.

But the patch doesn't change anything about the sorting algorithm itself - even shallow knowledge of how timsort works is irrelevant. It's just plugging in a different bottom-level object comparison _function_ when that appears valuable.

I've said from the start that it's obvious (to me ;-) ) that it's an excellent tradeoff. At worst it adds one simple (pre)pass over the list doing C-level pointer equality comparisons. That's cheap. The worst-case damage is obviously small, the best-case gain is obviously large, and the best cases are almost certainly far more common than the worst cases in most code.

In reviewing my own code, after it was fiddled to work under Python 3, there are no mixed-type lists that are ever sorted. There are lists with complex objects, but whenever those are sorted there's a `key=` argument that reduces the problem to sorting ints or tuples of builtin scalar types. I don't care about anyone else's code ;-)

One subtle thing to look at: thread safety. IIRC, the patch plugged the comparison function into a file global. That's obviously hosed if multiple threads sort different kinds of lists simultaneously.
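A toy Python illustration of the hazard Tim describes -- a module-level global standing in for the file-static function pointer; all names here are invented for the sketch:

    import threading

    selected_compare = None   # analogue of the file-static function pointer

    def int_compare(a, b):
        assert isinstance(a, int)   # this compare assumes ints...
        return a < b

    def str_compare(a, b):
        assert isinstance(a, str)   # ...and this one assumes strings
        return a < b

    def toy_sort(lst):
        global selected_compare
        # "pre-sort check": choose the compare from the first element's type
        selected_compare = int_compare if isinstance(lst[0], int) else str_compare
        # Pretend this loop is timsort calling the chosen compare many times.
        # If the other thread runs its own pre-sort check meanwhile, it
        # overwrites selected_compare, and an assertion (non-deterministically,
        # depending on thread scheduling) fires.
        for a, b in zip(lst, lst[1:]):
            selected_compare(a, b)

    t1 = threading.Thread(target=toy_sort, args=([3, 1, 2] * 5000,))
    t2 = threading.Thread(target=toy_sort, args=(["c", "a", "b"] * 5000,))
    t1.start(); t2.start(); t1.join(); t2.join()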

On Sun, Mar 5, 2017 at 9:25 PM Tim Peters <tim.peters@gmail.com> wrote:
Thank you so much for the support! Yes to all of those things!
I'm adding that quote to the next version of my poster :)
Wow, that is a *very* good point. I never think about those kinds of things, being a n00b, so thanks for catching that... I'll have to go in and fix that. I'm not sure how, though, because the ISLT macro gets called in a bunch of different functions. The only way I can think of to fix it would be to pass the function pointer as an argument to *all* the functions that use ISLT, which would be pretty messy. What do you think would be the easiest fix? Thanks! Elliot

On Sun, Mar 5, 2017 at 8:33 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
Could we make the file global a table of comparison functions and have each thread reference a position in the table? It would be fine if multiple threads happened to use the position of e.g. the int comparison, just so long as each chose the right slot in the table.

--
Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Mar 5, 2017 at 10:25 PM David Mertz <mertz@gnosis.cx> wrote:
I don't think that would solve anything. The problem is that the pre-sort check is carried out in the main sort function, but then lots of smaller functions need to know which compare to use. Since the scope isn't shared, how can we inform the smaller functions of which compare to use? I solved this problem by just putting the compare function pointer in global scope. Obviously, however, this is not thread-safe. So the question is, how can the smaller functions be made aware of the results of the pre-sort check? Your proposal would merely replace the problem of communicating the function pointer with the problem of communicating the appropriate index into the function table. The smaller functions wouldn't know which index they're supposed to use unless you passed it in.

However, I believe the following *would* work: suppose it were possible to access an identifier unique to each thread, such as that returned by gettid() on Linux. Then maintain a global table of (unique_thread_id, compare_func) pairs, and simply have the ISLT macro index into the table! A bit of a performance hit, sure, but hopefully not that bad? Certainly better than passing it in every time... the problem is, how can we get a unique identifier for the thread in a platform-independent way? Any ideas?
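Sketched in Python, the table idea would look roughly like this (threading.get_ident() standing in for gettid(); the names are invented):

    import threading

    # global table: thread id -> compare function chosen by that
    # thread's pre-sort check
    _compare_table = {}

    def set_compare_for_current_thread(func):
        _compare_table[threading.get_ident()] = func

    def islt(a, b):
        # what the ISLT macro would do under this scheme: one table lookup
        # per compare, then call this thread's chosen function
        return _compare_table[threading.get_ident()](a, b)

Every compare then pays an extra table lookup -- the "bit of a performance hit" mentioned above -- and, as comes up later in the thread, it still wouldn't address same-thread reentrancy.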

On Sun, Mar 5, 2017 at 10:45 PM Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
the problem is, how can we get a unique identifier for the thread in a platform-independent way? Any ideas?
Oh, I could probably just copy over code from threading.get_ident()... not sure if the key-value table is a good solution, though.

Another solution: check if there is more than one thread; if there is, then disable optimization. Is sorting in multithreaded programs common enough to warrant adding the complexity to deal with it?

On Sun, Mar 5, 2017 at 11:21 PM Jelle Zijlstra <jelle.zijlstra@gmail.com> wrote:
Right, of course. So, clearly, the only safe solution is to just keep everything in local scope and pass the compare function pointer into every function that calls ISLT or IFLT. Too bad. I'll rewrite the patch to implement this and open a new issue on the bug tracker (and close my current issue).

On Mon, Mar 6, 2017 at 3:28 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I think there is another safe solution: give up unsafe_object_compare. The compare functions for long, float, and unicode don't call list.sort() and don't release the GIL. So all you need is to back up the old compare_function before the sort and restore it after the sort.

On Sun, Mar 5, 2017 at 11:39 PM INADA Naoki <songofacandy@gmail.com> wrote:
That would definitely work, but I think it's too much of a sacrifice, for these reasons:

(1) You would have to give up on optimizing tuples (because tuple compares call PyObject_RichCompareBool, which could itself call arbitrary code). That's a real shame, because sorting tuples is common (e.g., in DEAP, you sort tuples of floats representing the fitnesses of individuals). The alternative would be to only optimize tuples of long/string/float. However, I think the code complexity of *that* would be greater than the, admittedly messy, solution of passing in the compare everywhere it's needed.

(2) There are lots of types, e.g. bytes, that would fall through the cracks.

(3) Correct me if I'm wrong, but this would make us depend on the GIL, right? And we don't want to depend on that if we don't have to, right?

It seems to me that the only solution here is to go in and add a compare function pointer to each function call. Extremely unfortunate, but necessary.

[Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com>]
Not a solution. Not even close ;-) Even if it made good sense, there's nothing to stop a custom __lt__ method from creating new threads _during_ a sort.

The best approach is indeed to pass the function pointer to every location that needs it. Note that a MergeState struct is already created per sort invocation; it isn't a file global for much the same reason.

However, it's not just threads that are a potential problem. Suppose a custom __lt__ method, invoked during a sort, does a new sort of its own. That's in the same thread, but may well want a different specialized comparison function. Solve _that_, and "the thread problem" will almost certainly solve itself by magic too. But solving "the thread problem" doesn't necessarily solve "the same-thread reentrancy problem".

That's why the MergeState struct is a function local ("auto" in silly C terminology). Since it lives in fresh stack space for each invocation of `listsort()`, it solves both the thread and reentrancy problems: every invocation of `listsort()` (regardless of whether from different threads or from the same thread) gets its own MergeState space.

You may or may not get simpler code by storing the function pointer as a new member of the MergeState struct. But however that's spelled, it does need to be passed to each function that needs it.
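A tiny pure-Python illustration of that same-thread reentrancy (the class and values are made up):

    class Fitness:
        """Toy class whose __lt__ triggers a sort of a different list,
        so sorting Fitness objects re-enters list.sort() mid-sort."""
        def __init__(self, score):
            self.score = score

        def __lt__(self, other):
            # nested sort, same thread, different element type
            ["pear", "apple", "orange"].sort()
            return self.score < other.score

    items = [Fitness(3), Fitness(1), Fitness(2)]
    items.sort()   # the inner sorts would want a string-specialized compare,
                   # while the outer sort wants something else entirely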

[Chris Angelico <rosuav@gmail.com>]
CPython is full of defensive code protecting against malicious crap. That's why it rarely crashes ;-)

    def __lt__(self, other):
        return self.size < other.size

Looks harmless? Can't tell! For all we know, there are proxy objects, and other.__getattr__ invokes some elaborate library to open a socket in a new thread to fetch the value of `size` over a network.

On Mon, Mar 6, 2017 at 5:52 PM, Tim Peters <tim.peters@gmail.com> wrote:
Exactly. It's always fun to discover some nice tidy exception that cleanly copes with a ridiculous situation.

    def gen():
        yield next(g)

    g = gen()
    next(g)

Fortunately in this case, the solution isn't to say "SystemError: cannot create threads while sorting", but even if that were the case, I can't imagine that much production code would be stopped by it. ChrisA

On Sun, Mar 5, 2017 at 11:31 PM Tim Peters <tim.peters@gmail.com> wrote:
Right. It's a real shame, because the patch as it stands right now is extremely surgical, but there's clearly no way around it. There are some functions that don't take in MergeState (e.g. gallop_left) but that call ISLT/IFLT, so I think I'll just add a compare function pointer parameter to all the function calls. I mean, the diff will be a lot hairier, which might make the patch more difficult to review, but when it comes down to it the code complexity won't really increase, so overall I don't think this is the end of the world. Thanks so much for pointing this out! P.S. Is it OK if I close my current issue on the bug tracker and open a new issue, where I'll post the revised patch? The writing on my current issue uses the old, less-rigorous benchmarks, and I feel it would be less confusing if I just made a new issue and posted the correct benchmarks/discussion at the top. The current issue doesn't have many comments, so not much would be lost by just linking to it from the new issue. If this violates bug tracker etiquette, however, :) I'll just post the revised patch on my current issue.

On 3/6/2017 1:56 AM, Elliot Gorokhovsky wrote:
That would be normal. Issues often get revised patches, sometimes with more severe changes than this. Or people post competing patches. Do reference this thread, and quote Tim's approval in principle, if he did not post on the tracker. -- Terry Jan Reedy

Just submitted a PR implementing this: https://github.com/python/cpython/pull/582 -- just need someone to review it now :) Thanks for all your feedback, everyone!

Hi. I may be way off-base here, but having scanned the patch I'm not sure I agree that it's the right way forward. What seems to be happening is that the homogeneity of the list is determined somehow (whether tracked with a hint or scanned just-in-time) and then a specific comparison function for a known subset of built-in types is selected if appropriate. I had assumed that there would be an "apples-to-apples" comparison function in the type structure and that the patch was simply tracking the list's homogeneity in order to enter a (generic) alternative loop to call that function over PyObject_RichCompare(). Why is that not the case? When a new C-level type is introduced (either a built-in or an extension module), why does the list object's code need to know about it in order to perform this optimisation? Why is there not a "tp_apple2apple" slot in the type structure which higher level functions (including the RichCompare() stuff - the first thing that function does is check the type of the objects anyway) can call if it determines that the two objects have the same type? Such a slot would also speed up "contains", "count", etc (for all classes) with no extra work, and no overhead of tracking or scanning the sequence's homogeneity. E.

There is an "apple to apple" compare function. It's unsafe_object_compare. If you look at the pre-sort check, you will find that if the list is homogeneous but not of float, int, string, or tuple, it sets compare_funcs.key_richcompare = ob_type->tp_richcompare and sets compare_funcs.key_compare = unsafe_object_compare. The latter is a wrapper for the former, bypassing the type checks in PyObject_RichCompareBool. Examples of benchmarks that use this functionality include the non-latin string and unbounded int benchmarks. In the table at the bottom of my patch description, it's described as follows: Compare function for general homogeneous lists; just a wrapper for ob_type->tp_richcompare, which is stored by the pre-sort check at compare_funcs.key_richcompare. This yields modest optimization (neighbourhood of 10%), but we generally hope we can do better. Further, in the code, the comments describe it as follows: /* Homogeneous compare: safe for any two compareable objects of the same type. * (compare_funcs.key_richcompare is set to ob_type->tp_richcompare in the * pre-sort check.) */ Does that answer your question? On Thu, Mar 9, 2017 at 6:18 PM Erik <python@lucidity.plus.com> wrote:
participants (17)
-
Barry
-
Barry Scott
-
Chris Angelico
-
Chris Barker
-
David Mertz
-
Elliot Gorokhovsky
-
Erik
-
INADA Naoki
-
Jelle Zijlstra
-
Koos Zevenhoven
-
MRAB
-
Nick Timkovich
-
Pavol Lisy
-
Steven D'Aprano
-
Terry Reedy
-
Tim Delaney
-
Tim Peters