Re: [Python-ideas] [Python-Dev] Optimizing list.sort() by checking type in advance

[please restrict follow-ups to python-ideas]

Let's not get hung up on meta-discussion here - I always thought "massive clusterf**k" was a precise technical term anyway ;-)

While timing certainly needs to be done more carefully, it's obvious to me that this approach _should_ pay off significantly when it applies. Comparisons are extraordinarily expensive in Python, precisely because of the maze of test-and-branch code it requires just to figure out which bottom-level comparison function to invoke each time. That's why I spent months of my life (overall) devising a sequence of sorting algorithms for Python that reduced the number of comparisons needed.

Note that when Python's current sort was adopted in Java, they still kept a quicksort variant for "unboxed" builtin types. The adaptive merge sort incurs many overheads that often cost more than they save unless comparisons are in fact very expensive compared to the cost of pointer copying (and in Java comparison of unboxed types is cheap). Indeed, for native numeric types, where comparison is dirt cheap, quicksort generally runs faster than mergesort despite doing _more_ comparisons (because mergesort does so much more pointer copying).

I had considered something "like this" for Python 2, but didn't pursue it because comparison was defined between virtually any two types (34 < [1], etc.), and people were careless about that (both by design and by accident). In Python 3, comparison "blows up" for absurdly mixed types, so specializing for homogeneously-typed lists is a more promising idea on the face of it.

Determining _whether_ a list's objects have a common type requires just len(list)-1 C-level pointer comparisons, and so goes fast. So I expect that, when it applies, this would speed even sorting an already-ordered list with at least 2 elements.

For a mixed-type list with at least 2 elements, it will always be a pure loss. But (a) I expect such lists are uncommon (and especially uncommon in Python 3); and (b) a one-time scan doing C-level pointer comparisons until finding a mismatched type is bound to be a relatively tiny cost compared to the expense of all the "rich comparisons" that follow.

So +1 from me on pursuing this.

Elliot, please:

- Keep this on python-ideas. python-dev is for current issues in Python development, not for speculating about changes.

- Open an issue on the tracker: https://bugs.python.org/

- At least browse the info for developers: https://docs.python.org/devguide/

- Don't overlook Lib/test/sortperf.py. As is, it should be a good test of what your approach so far _doesn't_ help, since it sorts only lists of floats (& I don't think you're special-casing them). If the timing results it reports aren't significantly hurt (and I expect they won't be), then add specialization for floats too and gloat about the speedup :-)

- I expect tuples will also be worth specializing (complex sort keys are often implemented as tuples).

Nice start! :-)
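A minimal sketch of the homogeneity pre-scan described above - hypothetical illustration code, not the actual patch; it assumes it is compiled against CPython's headers, and the function name is made up:

    #include "Python.h"

    /* Hypothetical sketch: decide whether every element of a list has
     * exactly the same type.  This is the len(list)-1 C-level pointer
     * comparisons mentioned above; on a mixed-type list it stops at the
     * first mismatch, so the "pure loss" case stays cheap. */
    static int
    list_is_type_homogeneous(PyListObject *list)
    {
        Py_ssize_t i, n = Py_SIZE(list);
        PyTypeObject *type;

        if (n < 2)
            return 1;                   /* trivially homogeneous */
        type = Py_TYPE(list->ob_item[0]);
        for (i = 1; i < n; i++) {
            if (Py_TYPE(list->ob_item[i]) != type)
                return 0;               /* mixed types: use the generic rich compares */
        }
        return 1;                       /* safe to pick a type-specific compare */
    }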

On 11.10.2016 05:02, Tim Peters wrote:
Let's not get hung up on meta-discussion here - I always thought "massive clusterf**k" was a precise technical term anyway ;-)
I thought so as well. ;) http://www.urbandictionary.com/define.php?term=clusterfuck Cheers, Sven

Warning: the contents of this message may be dangerous for readers with heart conditions.

So today I looked at PyFloat_RichCompare. I had been scared initially because it was so complicated; I was worried I would miss some important error check or something if I special-cased it. So I took the following approach: I replaced all statements of the form PyLong_Check(w) with 0, and all statements of the form PyFloat_Check(w) with 1, because that's the whole point; we're cutting out type checks. I then cut out all blocks preceded by if (0). So it's definitely safe.

And it turns out that if you do that, PyFloat_RichCompare becomes ONE LINE (after we set op=Py_LT)!!! Just:

    return (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w)) == 0 ? Py_False : Py_True;

So I thought, wow, this will give some nice numbers! But I underestimated the power of this optimization. You have no idea. It's crazy.

Since we're dealing with floats, I used Tim Peters' benchmark, Lib/test/sortperf.py, just modifying one function:

    # Old function Tim Peters wrote:
    def doit_fast(L):
        t0 = time.perf_counter()
        L.fastsort()
        t1 = time.perf_counter()
        print("%6.2f" % (t1-t0), end=' ')
        flush()

    # My function:
    def doit(L):
        F = FastList(L)
        f0 = time.perf_counter()
        F.fastsort()
        f1 = time.perf_counter()
        F = FastList(L)
        t0 = time.perf_counter()
        F.sort()
        t1 = time.perf_counter()
        print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
        flush()

So the benchmarking here is valid. I didn't write it. All I did was modify it to print percent improvement instead of sort time. The numbers below are the percent improvement of my sort vs the default sort (just clone my repo and run python sortperf.py to verify):

     i    2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
    15   32768  44.11%  54.69%  47.41%  57.61%  50.17%  75.24%  68.03%  65.16%  82.16%
    16   65536  14.14%  75.38%  63.44%  56.56%  67.99%  66.19%  50.72%  61.55%  61.87%
    17  131072  73.54%  60.52%  60.97%  61.63%  52.55%  49.84%  68.68%  84.12%  65.90%
    18  262144  54.19%  55.34%  54.67%  54.13%  55.62%  52.88%  69.30%  74.86%  72.66%
    19  524288  55.12%  53.32%  53.77%  54.27%  52.97%  53.53%  67.55%  76.60%  78.56%
    20 1048576  55.05%  51.09%  60.05%  50.69%  62.98%  50.20%  66.24%  71.47%  61.40%

This is just insane. This is crazy. I didn't write this benchmark; this is a valid benchmark. Tim Peters wrote it. And except for one trial, they all show more than 50% improvement. Some in the 70s and 80s. This is crazy. This is so cool. I just wanted to share this with you guys. I'll submit a patch to bugs.python.org soon; I just have to write a special-case comparison for tuples and then I'm basically done.

This is so cool!!!!!!!!! 50% man!!!!! Crazy!!!!!

Elliot

On Mon, Oct 10, 2016 at 9:02 PM Tim Peters <tim.peters@gmail.com> wrote:

On 12 October 2016 at 02:16, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
Not to take away from the potential for speed improvements (which do indeed seem interesting), but I'd ask that folks avoid using mental health terms to describe test results that we find unbelievable. There are plenty of other adjectives we can use, and a text-based medium like email gives us a chance to proofread our posts before we send them. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 11 October 2016 at 17:49, Nick Coghlan <ncoghlan@gmail.com> wrote:
I'd also suggest toning down the rhetoric a bit (all-caps title, "the contents of this message may be dangerous for readers with heart conditions", etc.). Your results do seem good, but it's a little hard to work out what you actually did, and how your results were produced, through the hype. It'll be much better when someone else has a means to reproduce your results to confirm them. In all honesty, people have been working on Python's performance for a long time now, and I'm more inclined to think that a 50% speedup is a mistake rather than an opportunity that's been missed for all that time. I'd be happy to be proved wrong, but for now I'm skeptical. Please continue working on this - I'd love my skepticism to be proved wrong! Paul

On 10/11/2016 2:30 PM, Paul Moore wrote:
I triple the motion. In general, all caps = spam or worse and I usually don't even open such posts. Elliot, to me, all caps means IGNORE ME. I suspect this is not what you want.
I'm not, in the same sense, even though Elliot suggested that we should be ;-). His key insight is that if all members of a list have the same type (which is a common 'special case'), then we can replace the general, somewhat convoluted, rich-comparison function, containing at least two type checks, with a faster special-case comparison function without any type checks. Since Python floats wrap machine doubles, I expect that float may have the greatest speedup.
Please continue working on this - I'd love my skepticism to be proved wrong!
It may be the case now that sorting a list of all floats is faster than a mixed list of ints and floats. I expect that it definitely will be with a float comparison function. -- Terry Jan Reedy

I honestly appreciate your work on this. However, don't write as if you were trying to sell something to the people on the mailing list. "INSANE FLOAT PERFORMANCE!!!" seems like the title of a bad YouTube video or something lower than that, while you are just trying to tell us about "Relevant performance improvement when sorting floating point numbers", a subject which is much clearer, less intrusive, and more professional than the one you used. Thanks for your efforts, nonetheless. -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@mafagafogigante.org

So I got excited here. And the reason why is that I got those numbers *on Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I figured there was probably a problem with the way I was timing, and certainly the gains couldn't be as extreme as they suggested. But this is on a benchmark that's already in the codebase!

Here is a detailed explanation of how to reproduce my results, and the circumstances that would lead them to be invalid:

******************************************

To reproduce, just activate a virtualenv, and then clone https://github.com/embg/python-fast-listsort.git. Then python setup.py install and python sortperf.py.

Now let's look at what sortperf.py does and how it relates to Tim's benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made three changes:

1. I added an import, "import fastlist". This obviously would not make sorting twice as fast.

2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" + " %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again irrelevant.

3. I changed the timing function:

    # from this
    def doit_fast(L):
        t0 = time.perf_counter()
        L.fastsort()
        t1 = time.perf_counter()
        print("%6.2f" % (t1-t0), end=' ')
        flush()

    # to this
    def doit(L):
        F = FastList(L)
        f0 = time.perf_counter()
        F.fastsort()
        f1 = time.perf_counter()
        F = FastList(L)
        t0 = time.perf_counter()
        F.sort()
        t1 = time.perf_counter()
        print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
        flush()

*******************************************

So what we've shown is that (1) if you trust the existing sorting benchmark and (2) if my modification to doit() doesn't mess anything up (I leave this up to you to judge), then the measurements are valid. Which is a pretty big deal (50% !!!!!!!), hence my overexcitement.

****************************************

Now I'd like to respond to responses (the one I'm thinking of was off-list, so I don't want to quote it) I've gotten questioning how such a small optimization, bypassing the type checks, could possibly have such a large effect, even in theory. Here's my answer:

Let's ignore branch prediction and cache for now and just look at a high level. The cost of sorting is related to the cost of a single comparison, because the vast majority of our time (let's say certainly at least 90%, depending on the list) is spent in comparisons. So let's look at the cost of a comparison.

Without my optimization, comparisons for floats (that's what this benchmark looks at) go roughly like this:

1. Test the types of left and right for PyObject_RichCompare (which costs two pointer dereferences) and compare them. "3 ops" (quotes because counting ops like this is pretty hand-wavy). "2 memory accesses".

2. Get the address of the float compare method from PyFloat_Type->tp_richcompare. "1 op". "1 memory access".

3. Call the function whose address we just got. "1 op". "Basically 0 memory accesses because we count the stack stuff in that 1 op".

4. Test the types of left and right again in PyFloat_RichCompare and compare both of them to PyFloat_Type. "4 ops". "2 memory accesses".

5. Get the floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever. "2 ops". "2 memory accesses".

6. Compare the floats and return. "2 ops".

Now let's tally the "cost" (sorry for the quotes here, just trying to emphasize this is an intuitive, theoretical explanation for the numbers which doesn't take the hardware into account): "13 ops, 7 memory accesses".

Here's what it looks like in my code:

1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".

2. Compare the floats and return. "2 ops".

Tally: "4 ops, 2 memory accesses".

Now you can argue branch prediction alleviates a lot of this cost, since we're taking the same branches every time. But note that, branch prediction or not, we still have to do all of those memory accesses, and since they're pointers to places all over memory, they miss the cache basically every time (correct me if I'm wrong). So memory-wise, we really are doing something like a 7:2 ratio, and op-wise, perhaps not as bad because of branch prediction, but still, 13:4 is probably bad no matter what's going on in the hardware.

Now consider that something like 90% of our time is spent in those steps. Are my numbers really that unbelievable?

Thanks for everything; looking forward to writing this up as a nice LaTeX doc with graphs and perf benchmarks and all the other rigorous goodies, as well as a special-case cmp func for homogeneous tuples and a simple patch file,

Elliot
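To see roughly where those counts come from, here is a heavily simplified C sketch of the two paths. This is illustration only: the real do_richcompare also handles subclasses, NULL slots, and Py_NotImplemented, and the function names lt_generic and lt_float_specialized are made up:

    #include "Python.h"

    /* Roughly the generic path (steps 1-3): dispatch through the type's
     * rich-compare slot, which for floats lands in PyFloat_RichCompare,
     * where both types get checked again before the actual comparison
     * (steps 4-6). */
    static PyObject *
    lt_generic(PyObject *v, PyObject *w)
    {
        richcmpfunc f = Py_TYPE(v)->tp_richcompare;   /* slot lookup */
        return f(v, w, Py_LT);                        /* indirect call */
    }

    /* Roughly the specialized path (steps 1-2): the pre-scan already
     * proved both operands are exact floats, so just compare the doubles. */
    static PyObject *
    lt_float_specialized(PyObject *v, PyObject *w)
    {
        PyObject *res = (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w))
                        ? Py_True : Py_False;
        Py_INCREF(res);        /* Py_True/Py_False are shared objects */
        return res;
    }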

On 12 October 2016 at 06:58, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
Thanks for the clearer write-up - this is indeed very cool, and it's wonderful to see that the new assumptions permitted by Python 3 getting stricter about cross-type ordering comparisons may lead to speed-ups for certain common kinds of operations (i.e. sorting lists where the sorting keys are builtin immutable types). Once you get to the point of being able to do performance measurements on a CPython build with a modified list.sort() implementation, you'll want to take a look at the modern benchmark suite in https://github.com/python/performance Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Oct 11, 2016 at 9:56 PM Nick Coghlan <ncoghlan@gmail.com> wrote:
Yup, that's the plan. I'm going to implement optimized compares for tuples, then implement this as a CPython build, and then run benchmark suites and write some rigorous benchmarks using perf/timeit.
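One hypothetical shape such a tuple comparison could take (the name tuple_lt_sketch is made up, and this is only an illustration, not the patch). The real win would come from the per-element comparisons inside the loop themselves being the specialized, type-check-free compares once the key type is known:

    #include "Python.h"

    /* Hypothetical sketch: Py_LT for two tuples, deciding on the first
     * pair of elements that differs, using the -1/0/1 error/bool
     * convention of PyObject_RichCompareBool. */
    static int
    tuple_lt_sketch(PyObject *v, PyObject *w)
    {
        Py_ssize_t i, vlen, wlen, n;

        vlen = PyTuple_GET_SIZE(v);
        wlen = PyTuple_GET_SIZE(w);
        n = vlen < wlen ? vlen : wlen;
        for (i = 0; i < n; i++) {
            PyObject *x = PyTuple_GET_ITEM(v, i);
            PyObject *y = PyTuple_GET_ITEM(w, i);
            int eq = PyObject_RichCompareBool(x, y, Py_EQ);
            if (eq < 0)
                return -1;                            /* comparison raised */
            if (!eq)
                return PyObject_RichCompareBool(x, y, Py_LT);
        }
        return vlen < wlen;                           /* shared prefix is equal */
    }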

[Elliot Gorokhovsky]
Warning: the contents of this message may be dangerous for readers with heart conditions.
It appears some people are offended by your exuberance. Ignore them ;-)
Right. And this is why floats will likely be the best case for your approach: all the overheads of finding "the right" bottom-level comparison are weighed against what's basically a single machine instruction to do the actual work of comparing two floats.
return (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w)) == 0 ? Py_False : Py_True;
In real life, this is better written as:

    return (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w)) ? Py_True : Py_False;

That is, the comparison to 0 was an artificial complication left over from simplifying the code. However, that code is buggy, in a way that may not show up for a long time. Keeping track of reference counts is crucial. It needs to be more like:

    PyObject *res;
    res = (PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w)) ? Py_True : Py_False;
    Py_INCREF(res);
    return res;
So I thought, wow, this will give some nice numbers! But I underestimated the power of this optimization. You have no idea. It's crazy.
It's in the ballpark of what I expected :-)
Since we're dealing with floats, I used Tim Peters' benchmark: Lib/test/sortperf.py, just modifying one function:
It's actually Guido's file, although I probably changed it much more recently than he did. As I explained in a different msg, it's _not_ a good benchmark. It's good at distinguishing among "really really good", "not a huge change", and "really really bad", but that's all. At the start, such gross distinctions are all anyone should care about. One thing I'd like to see: do exactly the same, but comment out your float specialization. It's important to see whether sortperf.py says "not a huge change" then (which I expect, but it needs confirmation). That is, it's not enough for some cases to get 4x faster if other cases get much slower.
Numbers in benchmarks always need to be explained, because they're only clear to the person who wrote the code. From your code, you're essentially computing (old_time - new_time) / old_time * 100.0 but in an obfuscated way. So long as the new way is at least as fast, the only possible values are between 0% and 100%. That's important to note. Other people, e.g., measure these things by putting "new_time" in the denominator. Or just compute a ratio, like old_time / new_time. The way you're doing it, e.g., "75%" means the new way took only a quarter of the time of the old way - it means new_time is 75% smaller than old_time. I'm comfortable with that, but it's not the _clearest_ way to display timing differences so dramatic; old_time / new_time would be 4.0 in this case, which is easier to grasp at a glance.
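To make the two conventions concrete with invented numbers: if old_time = 0.40 s and new_time = 0.10 s, the script prints 100*(1 - 0.10/0.40) = 75%, while the ratio convention would report old_time/new_time = 4.0 - the same measurement, displayed two different ways.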
So the benchmarking here is valid.
No, it sucks ;-) But it's perfectly adequate for what I wanted to see from it :-)
If it _didn't_ suck, all the numbers in a column would be about the same :-)

A meta-note: when Guido first wrote sortperf.py, machines - and Python! - were much slower. Now sorting 2**15 elements goes so fast that this gross timing approach is really only good for making the grossest distinctions. On my box today, even the slowest case (*sort on 2**20 elements) takes under half a second. That's "why" the numbers in each column are much more stable across the last 3 rows than across the first 3 rows - the cases in the first 3 rows take so little time that any timing glitch can skew them badly. Note that you can pass arguments to sortperf.py to check any range of power-of-2 values you like, but if you're aware of the pitfalls the output as-is is fine for me.
This is just insane. This is crazy.
Yet nevertheless wholly expected ;-)
... This is so cool.
It is! Congratulations :-)

Participants (7):
- Bernardo Sulzbach
- Elliot Gorokhovsky
- Nick Coghlan
- Paul Moore
- Sven R. Kunze
- Terry Reedy
- Tim Peters