An enormous amount of research has been done on sorting since the last time I wrote a sort for Python. Major developments have been in two areas:

1. Adaptive sorting. Sorting algorithms are usually tested on random data, but in real life you almost never see random data. Python's sort tries to catch some common cases of near-order via special-casing. The literature has since defined more than 15 formal measures of disorder, and developed algorithms provably optimal in the face of one or more of them. But this is O() optimality, and theoreticians aren't much concerned about how big the constant factor is. Some researchers are up front about this, and toward the end of one paper with "practical" in its title, the author was overjoyed to report that an implementation was only twice as slow as a naive quicksort <wink>.

2. Pushing the worst-case number of comparisons closer to the information-theoretic limit (ceiling(log2(N!))).

I don't care much about #2 -- in experiments conducted when it was new, I measured the # of comparisons our samplesort hybrid did on random inputs, and it was never more than 2% over the theoretical lower bound, and typically closer. As N grows large, the expected case provably converges to the theoretical lower bound. There remains a vanishingly small chance for a bad case, but nobody has reported one, and at the time I gave up trying to construct one.

Back on Earth, among Python users the most frequent complaint I've heard is that list.sort() isn't stable. Alex is always quick to trot out the appropriate DSU (Decorate Sort Undecorate) pattern then (a minimal sketch of it appears after this message), but the extra memory burden for that can be major (a new 2-tuple per list element costs about 32 bytes, then 4 more bytes for a pointer to it in a list, and 12 more bytes that don't go away to hold each non-small index).

After reading all those papers, I couldn't resist taking a crack at a new algorithm that might be practical, and have something you might call a non-recursive adaptive stable natural mergesort / binary insertion sort hybrid. In playing with it so far, it has two bad aspects compared to our samplesort hybrid:

+ It may require temp memory, up to 2*N bytes worst case (one pointer each for no more than half the array elements).

+ It gets *some* benefit for arrays with many equal elements, but not nearly as much as I was able to hack samplesort to get. In effect, partitioning is very good at moving equal elements close to each other quickly, but merging leaves them spread across any number of runs. This is especially irksome because we're sticking to Py_LT for comparisons, so can't even detect a==b without comparing a and b twice (and then it's a deduction from "not a < b and not b < a"). Given the relatively huge cost of comparisons, it's a timing disaster to do that (compare twice) unless it falls out naturally. It was fairly natural to do so in samplesort, but not at all in this sort.

It also has good aspects:

+ It's stable (items that compare equal retain their relative order, so, e.g., if you sort first on zip code, and a second time on name, people with the same name still appear in order of increasing zip code; this is important in apps that, e.g., refine the results of queries based on user input).

+ The code is much simpler than samplesort's (but I think I can fix that <wink>).

+ It gets benefit out of more kinds of patterns, and without lumpy special-casing (a natural mergesort has to identify ascending and descending runs regardless, and then the algorithm builds on just that).
+ Despite that I haven't micro-optimized it, in the random case it's almost as fast as the samplesort hybrid. In fact, it might have been a bit faster had I run tests yesterday (the samplesort hybrid got sped up by 1-2% last night). This one surprised me the most, because at the time I wrote the samplesort hybrid, I tried several ways of coding mergesorts and couldn't make it as fast.

+ It has no bad cases (O(N log N) is worst case; N-1 compares is best).

Here are some typical timings, taken from Python's sortperf.py, over identical lists of floats:

Key:
    *sort: random data
    \sort: descending data
    /sort: ascending data
    3sort: ascending data but with 3 random exchanges
    ~sort: many duplicates
    =sort: all equal
    !sort: worst case scenario

That last one was a worst case for the last quicksort Python had before it grew the samplesort, and it was a very bad case for that. By sheer coincidence, turns out it's an exceptionally good case for the experimental sort:

samplesort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.13   0.01   0.01   0.10   0.04   0.01   0.11
16   65536   0.24   0.02   0.02   0.23   0.08   0.02   0.24
17  131072   0.54   0.05   0.04   0.49   0.18   0.04   0.53
18  262144   1.18   0.09   0.09   1.08   0.37   0.09   1.16
19  524288   2.58   0.19   0.18   2.34   0.76   0.17   2.52
20 1048576   5.58   0.37   0.36   5.12   1.54   0.35   5.46

timsort
15   32768   0.16   0.01   0.02   0.05   0.14   0.01   0.02
16   65536   0.24   0.02   0.02   0.06   0.19   0.02   0.04
17  131072   0.55   0.04   0.04   0.13   0.42   0.04   0.09
18  262144   1.19   0.09   0.09   0.25   0.91   0.09   0.18
19  524288   2.60   0.18   0.18   0.46   1.97   0.18   0.37
20 1048576   5.61   0.37   0.35   1.00   4.26   0.35   0.74

If it weren't for the ~sort column, I'd seriously suggest replacing the samplesort with this. 2*N extra bytes isn't as bad as it might sound, given that, in the absence of massive object duplication, each list element consumes at least 12 bytes (type pointer, refcount and value) + 4 bytes for the list pointer. Add 'em all up and that's a 13% worst-case temp memory overhead.
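For readers who haven't seen the DSU pattern mentioned above, a minimal hedged sketch (not Alex's exact recipe; the index in the decorated tuple is what guarantees stability and keeps the items themselves from ever being compared):

import operator  # not needed here, but a key extractor often is in practice

def stable_sort_by(items, keyfunc):
    # Decorate: (key, original index, item).  The index breaks ties, so equal
    # keys keep their original relative order and the items are never compared.
    decorated = [(keyfunc(item), i, item) for i, item in enumerate(items)]
    decorated.sort()
    # Undecorate: throw the scaffolding away.
    return [item for key, i, item in decorated]

# Hypothetical sample data: (name, zip code) pairs.
people = [("Smith", 44101), ("Jones", 44102), ("Smith", 44100)]
by_zip = stable_sort_by(people, lambda p: p[1])
by_name = stable_sort_by(by_zip, lambda p: p[0])
# People with the same name still appear in order of increasing zip code.

The extra tuple and index per element are exactly the memory burden complained about above.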
On Sat, 20 Jul 2002, Tim Peters wrote:
After reading all those papers, I couldn't resist taking a crack at a new algorithm that might be practical, and have something you might call a non-recursive adaptive stable natural mergesort / binary insertion sort hybrid.
Great work, Tim! I've got several Python implementations of stable-sorts that I can now retire.
If it weren't for the ~sort column, I'd seriously suggest replacing the samplesort with this.
If duplicate keys cannot be more efficiently handled, why not add a list.stable_sort() method? That way the user gets to decide if they want the ~sort tax. If that case is fixed later, then there is little harm in having list.sort == list.stable_sort.
2*N extra bytes isn't as bad as it might sound, given that, in the absence of massive object duplication, each list element consumes at least 12 bytes (type pointer, refcount and value) + 4 bytes for the list pointer. Add 'em all up and that's a 13% worst-case temp memory overhead.
It doesn't bother me in the slightest (and I tend to sort big things). 13% is a reasonable trade-off for stability.

Thanks,
-Kevin
On Sat, Jul 20, 2002, Tim Peters wrote:
If it weren't for the ~sort column, I'd seriously suggest replacing the samplesort with this. 2*N extra bytes isn't as bad as it might sound, given that, in the absence of massive object duplication, each list element consumes at least 12 bytes (type pointer, refcount and value) + 4 bytes for the list pointer. Add 'em all up and that's a 13% worst-case temp memory overhead.
Any reason the list object can't grow a .stablesort() method? -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
[Aahz]
Any reason the list object can't grow a .stablesort() method?
I'm not sure. Python's samplesort implementation is right up there among the most complicated (by any measure) algorithms in the code base, and the mergesort isn't any simpler anymore. Yet another large mass of difficult code can make for a real maintenance burden after I'm dead. Here, guess what this does:

static int
gallop_left(PyObject *pivot, PyObject** p, int n, PyObject *compare)
{
    int k;
    int lo, hi;
    PyObject **pend;

    assert(pivot && p && n);
    pend = p+(n-1);
    lo = 0;
    hi = -1;
    for (;;) {
        IFLT(*(pend - lo), pivot)
            break;
        hi = lo;
        lo = (lo << 1) + 1;
        if (lo >= n) {
            lo = n;
            break;
        }
    }
    lo = n - lo;
    hi = n-1 - hi;
    while (lo < hi) {
        int m = (lo + hi) >> 1;
        IFLT(p[m], pivot)
            lo = m+1;
        else
            hi = m;
    }
    return lo;

fail:
    return -1;
}

There are 12 other functions that go into this, some less obscure, some more. Change "hi = -1" to "hi = 0" and you'll get a core dump, etc; it's exceedingly delicate, and because truly understanding it essentially requires doing a formal correctness proof, it's difficult to maintain; fight your way to that understanding, and you'll know why it sorts, but still won't have a clue about why it's so fast. I'm disinclined to add more code of this nature unless I can use it to replace code at least as difficult (which samplesort is).

An irony is that stable sorts are, by definition, pointless unless you *do* have equal elements, and the many-equal-elements case is the one known case where the new algorithm is much slower than the current one (indeed, I have good reason to suspect it's the only such case, and reasons beyond just that God loves a good joke <wink>).

It's OK by me if this were to become Python's only sort. Short of that, I'd be happier contributing the code to a sorting extension module. There are other reasons the latter may be a good idea; e.g., if you know you're sorting C longs, it's not particularly difficult to do that 10x faster than Python's generic list.sort() can do it; ditto if you know you're comparing strings; etc. Exposing the binary insertion sort (which both samplesort and mergesort use) would also be useful to some people (it's a richer variant of bisect.insort_right). I'd prefer that Python-the-language have just one "really good general sort" built in.
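For the curious, the heart of gallop_left is exponential ("galloping") search: probe a sorted run at offsets 1, 3, 7, 15, ... until the key is bracketed, then binary-search only the last doubled gap. Here is a hedged Python sketch of that general idea -- not a translation of the C above, which also threads a comparison function through and handles error exits:

from bisect import bisect_left

def gallop(a, key):
    # Leftmost insertion point for key in sorted list a, found with roughly
    # 2*log2(position) comparisons instead of log2(len(a)) -- the win when
    # the answer sits near where you start looking.  Uses only <, as the
    # thread insists (Py_LT).
    if not a or not a[0] < key:
        return 0
    lastofs, ofs, n = 0, 1, len(a)
    while ofs < n and a[ofs] < key:      # gallop: 1, 3, 7, 15, ...
        lastofs, ofs = ofs, ofs * 2 + 1
    ofs = min(ofs, n)
    # Now a[lastofs] < key and (ofs == n or key <= a[ofs]); finish with a
    # binary search confined to the bracketed slice.
    return bisect_left(a, key, lastofs + 1, ofs)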
"A" == Aahz
writes:
A> Any reason the list object can't grow a .stablesort() method?

Because when a user looks at the methods of a list object and sees both .sort() and .stablesort() you now need to explain the difference, and perhaps give some hint as to why you'd want to choose one over the other. Maybe the teachers-of-Python in this crowd can give some insight into whether 1) they'd actually do this or just hand wave past the difference, or 2) whether it would be a burden to teaching. I'm specifically thinking of the non-programmer crowd learning Python.

I would think that most naive uses of list.sort() would expect a stable sort and wouldn't care much about any performance penalties involved. I'd put my own uses squarely in the "naive" camp. ;)

I'd prefer to see

- .sort() actually /be/ a stable sort in the default case
- list objects not be burdened with additional sorting methods (that way lies a footing-challenged incline)
- a module with more advanced sorting options, with functions suitable for list.sort()'s cmpfunc, and with derived classes (perhaps in C) of list for better performance.

-Barry
--- "Barry A. Warsaw"
Because when a user looks at the methods of a list object and sees both .sort() and .stablesort() you now need to explain the difference, and perhaps give some hint as to why you'd want to choose one over the other.
Or you could have an optional parameter that defaults to whatever the more sane value should be (probably stable), and when the user stumbles across this parameter they stumble across the docs too.

I think Tim's codebloat argument is more compelling.
"SG" == Scott Gilbert
writes:
SG> Or you could have an optional parameter that defaults to whatever
SG> the more sane value should be (probably stable), and when the user
SG> stumbles across this parameter they stumble across the docs too.
SG>
SG> I think Tim's codebloat argument is more compelling.

Except that in

    http://mail.python.org/pipermail/python-dev/2002-July/026837.html

Tim says:

    "Back on Earth, among Python users the most frequent complaint
    I've heard is that list.sort() isn't stable."

and here

    http://mail.python.org/pipermail/python-dev/2002-July/026854.html

Tim seems <wink> to be arguing against stable sort as being the default due to code bloat.

As Tim's Official Sysadmin, I'm only good at channeling him on one subject, albeit probably one he'd deem most important to his life: lunch. So I'm not sure if he's arguing for or against stable sort being the default. ;)

-Barry
[Barry Warsaw]
Except that in
http://mail.python.org/pipermail/python-dev/2002-July/026837.html
Tim says:
"Back on Earth, among Python users the most frequent complaint I've heard is that list.sort() isn't stable."
Yes, and because the current samplesort falls back to a stable sort when lists are small, almost everyone who cares about this and tries to guess about stability via trying small examples comes to a wrong conclusion.
and here
http://mail.python.org/pipermail/python-dev/2002-July/026854.html
Tim seems <wink> to be arguing against stable sort as being the default due to code bloat.
I'm arguing there against having two highly complex and long-winded sorting algorithms in the core. Pick one.

In favor of samplesort:

+ It can be much faster in very-many-equal-elements cases (note that ~sort lists have only 4 distinct values, each repeated N/4 times and spread uniformly across the whole list).

+ While it requires some extra memory, that lives on the stack and is O(log N). As a result, it can never raise MemoryError unless a comparison function does.

+ It's never had a bug reported against it (so is stable in a different sense <wink>).

In favor of timsort:

+ It's stable.

+ The code is more uniform and so potentially easier to grok, and because it has no random component is easier to predict (e.g., it's certain that it has no quadratic-time cases).

+ It's incredibly faster in the face of many more kinds of mild disorder, which I believe are very common in the real world. As obvious examples, you add an increment of new data to an already-sorted file, or paste together several sorted files. timsort screams in those cases, but they may as well be random to samplesort, and the difference in runtime can easily exceed a factor of 10 (a sketch constructing these two cases appears after this message). A factor of 10 is a rare and wonderful thing in algorithm development.

Against timsort:

+ It can require O(N) temp storage, although the constant is small compared to object sizes. That means it can raise MemoryError even if a comparison function never does.

+ Very-many-equal-elements cases can be much slower, but that's partly because it *is* stable, and preserving the order of equal elements is exactly what makes stability hard to achieve in a fast sort (samplesort can't be made stable efficiently).
As Tim's Official Sysadmin, I'm only good at channeling him on one subject, albeit probably one he'd deem most important to his life: lunch. So I'm not sure if he's arguing for or against stable sort being the default. ;)
All else being equal, a stable sort is a better choice. Alas, all else isn't equal. If Python had no sort method now, I'd pick timsort with scant hesitation. Speaking of which, is it time for lunch yet <wink>?
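As a concrete companion to the "mild disorder" examples above (an increment of new data appended to already-sorted data; several sorted files pasted together), here is a hedged sketch that builds those inputs and times today's list.sort() -- which is timsort -- against a fully shuffled copy. The sizes and the timing harness are arbitrary illustration, not anything from the thread:

import random
import time

def sort_time(data):
    data = list(data)                 # sort a fresh copy each time
    t0 = time.perf_counter()
    data.sort()
    return time.perf_counter() - t0

n = 1000000
rnd = random.Random(0)
already_sorted = sorted(rnd.random() for i in range(n))
# "add an increment of new data to an already-sorted file"
appended = already_sorted + [rnd.random() for i in range(n // 100)]
# "paste together several sorted files"
pasted = []
for part in range(4):
    pasted.extend(sorted(rnd.random() for i in range(n // 4)))
shuffled = already_sorted[:]
rnd.shuffle(shuffled)

for name, data in [("appended", appended), ("pasted", pasted), ("random", shuffled)]:
    print(name, sort_time(data))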
If you have access to a good library, you'll enjoy reading the original paper on samplesort; or a scan can be purchased from the ACM:

    Samplesort: A Sampling Approach to Minimal Storage Tree Sorting
    W. D. Frazer, A. C. McKellar
    JACM, Vol. 17, No. 3, July 1970

As in many papers of its time, the algorithm description is English prose and raises more questions than it answers, but the mathematical analysis is extensive. Two things made me laugh out loud:

1. The largest array they tested had 50,000 elements, because that was the practical upper limit given storage sizes at the time. Now that's such a tiny case that even in Python it's hard to time it accurately.

2. They thought about using a different sort method for small buckets,

       However, the additional storage required for the program would
       reduce the size of the input sequence which could be accommodated,
       and hence it is an open question as to whether or not the
       efficiency of the total sorting process could be improved in
       this way.

In some ways, life was simpler then <wink>.

for-example-i-had-more-hair-ly y'rs - tim
In an effort to save time on email (ya, right ...), I wrote up a pretty detailed overview of the "timsort" algorithm. It's attached. all-will-be-revealed-ly y'rs - tim
--- Tim Peters
In an effort to save time on email (ya, right ...), I wrote up a pretty detailed overview of the "timsort" algorithm. It's attached.
all-will-be-revealed-ly y'rs - tim
[Interesting stuff deleted.]
I'm curious if there is any literature that you've come across, or if you've done any experiments with merging more than two parts at a time. So instead of merging like such:

    A B C D E F G H I J K L
    AB CD EF GH IJ KL
    ABCD EFGH IJKL
    ABCDEFGH IJKL
    ABCDEFGHIJKL

You were to merge

    A B C D E F G H I J K L
    ABC DEF GHI JKL
    ABCDEF GHIJKL
    ABCDEFGHIJKL

(I realize that your merges are based on the lengths of the subsequences, but you get the point.)

My thinking is that many machines (probably yours for instance) have a cache that is 4-way associative, so merging only 2 blocks at a time might not be using the cache as well as it could. Also, changing from merging 2 blocks to 3 or 4 blocks at a time would change the number of passes you have to make (the log part of N*log(N)).

It's quite possible that this isn't worth the trade off in complexity and space (or your time :-). Keeping track of comparisons that you've already made could get ugly, and your temp space requirement would go from N/2 to possibly 3N/4... But since you're diving so deeply into this problem, I figured I'd throw it out there.

OTOH, this could be close to the speedup that heavily optimized FFT algs get when they go from radix-2 to radix-4. Just thinking out loud...
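For concreteness, the textbook way to do an m-way merge is to keep the current head of each run in a small heap. A hedged Python sketch of the mechanism being proposed (not anything the experimental sort does -- it merges exactly two runs at a time):

import heapq

def kway_merge(runs):
    # Merge any number of already-sorted lists by keeping each run's current
    # head in a heap of (value, run_index, position) tuples.  The run_index
    # breaks ties, which also keeps the merge stable (earlier runs win) and
    # avoids ever comparing two equal values a second time.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        out.append(value)
        pos += 1
        if pos < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos], i, pos))
    return out

print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]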
[Scott Gilbert]
I'm curious if there is any literature that you've come across, or if you've done any experiments with merging more than two parts at a time.
There's a literal mountain of research on the topic. I recommend

    A Meticulous Analysis of Mergesort Programs
    Jyrki Katajainen, Jesper Larsson Traff

for a careful accounting of all operations that go into one of these beasts. They got the best results (and much better than quicksort) out of a 4-way bottom-up mergesort via very tedious code (e.g., it effectively represents which input run currently has the smallest next key via the program counter, by way of massive code duplication and oodles of gotos); they were afraid to write actual code for an 8-way version <wink>. OTOH, they were sorting random integers, and, e.g., were delighted to increase the # of comparisons when that could save a few other "dirt cheap" operations.
... My thinking is that many machines (probably yours for instance) have a cache that is 4-way associative, so merging only 2 blocks at a time might not be using the cache as well as it could. Also, changing from merging 2 blocks to 3 or 4 blocks at a time would change the number of passes you have to make (the log part of N*log(N)).
It's quite possible that this isn't worth the trade off in complexity and space (or your time :-).
The real reason it's uninteresting to me is that it has no clear applicability to the cases this sort aims at: exploiting significant pre-existing order of various kinds. That leads to unbalanced run lengths when we're lucky, and if I'm merging a 2-element run with a 100,000-element run, high cache associativity isn't of much use. From the timings I showed before, it's clear that "good cases" of pre-existing order take time that depends almost entirely on just the number of comparisons needed; e.g., 3sort and +sort were as fast as /sort, where the latter does nothing but N-1 comparisons in a single left-to-right scan of the array. Comparisons are expensive enough in Python that doing O(log N) additional comparisons in 3sort and +sort, then moving massive amounts of the array around to fit the oddballs in place, costs almost nothing more in percentage terms. Since these cases are already effectively as fast as a single left-to-right scan, there's simply no potential remaining for significant gain (unless you can speed a single left-to-right scan! that would be way cool).

If you think you can write a sort for random Python arrays faster than the samplesort hybrid, be my guest: I'd love to see it! You should be aware that I've been making this challenge for years <wink>.

Something to note: I think you have an overly simple view of Python's lists in mind. When we're merging two runs in the timing test, it's not *just* the list memory that's getting scanned. The lists contain pointers *to* float objects. The float objects have to get read up from memory too, and there goes the rest of your 4-way associativity. Indeed, if you read the comments in Lib/test/sortperf.py, you'll find that it performs horrid trickery to ensure that ~sort and =sort work on physically distinct float objects; way back when, these particular tests ran much faster, and that turned out to be partly because, e.g., [0.5] * N constructs a list with N pointers to a single float object, and that was much easier on the memory system. We got a really nice slowdown <wink> by forcing N distinct copies of 0.5. In earlier Pythons the comparison also got short-circuited by an early pointer-equality test ("if they're the same object, they must be equal"), but that's not done anymore. A quick run just now showed that =sort still runs significantly quicker if given a list of identical objects; the only explanation left for that appears to be cache effects.
Keeping track of comparisons that you've already made could get ugly,
Most researchers have found that a fancy data structure for this is counter-productive: so long as the m in m-way merging isn't ridiculously large, keeping the head entries in a straight vector with m elements runs fastest. But they're not worried about Python's expensive-comparison case. External sorts using m-way merging with large m typically use a selection tree much like a heap to reduce the expense of keeping track (see, e.g., Knuth for details).
and your temp space requirement would go from N/2 to possibly 3N/4... But since you're diving so deeply into this problem, I figured I'd throw it out there.
OTOH, this could be close to the speedup that heavily optimized FFT algs get when they go from radix-2 to radix-4. Just thinking out loud...
I don't think that's comparable. Moving to radix 4 cuts the total number of non-trivial complex multiplies an FFT has to do, and non-trivial complex multiplies are the expensive part of what an FFT does. In contrast, boosting the m in m-way merging doesn't cut the number of comparisons needed at all (to the contrary, if you're not very careful it increases them), and comparisons are what kill sorting routines in Python. The elaborate gimmicks in timsort for doing merges of unbalanced runs do cut the total number of comparisons needed, and that's where the huge wins come from.
--- Tim Peters
"A Meticulous Analysis of Mergesort Programs" Jyrki Katajainen, Jesper Larsson Traff
Thanks for the cool reference. I read a bit of it last night. I ought to know by now that there really isn't much new under the sun...
The real reason it's uninteresting to me is that it has no clear applicability to the cases this sort aims at: exploiting significant pre-existing order of various kinds. [...] (unless you can speed a single left-to-right scan! that would be way cool).
Do a few well placed prefetch instructions buy you anything? The MMU could be grabbing your next pointer while you're doing your current comparison. And of course you could implement it as a macro that evaporates for whatever platforms you didn't care to implement it on. (I need to look it up, but I'm pretty sure you could do this for both VC++ and gcc on recent x86s.)
If you think you can write a sort for random Python arrays faster than the samplesort hybrid, be my guest: I'd love to see it! You should be aware that I've been making this challenge for years <wink>.
You're remarkably good at taunting me. :-) I've spent a little time on a few of these optimization challenges that get posted. One of these days I'll best you... (not this day though)
Something to note: I think you have an overly simple view of Python's lists in mind.
No, I think I understand the model. I just assumed the objects pointed to would be scattered pretty randomly through memory. So statistically they'll step on the same cache lines as your list once in a while, but that it would average out to being less interesting than the adjacent slots in the list. I'm frequently wrong about stuff like this though...

Cheers,
-Scott
On Mon, Jul 22, 2002, Tim Peters wrote:
In an effort to save time on email (ya, right ...), I wrote up a pretty detailed overview of the "timsort" algorithm. It's attached.
It seems pretty clear by now that the new mergesort is going to replace samplesort, but since nobody else has said this, I figured I'd add one more comment: I actually understood your description of the new mergesort algorithm. Unless you can come up with similar docs for samplesort, the Beer Truck scenario dictates that mergesort be the new gold standard. Or to quote Tim Peters, "Complex is better than complicated." -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
[Aahz]
... It seems pretty clear by now that the new mergesort is going to replace samplesort, but since nobody else has said this, I figured I'd add one more comment:
I actually understood your description of the new mergesort algorithm. Unless you can come up with similar docs for samplesort, the Beer Truck scenario dictates that mergesort be the new gold standard.
Or to quote Tim Peters, "Complex is better than complicated."
Good observation! I wish I'd thought of that <wink>. The mergesort is more complex, but it doesn't have so many fiddly little complications obscuring it. There's an extensive description of the samplesort hybrid in listobject.c's comments, but you know you're in Complication Heaven when you've got to document half a dozen distinct "tuning macros" in hand-wavy terms. The only tuning parameter in the mergesort is MIN_GALLOP, and the tradeoff it makes is explainable.
Quick update. I left off here:

samplesort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.13   0.01   0.01   0.10   0.04   0.01   0.11
16   65536   0.24   0.02   0.02   0.23   0.08   0.02   0.24
17  131072   0.54   0.05   0.04   0.49   0.18   0.04   0.53
18  262144   1.18   0.09   0.09   1.08   0.37   0.09   1.16
19  524288   2.58   0.19   0.18   2.34   0.76   0.17   2.52
20 1048576   5.58   0.37   0.36   5.12   1.54   0.35   5.46

timsort
15   32768   0.16   0.01   0.02   0.05   0.14   0.01   0.02
16   65536   0.24   0.02   0.02   0.06   0.19   0.02   0.04
17  131072   0.55   0.04   0.04   0.13   0.42   0.04   0.09
18  262144   1.19   0.09   0.09   0.25   0.91   0.09   0.18
19  524288   2.60   0.18   0.18   0.46   1.97   0.18   0.37
20 1048576   5.61   0.37   0.35   1.00   4.26   0.35   0.74

With a lot of complication (albeit principled complication), timsort now looks like

15   32768   0.14   0.01   0.01   0.04   0.10   0.01   0.02
16   65536   0.24   0.02   0.02   0.05   0.17   0.02   0.04
17  131072   0.54   0.05   0.04   0.13   0.38   0.04   0.09
18  262144   1.18   0.09   0.09   0.24   0.81   0.09   0.18
19  524288   2.57   0.18   0.18   0.46   1.77   0.18   0.37
20 1048576   5.55   0.37   0.35   0.99   3.81   0.35   0.74

on the same data (tiny improvements in *sort and 3sort, significant improvement in ~sort, huge improvements for some patterns that aren't touched by this test).

For contrast and a sanity check, I also implemented Edelkamp and Stiegeler's "Next-to-m" refinement of weak heapsort. If you know what heapsort is, this is weaker <wink>. In the last decade, Dutton had the bright idea that a heap is stronger than you need for sorting: it's enough if you know only that a parent node's value dominates the right child's values, and then ensure that the root node has no left child. That implies the root node has the maximum value in the (weak) heap. It doesn't matter what's in the left child for the other nodes, provided only that they're weak heaps too. The weaker requirements allow faster (but trickier) code for maintaining the weak-heap invariant as sorting proceeds, and in particular it requires far fewer element comparisons than a (strong) heapsort.

Edelkamp and Stiegeler complicated this algorithm in several ways to cut the comparisons even more. I stopped at their first refinement, which does a worst-case number of comparisons of

    N*k - 2**k + N - 2*k   where k = ceiling(log2(N))

so that even the worst case is very good. They have other gimmicks to cut it more (we're close to the theoretical limit here, so don't read too much into "more"!), but the first refinement proved so far from being promising that I dropped it:

weakheapsort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.19   0.12   0.11   0.11   0.11   0.11   0.12
16   65536   0.31   0.26   0.23   0.23   0.24   0.23   0.26
17  131072   0.71   0.55   0.49   0.49   0.51   0.48   0.56
18  262144   1.59   1.15   1.03   1.04   1.08   1.02   1.19
19  524288   3.57   2.43   2.18   2.18   2.27   2.14   2.51
20 1048576   8.01   5.08   4.57   4.58   4.77   4.50   5.29

The number of compares isn't the problem with this. The problem appears to be heapsort's poor cache behavior, leaping around via multiplying and dividing indices by 2. This is exacerbated in weak heapsort because it also requires allocating a bit vector, to attach a "which of my children should I think of as being 'the right child'?" flag to each element, and that also gets accessed in the same kinds of cache-hostile ways at the same time. The samplesort and mergesort variants access memory sequentially.
What I haven't accounted for is why weakheapsort appears to get a major benefit from *any* kind of regularity in the input -- *sort is always the worst case on each line, and by far (note that this implementation does no special-casing of any kind, so it must be an emergent property of the core algorithm). If I were a researcher, I bet I could get a good paper out of that <wink>.
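Taking the worst-case formula quoted above at face value, here's a quick hedged check of how it compares with the information-theoretic floor ceiling(log2(N!)); lgamma gives log(N!) without computing the huge factorial, and the helper names are made up for illustration:

from math import ceil, lgamma, log

def log2_factorial(n):
    # ceiling(log2(n!)), the information-theoretic lower bound for distinct keys
    return int(ceil(lgamma(n + 1) / log(2)))

def weakheap_worst(n):
    # N*k - 2**k + N - 2*k with k = ceiling(log2(N)), as quoted above
    k = (n - 1).bit_length()
    return n * k - 2 ** k + n - 2 * k

for n in (2 ** 15, 2 ** 17, 2 ** 20):
    w, bound = weakheap_worst(n), log2_factorial(n)
    print(n, w, bound, "%.1f%% over the floor" % (100.0 * (w - bound) / bound))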
Just FYI. I ripped out the complications I added to the mergesort variant that tried to speed many-equal-keys cases, and worked on its core competency (intelligent merging) instead. There's a reason <wink>: this kick-started while investigating ways to speed Zope B-Tree operations when they're used as sets; equal keys are impossible in that context, but intelligent merging can really help. So whatever the fate of this sort, some of the code will live on in Zope's B-Tree routines.

The result is that non-trivial cases of near-order got a nice boost, while ~sort got even slower again.

I added a new test +sort, which replaces the last 10 values of a sorted array with random values. samplesort has a special case for this, limited to a maximum of 15 trailing out-of-order entries. timsort has no special case for this but does it significantly faster than the samplesort hack anyway, has no limit on how many such trailing entries it can exploit, and couldn't care less whether such entries are at the front or the end of the array; I expect it would be (just) a little slower if they were in the middle. As shown below, timsort does a +sort almost as fast as for a wholly-sorted array. Ditto now for 3sort too, which perturbs order by doing 3 random exchanges in a sorted array. It's become a very interesting sort implementation, handling more kinds of near-order at demonstrably supernatural speed than anything else I'm aware of.

~sort isn't an example of near-order. Quite the contrary, it has a number of inversions quadratic in N, and N/4 runs; the only reason ~sort goes faster than *sort now is -- believe it or not -- a surprising benefit from a memory optimization.

Key:
    *sort: random data
    \sort: descending data
    /sort: ascending data
    3sort: ascending, then 3 random exchanges
    +sort: ascending, then 10 random at the end
    ~sort: many duplicates
    =sort: all equal
    !sort: worst case scenario

C:\Code\python\PCbuild>python -O sortperf.py 15 20 1

samplesort
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.18   0.02   0.01   0.14   0.01   0.07   0.01   0.17
16   65536   0.24   0.02   0.02   0.22   0.02   0.08   0.02   0.24
17  131072   0.53   0.05   0.04   0.49   0.05   0.18   0.04   0.52
18  262144   1.16   0.09   0.09   1.06   0.12   0.37   0.09   1.13
19  524288   2.53   0.18   0.17   2.30   0.24   0.74   0.17   2.47
20 1048576   5.47   0.37   0.35   5.17   0.45   1.51   0.35   5.34

timsort
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.17   0.01   0.01   0.01   0.01   0.14   0.01   0.02
16   65536   0.23   0.02   0.02   0.03   0.02   0.21   0.03   0.04
17  131072   0.53   0.04   0.04   0.05   0.04   0.46   0.04   0.09
18  262144   1.16   0.09   0.09   0.12   0.09   1.01   0.08   0.18
19  524288   2.53   0.18   0.17   0.18   0.18   2.20   0.17   0.36
20 1048576   5.48   0.36   0.35   0.36   0.37   4.78   0.35   0.73
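For anyone wanting to play along without digging up sortperf.py, here is a hedged sketch of the input shapes keyed above. These are approximations reconstructed from the descriptions, not sortperf.py's actual code; !sort is omitted because its construction isn't spelled out in the thread:

import random

def make_inputs(n, seed=12345):
    # Rough re-creations of the test patterns, guessed from the descriptions.
    rnd = random.Random(seed)
    star = [rnd.random() for i in range(n)]                 # *sort: random
    slash = sorted(star)                                    # /sort: ascending
    back = slash[::-1]                                      # \sort: descending
    three = slash[:]                                        # 3sort: 3 random exchanges
    for swap in range(3):
        i, j = rnd.randrange(n), rnd.randrange(n)
        three[i], three[j] = three[j], three[i]
    plus = slash[:-10] + [rnd.random() for i in range(10)]  # +sort: 10 random at end
    tilde = [float(rnd.randrange(4)) for i in range(n)]     # ~sort: 4 distinct values
    equal = [0.5 + 0.0 * i for i in range(n)]                # =sort: equal values in
                                                             #   distinct float objects
    return star, back, slash, three, plus, tilde, equal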
One more piece of this puzzle. It's possible that one of {samplesort, timsort} would become unboundedly faster as the cost of comparisons increased over that of Python floats (which all the timings I posted used). Here's a program that would show this if so, using my local Python, where lists have an .msort() method:

"""
class SlowCmp(object):
    __slots__ = ['val']
    def __init__(self, val):
        self.val = val
    def __lt__(self, other):
        for i in range(SLOW):
            i*i
        return self.val < other.val

def drive(n):
    from random import randrange
    from time import clock as now
    n10 = n * 10
    L = [SlowCmp(randrange(n10)) for i in xrange(n)]
    L2 = L[:]
    t1 = now()
    L.sort()
    t2 = now()
    L2.msort()
    t3 = now()
    return t2-t1, t3-t2

for SLOW in 1, 2, 4, 8, 16, 32, 64, 128:
    print "At SLOW value", SLOW
    for n in range(1000, 10001, 1000):
        ss, ms = drive(n)
        print "    %6d %6.2f %6.2f %6.2f" % (
            n, ss, ms, 100.0*(ss - ms)/ms)
"""

Here's the tail end of the output, from which I conclude that the number of comparisons done on random inputs is virtually identical for the two methods; times vary by a fraction of a percent both ways, with no apparent pattern (note that time.clock() has better than microsecond resolution on Windows, so the times going into the % calculation have more digits than are displayed here):

At SLOW value 32
      1000   0.22   0.22  -0.05
      2000   0.50   0.50   0.10
      3000   0.80   0.80  -0.64
      4000   1.11   1.10   0.71
      5000   1.44   1.45  -0.12
      6000   1.77   1.76   0.72
      7000   2.10   2.09   0.31
      8000   2.43   2.41   0.79
      9000   2.78   2.80  -0.58
     10000   3.13   3.13  -0.01
At SLOW value 64
      1000   0.37   0.38  -1.00
      2000   0.83   0.83   0.20
      3000   1.33   1.33  -0.15
      4000   1.84   1.84   0.05
      5000   2.40   2.39   0.38
      6000   2.95   2.92   0.97
      7000   3.46   3.47  -0.20
      8000   4.04   4.01   0.87
      9000   4.60   4.63  -0.68
     10000   5.19   5.21  -0.33
At SLOW value 128
      1000   0.68   0.67   0.37
      2000   1.52   1.50   0.99
      3000   2.40   2.41  -0.67
      4000   3.35   3.32   1.03
      5000   4.30   4.32  -0.47
      6000   5.32   5.29   0.54
      7000   6.27   6.27   0.04
      8000   7.29   7.25   0.55
      9000   8.37   8.37  -0.03
     10000   9.39   9.43  -0.49
FYI, I've been poking at this in the background. The ~sort regression is vastly reduced, via removing special-casing and adding more general adaptivity (if you read the timsort.txt file, the special case for run lengths within a factor of 2 of each other went away, replaced by a more intelligent mix of one-pair-at-a-time versus galloping modes). *sort lost about 1% as a result (one-pair-at-a-time is maximally effective for *sort, but in a random mix every now and again the "switch to the less efficient (for it) galloping mode" heuristic triggers by blind luck).

There's also a significant systematic regression in timsort's +sort case, although it remains faster (and much more general) than samplesort's special-casing of it; also a mix of small regressions and speedups in 3sort. These are because, to simplify experimenting, I threw out the "copy only the shorter run" gimmick, always copying the left run instead. That hurts +sort systematically, as instead of copying just the 10 oddball elements at the end, it copies the very long run of N-10 elements instead (and as many as N-1 temp pointers can be needed, up from N/2). That's all repairable, it's just a PITA to do it.

C:\Code\python\PCbuild>python -O sortperf.py 15 20 1

samplesort
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.18   0.01   0.02   0.11   0.01   0.04   0.01   0.11
16   65536   0.24   0.02   0.02   0.25   0.02   0.08   0.02   0.24
17  131072   0.53   0.05   0.04   0.49   0.05   0.18   0.04   0.52
18  262144   1.16   0.09   0.09   1.06   0.12   0.37   0.09   1.14
19  524288   2.53   0.18   0.17   2.30   0.24   0.75   0.17   2.47
20 1048576   5.48   0.37   0.35   5.18   0.45   1.52   0.35   5.35

timsort
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.17   0.01   0.02   0.01   0.01   0.05   0.01   0.02
16   65536   0.24   0.02   0.02   0.02   0.02   0.09   0.02   0.04
17  131072   0.54   0.05   0.04   0.05   0.05   0.19   0.04   0.09
18  262144   1.17   0.09   0.09   0.10   0.10   0.38   0.09   0.18
19  524288   2.56   0.18   0.17   0.20   0.20   0.79   0.17   0.36
20 1048576   5.54   0.37   0.35   0.37   0.41   1.62   0.35   0.73

In short, there's no real "speed argument" against this anymore (as I said in the first msg of this thread, the ~sort regression was serious -- it's an important case; turns out galloping is very effective at speeding it too, provided that dumbass premature special-casing doesn't stop galloping from trying <wink>).
On Wed, 24 Jul 2002, Tim Peters wrote:
In short, there's no real "speed argument" against this anymore (as I said in the first msg of this thread, the ~sort regression was serious -- it's an important case; turns out galloping is very effective at speeding it too, provided that dumbass premature special-casing doesn't stop galloping from trying <wink>).
This is fantastic work, Tim. I'm all for switching over to timsort as the one standard sort method. -- ?!ng "Most things are, in fact, slippery slopes. And if you start backing off from one thing because it's a slippery slope, who knows where you'll stop?" -- Sean M. Burke
[Tim]
... There's also a significant systematic regression in timsort's +sort case, ... also a mix of small regressions and speedups in 3sort. These are because, to simplify experimenting, ...(and as many as N-1 temp pointers can be needed, up from N/2). That's all repairable, it's just a PITA to do it.
It's repaired, and those glitches went away:
timsort
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.17   0.01   0.02   0.01   0.01   0.05   0.01   0.02
16   65536   0.24   0.02   0.02   0.02   0.02   0.09   0.02   0.04
17  131072   0.54   0.05   0.04   0.05   0.05   0.19   0.04   0.09
18  262144   1.17   0.09   0.09   0.10   0.10   0.38   0.09   0.18
19  524288   2.56   0.18   0.17   0.20   0.20   0.79   0.17   0.36
20 1048576   5.54   0.37   0.35   0.37   0.41   1.62   0.35   0.73
Now at

 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.17   0.01   0.01   0.01   0.02   0.09   0.01   0.03
16   65536   0.24   0.02   0.02   0.02   0.02   0.09   0.02   0.04
17  131072   0.53   0.05   0.04   0.05   0.05   0.18   0.04   0.09
18  262144   1.17   0.09   0.09   0.10   0.09   0.38   0.09   0.18
19  524288   2.56   0.18   0.18   0.19   0.19   0.78   0.17   0.36
20 1048576   5.53   0.37   0.35   0.36   0.37   1.60   0.35   0.74

In other news, an elf revealed that Perl is moving to an adaptive stable mergesort (!!!harmonic convergence!!!), and sent some cleaned-up source code. The comments reference a non-existent paper, but if I change the title and the year I find it here:

    Optimistic sorting and information theoretic complexity
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms),
    pp 467-474, Austin, Texas, 25-27 January 1993

Jeremy got that for me, and it's an extremely relevant paper. What I've been calling galloping he called "exponential search", and the paper has some great analysis, pretty much thoroughly characterizing the set of permutations for which this kind of approach is helpful, and even optimal. It's a large set <wink>.

Amazingly, citeseer finds only one reference to this paper, also from 1993, despite all the work done on adaptive sorting since then. So it's either essentially unknown in the research community, was shot full of holes (but then people would have delighted in citing it just to rub that in <0.5 wink>), or was quickly superseded by a better result (but then ditto!). I'll leave that a mystery.

I haven't had time yet to study the Perl code. The timsort algorithm is clearly more frugal with memory: worst-case N/2 temp pointers needed, and, e.g., in +sort it only needs (at most) 10 temp pointers (independent of N). That may or may not be good, though, depending on whether the Perl algorithm makes more effective use of the memory hierarchy; offhand I don't think it does. OTOH, timsort has 4 flavors of galloping and 2 flavors of binary search and 2 merge routines, because the memory-saving gimmick can require merging "from the left" or "from the right", depending on which run is smaller. Doubling the number of helper routines is what "PITA" meant in the quote at the start <wink>.

One more bit of news: cross-box performance of this stuff is baffling. Nobody else has tried timsort yet (unless someone who asked for the code tried an earlier version), but there are Many Mysteries just looking at the numbers for /sort under current CVS Python. Recall that /sort is the case where the data is already sorted: it does N-1 compares in one scan, and that's all. For an array with 2**20 distinct floats that takes 0.35 seconds on my Win98SE 866MHz Pentium box, compiled w/ MSVC6. On my Win2K 866MHz Pentium box, compiled w/ MSVC6, it takes 0.58(!) seconds, and indeed all the sort tests take incredibly much longer on the Win2K box. On Fred's faster Pentium box (I forget exactly how fast, >900MHz and <1GHz), using gcc, the sort tests take a lot less time than on my Win2K box, but my Win98SE box is still faster.

Another Mystery (still with the current samplesort): on Win98SE, !sort is always a bit faster than *sort. On Win2K and on Fred's box, it's always a bit slower. I'm leaving that a mystery too. I haven't tried timsort on another box yet, and given that my home machine may be supernaturally fast, I'm never going to <wink>.
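To make the "copy only the shorter run" gimmick concrete, here is a hedged sketch of the merge-from-the-left half of it -- just the shape of the idea, not the actual C, which also gallops and has a mirror-image routine for merging from the right when the right run is the shorter one:

def merge_lo(a, lo, mid, hi):
    # Merge sorted runs a[lo:mid] and a[mid:hi] in place, copying only the
    # left run to temp storage.  The strict < means ties come from the left
    # run first, which is what makes the merge stable.
    left = a[lo:mid]                   # the temp pointers being counted above
    i, j, dest = 0, mid, lo
    while i < len(left) and j < hi:
        if a[j] < left[i]:
            a[dest] = a[j]
            j += 1
        else:
            a[dest] = left[i]
            i += 1
        dest += 1
    if i < len(left):                  # right run exhausted first: the rest of
        a[dest:hi] = left[i:]          # the left run slots in at the end
    # (if the left run ran out first, the rest of the right run is already
    # sitting in its final position)

data = [1, 4, 9, 2, 2, 7]
merge_lo(data, 0, 3, 6)
print(data)                            # [1, 2, 2, 4, 7, 9]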
Tim Peters wrote:
One more bit of news: cross-box performance of this stuff is baffling. Nobody else has tried timsort yet (unless someone who asked for the code tried an earlier version), but there are Many Mysteries just looking at the numbers for /sort under current CVS Python. Recall that /sort is the case where the data is already sorted: it does N-1 compares in one scan, and that's all. For an array with 2**20 distinct floats that takes 0.35 seconds on my Win98SE 866MHz Pentium box, compiled w/ MSVC6. On my Win2K 866MHz Pentium box, compiled w/ MSVC6, it takes 0.58(!) seconds, and indeed all the sort tests take incredibly much longer on the Win2K box. On Fred's faster Pentium box (I forget exactly how fast, >900MHz and <1GHz), using gcc, the sort tests take a lot less time than on my Win2K box, but my Win98SE box is still faster.
Another Mystery (still with the current samplesort): on Win98SE, !sort is always a bit faster than *sort. On Win2K and on Fred's box, it's always a bit slower. I'm leaving that a mystery too. I haven't tried timsort on another box yet, and given that my home machine may be supernaturally fast, I'm never going to <wink>.
I can give it a go on my AMD boxes if you send me the code. They tend to show surprising results as you know :-)

-- Marc-Andre Lemburg
On Thu, 25 Jul 2002, Tim Peters wrote:
One more bit of news: cross-box performance of this stuff is baffling.
I'll run tests on the P4 Xeon, Alpha (21164A, 21264), AMD Elan 520, and maybe a few Sparcs, and whatever else I can get my hands on. Just let me know where I can snag the code + test script.

-Kevin
Apart from fine-tuning and rewriting the doc file, I think the mergesort is done. I'm confident that if any bugs remain, I haven't seen them <wink>. A patch against current CVS listobject.c is here: http://www.python.org/sf/587076 Simple instructions for timing exactly the same data I've posted times against are in the patch description (you already have sortperf.py -- it's in Lib/test). This patch doesn't replace samplesort, it adds a new .msort() method, to make comparative timings easier. It also adds an .hsort() method for weak heapsort, because I forgot to delete that code after I gave up on it <wink>. X-platform samplesort timings are interesting as well as samplesort versus mergesort timings. Timings against "real life" sort jobs are especially interesting. Attaching results to the bug report sounds like a good idea to me, so we get a coherent record in one place. Thanks in advance!
Tim Peters wrote:
Apart from fine-tuning and rewriting the doc file, I think the mergesort is done. I'm confident that if any bugs remain, I haven't seen them <wink>. A patch against current CVS listobject.c is here:
http://www.python.org/sf/587076
Simple instructions for timing exactly the same data I've posted times against are in the patch description (you already have sortperf.py -- it's in Lib/test). This patch doesn't replace samplesort, it adds a new .msort() method, to make comparative timings easier. It also adds an .hsort() method for weak heapsort, because I forgot to delete that code after I gave up on it <wink>.
X-platform samplesort timings are interesting as well as samplesort versus mergesort timings. Timings against "real life" sort jobs are especially interesting. Attaching results to the bug report sounds like a good idea to me, so we get a coherent record in one place.
Thanks in advance!
Here's the result for AMD Athlon 1.2GHz/Linux/gcc:

Python/Tim-Python> ./python -O Lib/test/sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.07   0.00   0.01   0.09   0.01   0.03   0.01   0.08
16   65536   0.18   0.02   0.02   0.19   0.03   0.07   0.02   0.20
17  131072   0.43   0.05   0.04   0.46   0.05   0.18   0.05   0.48
18  262144   0.99   0.09   0.10   1.04   0.13   0.40   0.09   1.11
19  524288   2.23   0.19   0.21   2.32   0.24   0.83   0.20   2.46
20 1048576   4.96   0.40   0.40   5.41   0.47   1.72   0.40   5.46

without patch:

Python/Tim-Python> ./python -O Lib/test/sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.08   0.01   0.01   0.09   0.01   0.03   0.00   0.09
16   65536   0.20   0.02   0.01   0.20   0.03   0.07   0.02   0.20
17  131072   0.46   0.06   0.02   0.45   0.05   0.20   0.04   0.49
18  262144   0.99   0.09   0.10   1.09   0.11   0.40   0.12   1.12
19  524288   2.33   0.20   0.20   2.30   0.24   0.83   0.19   2.47
20 1048576   4.89   0.40   0.41   5.37   0.48   1.71   0.38   6.22

-- Marc-Andre Lemburg
[MAL]
Here's the result for AMD Athlon 1.2GHz/Linux/gcc:
Python/Tim-Python> ./python -O Lib/test/sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.07   0.00   0.01   0.09   0.01   0.03   0.01   0.08
16   65536   0.18   0.02   0.02   0.19   0.03   0.07   0.02   0.20
17  131072   0.43   0.05   0.04   0.46   0.05   0.18   0.05   0.48
18  262144   0.99   0.09   0.10   1.04   0.13   0.40   0.09   1.11
19  524288   2.23   0.19   0.21   2.32   0.24   0.83   0.20   2.46
20 1048576   4.96   0.40   0.40   5.41   0.47   1.72   0.40   5.46

without patch:

Python/Tim-Python> ./python -O Lib/test/sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.08   0.01   0.01   0.09   0.01   0.03   0.00   0.09
16   65536   0.20   0.02   0.01   0.20   0.03   0.07   0.02   0.20
17  131072   0.46   0.06   0.02   0.45   0.05   0.20   0.04   0.49
18  262144   0.99   0.09   0.10   1.09   0.11   0.40   0.12   1.12
19  524288   2.33   0.20   0.20   2.30   0.24   0.83   0.19   2.47
20 1048576   4.89   0.40   0.41   5.37   0.48   1.71   0.38   6.22
I assume you didn't read the instructions in the patch description: http://www.python.org/sf/587076 The patch doesn't change anything about how list.sort() works, so what you've shown us is the timing variance on your box across two identical runs. To time the new routine, you need to (temporarily) change L.sort() to L.msort() in sortperf.py's doit() function. It's a one-character change, but an important one <wink>.
Tim Peters wrote:
[MAL]
Here's the result for AMD Athlon 1.2GHz/Linux/gcc:
without patch:
Python/Tim-Python> ./python -O Lib/test/sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.08   0.01   0.01   0.09   0.01   0.03   0.00   0.09
16   65536   0.20   0.02   0.01   0.20   0.03   0.07   0.02   0.20
17  131072   0.46   0.06   0.02   0.45   0.05   0.20   0.04   0.49
18  262144   0.99   0.09   0.10   1.09   0.11   0.40   0.12   1.12
19  524288   2.33   0.20   0.20   2.30   0.24   0.83   0.19   2.47
20 1048576   4.89   0.40   0.41   5.37   0.48   1.71   0.38   6.22
I assume you didn't read the instructions in the patch description:
http://www.python.org/sf/587076
The patch doesn't change anything about how list.sort() works, so what you've shown us is the timing variance on your box across two identical runs. To time the new routine, you need to (temporarily) change L.sort() to L.msort() in sortperf.py's doit() function. It's a one-character change, but an important one <wink>.
Dang. Why don't you distribute a ZIP file which can be dumped onto the standard Python installation ?

Here's the .msort() version:

Python/Tim-Python> ./python -O sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.08   0.01   0.01   0.01   0.01   0.03   0.00   0.02
16   65536   0.17   0.02   0.02   0.02   0.02   0.07   0.02   0.06
17  131072   0.41   0.05   0.04   0.05   0.04   0.16   0.04   0.09
18  262144   0.95   0.10   0.10   0.10   0.10   0.33   0.10   0.20
19  524288   2.17   0.20   0.21   0.20   0.21   0.66   0.20   0.44
20 1048576   4.85   0.42   0.40   0.41   0.41   1.37   0.41   0.84

-- Marc-Andre Lemburg
[MAL]
Dang. Why don't you distribute a ZIP file which can be dumped onto the standard Python installation ?
A zip file containing what? And which "standard Python installation"? If someone is on Python-Dev but can't deal with a one-file patch against CVS, I'm not sure what to conclude, except that I don't want to deal with them at this point <wink>.
Here's the .msort() version:
Python/Tim-Python> ./python -O sortperf.py 15 20 1
 i    2**i  *sort  \sort  /sort  3sort  +sort  ~sort  =sort  !sort
15   32768   0.08   0.01   0.01   0.01   0.01   0.03   0.00   0.02
16   65536   0.17   0.02   0.02   0.02   0.02   0.07   0.02   0.06
17  131072   0.41   0.05   0.04   0.05   0.04   0.16   0.04   0.09
18  262144   0.95   0.10   0.10   0.10   0.10   0.33   0.10   0.20
19  524288   2.17   0.20   0.21   0.20   0.21   0.66   0.20   0.44
20 1048576   4.85   0.42   0.40   0.41   0.41   1.37   0.41   0.84
Thanks! That's more like it. So far I've got the only known box where ~sort is slower under msort (two other sets of timings were attached to the patch; I'll paste yours in too, merging in the smaller numbers from your first report).
Tim Peters wrote:
[MAL]
Dang. Why don't you distribute a ZIP file which can be dumped onto the standard Python installation ?
A zip file containing what? And which "standard Python installation"? If someone is on Python-Dev but can't deal with a one-file patch against CVS, I'm not sure what to conclude, except that I don't want to deal with them at this point <wink>.
Point taken ;-)

I meant something like this: Here's a ZIP file. To install take your standard Python CVS download, unzip it on top of it, then run

    echo "With .sort()"
    ./python -O sortperf.py 15 20 1
    echo "With .msort()"
    ./python -O timsortperf.py 15 20 1

-- Marc-Andre Lemburg
http://www.python.org/sf/587076 has collected timings on 5 boxes so far.

I also noted that msort() gets a 32% speedup on my box when sorting a 1.33-million line snapshot of the Python-Dev archive. This is a puzzler to account for, since you wouldn't think there's significant pre-existing lexicographic order in a file like that. McIlroy noted similar results from experiments on text, PostScript and C source files in his adaptive mergesort (which is why I tried sorting Python-Dev to begin with), but didn't offer a hypothesis.

Performance across platforms is a hoot so far, with Neal's box even seeing a ~6% speedup on *sort. Skip's Pentium III acts most like my Pentium III, which shouldn't be surprising. Ours are the only reports where !sort is faster than *sort for samplesort, and also where ~sort under samplesort is faster than ~sort under timsort.

~sort (only 4 distinct values, repeated N/4 times) remains the most puzzling of the tests by far. Relative to its performance under samplesort,

    sf userid   ~sort speedup under timsort (negative means slower)
    ---------   ---------------------------------------------------
    montanaro   -23%
    tim_one     - 6%
    jacobs99    +18%
    lemburg     +25%
    nascheme    +30%

Maybe it's a big win for AMD boxes, and a mixed bag for Intel boxes. Or maybe it's a win for newer boxes, and a loss for older boxes. Or maybe it's a bigger win the higher the clock rate (it hurt the most on the slowest box, and helped the most on the fastest). Since it ends up doing a sequence of perfectly balanced merges from start to finish, I thought perhaps it has to do with OS and/or chip intelligence in read-ahead cache optimizations -- but *sort also ends up doing a sequence of perfectly balanced merges, and doesn't behave at all like ~sort across boxes. ~sort does exercise the galloping code much more than other tests (*sort rarely gets into galloping mode; ~sort never gets out of galloping mode), so maybe it really has most to do with cache design.

Whatever, it's starting to look like a no-brainer -- except for the extremely mixed ~sort results, the numbers so far are great.
Tim> Skip's Pentium III acts most like my Pentium III, which shouldn't
Tim> be surprising.
...
Tim> sf userid   ~sort speedup under timsort (negative means slower)
Tim> ---------   ---------------------------------------------------
Tim> montanaro   -23%
Tim> tim_one     - 6%
Tim> jacobs99    +18%
Tim> lemburg     +25%
Tim> nascheme    +30%

I should point out that my PIII is in a laptop. I don't know if it's a so-called mobile Pentium or not. /proc/cpuinfo reports:

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 8
    model name      : Pentium III (Coppermine)
    stepping        : 1
    cpu MHz         : 451.030
    cache size      : 256 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca
                      cmov pat pse36 mmx fxsr sse
    bogomips        : 897.84

It also has separate 16KB L1 I and D caches. From what I was able to glean from a quick glance at a Katmai vs. Coppermine article, the Coppermine's L2 cache is full-speed, on-chip, with a 256-bit wide connection and 8-way set associative cache. Does any of that help explain why my results are similar to Tim's?

Skip
[Tim]
... I also noted that msort() gets a 32% speedup on my box when sorting a 1.33-million line snapshot of the Python-Dev archive. This is a puzzler to account for, since you wouldn't think there's significant pre-existing lexicographic order in a file like that. McIlroy noted similar results from experiments on text, PostScript and C source files in his adaptive mergesort (which is why I tried sorting Python-Dev to begin with), but didn't offer a hypothesis.
Just a note to clarify what "the puzzle" here is. msort() may or may not be faster than sort() on random input on a given box due to platform quirks, but that isn't relevant in this case. What McIlroy noted is that the total # of compares done in these cases was significantly less than log2(N!). That just can't happen (except with vanishingly small probability) if the input data is randomly ordered, under any comparison-based sorting method.

The only hypothesis I have is that, for a stable sort, all the instances of a given element are, by definition of stability, already "in sorted order". So, e.g., "\n" is a popular line in text files, and all the occurrences of "\n" are already sorted. msort can exploit that -- and seemingly does. This doesn't necessarily contradict that ~sort happens to run slower on my box under msort, because ~sort is such an extreme case.

OK, if I remove all but the first occurrence of each unique line, the # of lines drops to about 600,000. The speedup msort enjoys also drops, to 6.8%. So exploiting duplicates does appear to account for the bulk of it, but not all of it. If, after removing duplicates, I run that through random.shuffle() before sorting, msort suffers an 8% slowdown(!) relative to samplesort. If I shuffle first but don't remove duplicates, msort enjoys a 10% speedup.

So it's clear that msort is getting a significant advantage out of the duplicates, but it's not at all clear what else it's exploiting -- only that there is something else, and that it's significant. Now how many times has someone posted an alphabetical list of Python's keywords <wink>?
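For anyone who wants to repeat the experiment described above on their own text, here is a hedged sketch: count comparisons while sorting the lines as-is, de-duplicated, and de-duplicated-then-shuffled, and compare each count against the log2(N!) floor for distinct keys. The file name is a stand-in (the Python-Dev snapshot isn't reproduced here), and today's list.sort() -- which is the mergesort -- plays the msort role:

import random
from math import lgamma, log

class Counted(str):
    compares = 0
    def __lt__(self, other):
        Counted.compares += 1
        return str.__lt__(self, other)

def compares_to_sort(lines):
    Counted.compares = 0
    data = [Counted(line) for line in lines]
    data.sort()                            # counts one __lt__ per comparison
    return Counted.compares

def log2_factorial(n):
    return lgamma(n + 1) / log(2)          # log2(n!), the distinct-keys floor

lines = open("python-dev-snapshot.txt").readlines()   # hypothetical file
unique = list(dict.fromkeys(lines))                    # first occurrence of each line
shuffled = unique[:]
random.shuffle(shuffled)

for name, data in [("as-is", lines), ("unique", unique), ("shuffled", shuffled)]:
    print(name, len(data), compares_to_sort(data), int(log2_factorial(len(data))))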
participants (9):
- Aahz
- barry@zope.com
- Ka-Ping Yee
- Kevin Jacobs
- M.-A. Lemburg
- Neil Schemenauer
- Scott Gilbert
- Skip Montanaro
- Tim Peters