Re: [Python-Dev] Tuning binary insertion sort algorithm in Timsort.

On 15-03-08, nha pham wrote:
I don't understand how this is an improvement, since with binary search the idea is that each comparison cuts the remaining list to search in half; i.e., each comparison yields one bit of information. Here, you're spending a comparison to cut the list to search at the element you just inserted, which is probably not right in the middle. If you miss the middle, you're getting on average less than a full bit of information from your comparison, so you're not reducing the remaining search space by as much as you would be if you just compared to the element in the middle of the list.
I have applied my idea to java.util.ComparableTimSort.sort() and tested it. The execution time is reduced by 2%-6% with arrays of random integers.
For all that, though, experiment trumps theory...
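To put a number on the point above about getting less than a full bit per comparison, here is a small illustrative snippet (Python 3, not part of the original exchange): it computes the expected information yielded by a comparison that splits the equally likely candidate positions unevenly.

    import math

    def bits_from_split(n_outcomes, left):
        # Expected information (in bits) from a comparison that sends `left`
        # of the equally likely outcomes one way and the rest the other way.
        p = left / n_outcomes
        h = 0.0
        for prob in (p, 1 - p):
            if prob:
                h -= prob * math.log2(prob)
        return h

    print(bits_from_split(256, 128))  # split at the middle:  1.0 bit
    print(bits_from_split(256, 32))   # lopsided split:       ~0.54 bits

On random input the element just inserted rarely sits at the midpoint of the remaining range, so the average information per comparison drops below one bit.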

Thank you for your comment. I admit that I did not think about the proof before. Here is my naive proof; I hope you can give me some comments:

=============
# This proof is empirical thinking and is not complete; it just gives a closer look. I hope someone can make it more mathematical. In the proof, we assume we are dealing with an array of unique values (none of them are equal). If values can be equal, a "lucky search" can happen and the comparison is obviously not fair.

Statement_1: With an array of size N or less, we need at most log2(N) comparisons to find a value (or a position, in case the search misses), using the binary search algorithm.

proof: This statement is trivial, and I believe someone out there has already proved it. We can check it by counting manually. Assume we have an array of 32 items: 32 => 16 => 8 => 4 => 2 => 1 (5 comparisons). How about 24 items (24 < 32): 24 => 12 => 6 => 3 => 2 => 1 (5 comparisons). OK, good enough. Let's take it on faith and move on.

Statement_2: If we divide an array into two parts, the more unbalanced the split, the more benefit we get from the binary search algorithm.

proof: Assume we have an array of 256 items.
case1: If we divide in the middle: 128 - 128. If we search on the left, it costs log2(128) = 7. If we search on the right, it also costs log2(128) = 7.
case2: If we divide unbalanced: 32 - 224. If we search on the left, it costs log2(32) = 5. If we search on the right, it costs at most 8 comparisons (based on Statement_1). You may not believe me, so let's count it by hand: 224 => 112 => 56 => 28 => 14 => 7 => 4 => 2 => 1. So if we search on the left, we win 2 comparisons compared to case1; if we search on the right, we lose 1 comparison compared to case1. I call this a "benefit".
case3: What if we divide even more unbalanced: 8 - 248. Searching on the left costs log2(8) = 3 comparisons. Searching on the right costs at most 8 comparisons. So if we search on the left, we win 4 comparisons; if we search on the right, we lose 1 comparison. That is "more benefit", isn't it?

Statement_3: Because we are using a random array, there is a 50-50 chance that next_X will be bigger or smaller than X.

Statement_4: Let N be the size of the sorted list, and let "index" be the position of X in the sorted list. Because the array is random, index has an equal chance of being at any position in the sorted list.

Statement_5: Now we build a model based on the previous statements. My idea costs 1 comparison (between X and next_X) to divide the array into two unbalanced parts. The old idea costs 1 comparison to divide the array into two balanced parts. Now let's see which one finds the position for next_X faster:
If index belongs to [N/4 to 3N/4]: we may lose 1 comparison, or we may not lose.
If index belongs to [N/8 to N/4] or [3N/4 to 7N/8]: we may lose 1 comparison, or we win 1 comparison.
If index belongs to [N/16 to N/8] or [7N/8 to 15N/16]: we may lose 1 comparison, or we win 2 comparisons.
If index belongs to [N/32 to N/16] or [15N/16 to 31N/32]: we may lose 1 comparison, or we win 3 comparisons.
If index belongs to [N/64 to N/32] or [31N/32 to 63N/64]: we may lose 1 comparison, or we win 4 comparisons.
... and so on.

Statement_6: Now we apply the model to a real example. Assume we already have a sorted list with 16 items, and we already know the "index" of X. We can think of it as a gambling game with 16 slots; in every slot we can only bet 1 dollar (Statement_4). From slot 5 to slot 12, we may lose 1, or we may not lose, 50-50 chance. So after a huge number of plays, probability tells us that we will lose (8 x 1)/2 = 4 dollars. For slots 3, 4, 13, and 14, we may lose 1 or win 1, so after a huge number of plays we won't lose or win anything. For slots 2 and 15, we may lose 1 or win 2, so after a huge number of plays we can win (2-1) x 2 = 2 dollars. For slots 1 and 16, we may lose 1 or win 3, so after a huge number of plays we can win 4 dollars. In total, after a huge number of plays, we win 4 + 2 + 0 - 4 = 2 dollars!

You can test with a sorted list of 32 or 64 items or any number you want; I believe the benefit is even bigger.

Conclusion: The unbalanced model gives us more benefit than the balanced model. That means with a big enough array, my idea gives more benefit than the old idea. I think the lottery companies already know about this. It is a shame that I do not know the mathematical principle behind this problem.

If I have something more, I will update my proof at: https://github.com/nhapq/Optimize_binary_insertion_sort/blob/master/proof.tx...
======

Thank you.
Nha Pham.

On Mon, Mar 9, 2015 at 10:39 AM, Isaac Schwabacher <ischwabacher@wisc.edu> wrote:
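For what it's worth, the model in Statement_5 can be checked numerically with a short script like the following (Python 3, an illustrative sketch that is not part of the original message; it encodes the uniform-position assumptions of Statements 3 and 4 and the left/right split of Statement_5):

    def search_cost(n, q):
        # Number of comparisons bisect_right needs to locate insertion
        # point q (0 <= q <= n) in a sorted list of length n.
        lo, hi, cost = 0, n, 0
        while lo < hi:
            mid = (lo + hi) // 2
            cost += 1
            if mid >= q:
                hi = mid
            else:
                lo = mid + 1
        return cost

    def average_costs(N):
        # Average comparisons to insert next_X into a sorted prefix of N
        # items: `old` searches the whole prefix; `new` first compares
        # next_X with X (at position p) and then searches only one side.
        old = new = 0.0
        for p in range(N):                 # position of X, assumed uniform
            for q in range(N + 1):         # insertion point of next_X
                old += search_cost(N, q)
                if q <= p:                 # next_X < X: search left of X
                    new += 1 + search_cost(p, q)
                else:                      # next_X > X: search right of X
                    new += 1 + search_cost(N - p - 1, q - p - 1)
        total = N * (N + 1)
        return old / total, new / total

    for N in (16, 32, 64, 256):
        print(N, average_costs(N))

The exhaustive permutation counts later in the thread measure the same question on actual sorts rather than on the model's assumptions.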

[nha pham <phqnha@gmail.com>]
Sorry for the quick message here. It's just a simple point where it will pay not to get off on a wrong foot ;-)

Correct: for an array of size N, binary search can require as many as ceiling(log2(N+1)) comparisons. That's because there are N+1 possible results for an array of size N. For example, for an array of size 3, [A, B, C], "the answer" may be "before A", "between A and B", "between B and C", or "after C". 3 elements, 3+1 = 4 possible results. log2(3) comparisons are not enough to distinguish among 4 results. Make it trivial, an array of length 1. Then 1 comparison is obviously necessary and sufficient in all cases. And, indeed, ceiling(log2(1+1)) = 1. log2(1) equals 0, too small.

For the rest, I haven't been able to understand your description or your pseudo-code. I'll try harder. Some things clearly aren't doing what you _intend_ them to do. For example, in your Python code, each time through the outer loop you're apparently trying to sort the next CHUNK elements, but you end up appending CHUNK+1 values to data2 (or data3). Or in this part:

    for i in range(low, high):
        x = data[i]
        if x >= data[i-1]:

the first time that loop is executed low == 0, and so i == 0 on the first iteration, and so the conditional is

    if x >= data[0-1]

That's referencing data[-1], which is the very last element in data - which has nothing to do with the CHUNK you're trying to sort at the time.

So there are a number of errors here, which makes it that much harder to sort out (pun intended <wink>) what you're trying to do. It would help you to add some asserts to verify your code is doing what you _hope_ it's doing. For example, add

    assert data2[low: high] == sorted(data[low: high])
    assert len(data2) == high

to the end of your `sample` loop, and similarly for data3 in your `new` loop. Until those asserts all pass, you're not testing code that's actually sorting correctly.

Repair the errors and you almost certainly won't find `new` running over 10 times faster than `sample` anymore. I don't know what you _will_ discover, though. If the code doesn't have to sort correctly, there are much simpler ways to make it run _very_ much faster ;-)
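The timing harness itself isn't reproduced in this thread, but a corrected `sample` loop with the suggested asserts might look roughly like the sketch below. The names data, data2, CHUNK, low, and high are taken from the discussion; the overall structure is a guess, not the original code.

    from bisect import bisect_right

    CHUNK = 100  # hypothetical chunk size; the real harness's value isn't shown

    def sample(data):
        # Plain binary insertion sort applied to each CHUNK of `data`,
        # accumulating the sorted chunks in data2, with the asserts added.
        data2 = []
        for low in range(0, len(data), CHUNK):
            high = min(low + CHUNK, len(data))
            data2.append(data[low])
            for i in range(low + 1, high):
                x = data[i]
                # search only within the current chunk's region of data2
                index = bisect_right(data2, x, low, len(data2))
                data2.insert(index, x)
            assert data2[low:high] == sorted(data[low:high])
            assert len(data2) == high
        return data2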

Thank you very much. I am very happy that I got a reply from Tim Peters.

You are correct, my mistake. The Python code should be:

    for i in range(low+1, high):  # because we already added data[low]
        x = data[i]
        if x >= data[i-1]:

After I fixed it, here is the result for a random array of 10^6 items:

    Old binsort: 1.3322
    New binsort: 1.0015
    ratio: 0.33

You are right, it is not ten times faster anymore. I will update the other results soon.

I do check the results of the two sorting methods many times to make sure they are the same. It is just that I do not know how to put an assert into the timeit.Timer class. I am pretty sure about this.

I will try to write the proof more clearly; sorry for the inconvenience.

Thank you very much.
Nha Pham.

On Mon, Mar 9, 2015 at 9:27 PM, Tim Peters <tim.peters@gmail.com> wrote:
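One common pattern for the timeit question (an illustrative sketch, not code from the thread): verify correctness once outside the timed region, then hand timeit.Timer a plain callable. Here `sort_fn` is assumed to take a list and return a new sorted list, like the oldsort/newsort functions that appear later in the thread.

    import random
    import timeit

    def check_and_time(sort_fn, n=10**5, repeat=3):
        data = [random.random() for _ in range(n)]
        # correctness check, done once and not timed
        assert sort_fn(list(data)) == sorted(data)
        # time the sort on fresh copies of the same data
        timer = timeit.Timer(lambda: sort_fn(list(data)))
        return min(timer.repeat(repeat=repeat, number=1))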

[nha pham <phqnha@gmail.com>]
Thank you very much. I am very happy that I got a reply from Tim Peters.
My pleasure to speak with you too :-)
`assert` is just another Python statement. You simply add it to the code - there's nothing tricky about this. You could, e.g., simply copy and paste the `assert`s I suggested last time.

Before you do, try adding `print index` to your inner loops, and make SIZE much smaller (say, 1000) so you're not overwhelmed with output. You'll be surprised by what you see on the second (and following) CHUNKs. For example, in both `sample` and `new` it will print 900 ninety-nine times in a row when doing the last CHUNK. The code still isn't doing what you intend. Until it does, timing it makes little sense :-)
I am pretty sure about this.
Note that I'm talking about the Python code here, the code you run through timeit. You cannot have checked the results of running _that_ specific code, because it doesn't work at all. You may have checked _other_ code many times. We may get to that later, but since I speak Python, I'm not going to understand what you're doing until we have Python code that works ;-)

OK - here's what the current binsort does, ignoring that it skips an already-sorted prefix (if any), and creating a new list instead of modifying in-place:

    def oldsort(a):
        from bisect import bisect_right as br
        assert a
        result = [a[0]]
        for i in range(1, len(a)):
            x = a[i]
            index = br(result, x)
            result.insert(index, x)
        return result

And here's my best guess as to what you _intend_ the new version to do. Please confirm that, or, if I'm guessing wrong, please give a Python function that _does_ implement your intent:

    def newsort(a):
        from bisect import bisect_right as br
        assert a
        oldx = a[0]
        result = [oldx]
        index = 0
        for i in range(1, len(a)):
            x = a[i]
            if x < oldx:
                index = br(result, x, 0, index)
            else:
                index = br(result, x, index + 1)
            result.insert(index, x)
            oldx = x
        return result

Now assuming that's right, I don't care about timing it ;-) The only basic question to me is whether it in fact reduces the number of comparisons. So here's an integer wrapper that bumps a global counter whenever it's asked to compare:

    class IntWrap(object):
        def __init__(self, i):
            self.i = i

        def __cmp__(a, b):
            global gncmp
            gncmp += 1
            return cmp(a.i, b.i)

        def __repr__(self):
            return repr(self.i)

Now we can build lists containing that, and get exact comparison counts. To start, for a given length `n`, this counts the total number of comparisons needed to sort all possible permutations of a list of length n, under both the old and new ways:

    def drive(n):
        import itertools
        global gncmp
        base = [IntWrap(i) for i in range(n)]
        oldcount = newcount = 0
        numperms = 0
        for p in itertools.permutations(base):
            numperms += 1
            gncmp = 0
            oldresult = oldsort(p)
            oldcount += gncmp
            gncmp = 0
            newresult = newsort(p)
            newcount += gncmp
            assert oldresult == newresult == base
        print 'n', n, 'perms', numperms
        print 'old compares', oldcount
        print 'new compares', newcount
        print 'diff %', (newcount - oldcount) * 100.0 / max(oldcount, 1)

And, finally, a tiny loop to drive it:

    for n in range(1, 13):
        print drive(n)

It's still running as I type this, but the results aren't promising so far - as soon as the list length gets non-trivial, the new way requires more comparisons than the old way so far:

    n 1 perms 1 old compares 0 new compares 0 diff % 0.0
    n 2 perms 2 old compares 2 new compares 2 diff % 0.0
    n 3 perms 6 old compares 16 new compares 16 diff % 0.0
    n 4 perms 24 old compares 112 new compares 116 diff % 3.57142857143
    n 5 perms 120 old compares 848 new compares 880 diff % 3.77358490566
    n 6 perms 720 old compares 7008 new compares 7296 diff % 4.1095890411
    n 7 perms 5040 old compares 63456 new compares 66432 diff % 4.68986384266
    n 8 perms 40320 old compares 628608 new compares 662496 diff % 5.39095907147
    n 9 perms 362880 old compares 6826752 new compares 7202304 diff % 5.50118123523
    n 10 perms 3628800 old compares 80605440 new compares 85006080 diff % 5.45948263542

I believe it would be very difficult to analyze this rigorously - and even if I saw an analysis it would be hard to trust it. Raw counts from simple code are hard to argue with ;-)
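The code above is Python 2 (`__cmp__`, `cmp`, print statements). To rerun the experiment on Python 3, a counting wrapper along the following lines should behave equivalently, since sorting and bisecting there only need `__lt__` and bisect performs one comparison per probe in either version; this is an adaptation for illustration, not code from the thread.

    gncmp = 0

    class IntWrap(object):
        def __init__(self, i):
            self.i = i

        def __lt__(self, other):
            global gncmp
            gncmp += 1
            return self.i < other.i

        def __eq__(self, other):
            # used only by the final correctness assert, outside the counted region
            return self.i == other.i

        def __repr__(self):
            return repr(self.i)

The print statements in drive() would also need to become print() calls under Python 3.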
FYI, here are two ideas I had way back when, but didn't pursue:

1. Merge "2 at a time" instead of just 1. That is, first "sort" the next 2 elements to be merged (1 compare and a possible swap). Then binary search to find where the smaller belongs, and a shorter binary search to find where the larger belongs. Then shift both into place. This can win on two counts:

   A. Less data movement, since the already-sorted values after the larger element get moved only once instead of twice.

   B. A possible cut in number of compares. Merging a sorted list of N elements with a sorted list of 2 elements has been studied a lot (e.g., search for "optimal merging of 2 elements" and find the paper by Hwang and Lin). The minimum average theoretical number of compares needed is ceiling(log2((N+2)*(N+1)/2)).

2. Instead of binary insertion sort, do an ordinary (but optimized) bottom-up merge sort. That may not cut the number of compares, but would slash worst-case data movement cost from O(n**2) to O(n*log(n)).

As to why your code is sometimes faster, for the Python code in your timing harness, well, that didn't actually sort anything, so wasn't measuring anything interesting (or even explainable ;-) ). For the Java code, I have no guess - I don't know enough about Java internals. Maybe "lucky" data, maybe cache effects, maybe a mistake - don't know, and can't guess. Or maybe my guess (above) at the intent of your code is all wrong.

It _is_ an interesting idea, though! Thanks for bringing it up :-)
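For concreteness, idea 1 might be sketched as follows. This is an illustration only, not Tim's, Armin's, or PyPy's actual code; insert_pair and sort2 are hypothetical names.

    from bisect import bisect_right

    def insert_pair(result, a, b):
        # Place the next two elements into the already-sorted list `result`:
        # one compare (and a possible swap) to order the pair, a binary
        # search for the smaller, then a shorter search for the larger.
        if b < a:
            a, b = b, a
        i = bisect_right(result, a)        # slot for the smaller element
        j = bisect_right(result, b, i)     # the larger one can't land before i
        result.insert(j, b)                # insert larger first so i stays valid
        result.insert(i, a)

    def sort2(seq):
        # Binary insertion sort that consumes two elements per iteration.
        seq = list(seq)
        result = seq[:1]
        i = 1
        while i + 1 < len(seq):
            insert_pair(result, seq[i], seq[i + 1])
            i += 2
        if i < len(seq):                   # odd element left over
            result.insert(bisect_right(result, seq[i]), seq[i])
        return result

Note that two list.insert calls still shift the tail twice; realizing the data-movement saving of 1A requires combining the shifts into one pass, as in the C-level variants linked below. Whether the compare count actually drops is what Armin reports on next.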

Hi Tim, On 10 March 2015 at 18:22, Tim Peters <tim.peters@gmail.com> wrote:
Good idea, but when I tried that it seemed to increase the total number of comparisons (on random inputs with only up to 136 items). The increase is on the order of 5%. I'm not sure reduced data movement can compensate for that in Python.

Test and code available here: https://bitbucket.org/arigo/arigo/src/default/hack/pypy-hack/list_sort/

The change to insert two items at a time is here: https://bitbucket.org/arigo/arigo/commits/68e04d143dc242cfd9e3934451321f685a... (This is taken straight from PyPy's code.)

A bientôt,
Armin.

[Tim]
[Armin]
Which is another way of saying "bad idea" - that must be why I didn't pursue it to begin with ;-) Thanks for trying!

I plugged a similar routine into the code I showed before to count the # of comparisons in Nha Pham's idea, and this "merge 2 at a time" thing has a higher average # of compares (over all permutations) than Nha's (which in turn has a higher average than the status quo).

That makes some sense, thinking about what they do. Nha's algorithm has some supernaturally good cases (input array already ascending or already descending), but "merge 2 at a time" doesn't appear to have any.

In any case, the information-theoretic minimum average number of comparisons for merging N sorted elements with 2 sorted elements is ("where do the 2 belong in the final list of N+2 elements?" = comb(N+2, 2)):

    log2((N+2)*(N+1)/2) = log2(N+2) + log2(N+1) - 1

Add a comparison to get the 2 elements in order to begin with, and we're up to

    log2(N+2) + log2(N+1)

Two independent binary inserts (first to a list of size N, and then to a list of size N+1) come out to the same. So even being supernaturally clever can't reduce the average number of compares this way.

And since, in context, we're only looking at short arrays, a marginal saving in data movement costs (which are still O(N**2) worst case) is unlikely to be significant. Still, if anyone wants to go nuts ... ;-)
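A quick numeric sanity check of that identity (purely illustrative, not from the thread):

    import math

    for n in (8, 16, 64, 136):
        pair_merge = 1 + math.log2((n + 2) * (n + 1) / 2)   # order the pair, then place both
        two_inserts = math.log2(n + 1) + math.log2(n + 2)   # insert into N, then into N+1
        print(n, round(pair_merge, 9), round(two_inserts, 9))

Both columns agree for every n, as the algebra says they must.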

participants (4):
- Armin Rigo
- Isaac Schwabacher
- nha pham
- Tim Peters