Kevin Altis kindly forwarded a 1.5MB XML database with about 6600 company records. A record looks like this after running his script to turn them into Python dicts:

    {'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
     'Company': 'Agilent Technologies Inc.',
     'Exchange': 'NYSE',
     'NumberOfEmployees': '41,000',
     'Phone': '(650) 752-5000',
     'Profile': 'http://biz.yahoo.com/p/a/a.html',
     'Symbol': 'A',
     'Web': 'http://www.agilent.com'}

It appears to me that the XML file is maintained by hand, in order of ticker symbol. But people make mistakes when alphabetizing by hand, and there are 37 indices i such that

    data[i]['Symbol'] > data[i+1]['Symbol']

So it's "almost sorted" by that measure, with a few dozen glitches -- and msort should be able to exploit this! I think this is an important case of real-life behavior. The proper order of Yahoo profile URLs is also strongly correlated with ticker symbol, while both the company name and web address look weakly correlated, so there's hope that msort can get some benefit on those too.

Here are runs sorting on all field names, building a DSU (decorate-sort-undecorate) tuple list to sort via

    values = [(x.get(fieldname), x) for x in data]

Each field sort was run 5 times under sort and 5 times under msort; the 5 times for each are reported in milliseconds, listed from quickest to slowest:

Sorting on field 'Address' -- 6589 records
    via  sort: 43.03 43.35 43.37 43.54 44.14
    via msort: 45.15 45.16 45.25 45.26 45.30

Sorting on field 'Company' -- 6635 records
    via  sort: 40.41 40.55 40.61 42.36 42.63
    via msort: 30.68 30.80 30.87 30.99 31.10

Sorting on field 'Exchange' -- 6579 records
    via  sort: 565.28 565.49 566.70 567.12 567.45
    via msort: 573.29 573.61 574.55 575.34 576.46

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 120.15 120.24 120.26 120.31 122.58
    via msort: 134.25 134.29 134.50 134.74 135.09

Sorting on field 'Phone' -- 6589 records
    via  sort: 53.76 53.80 53.81 53.82 56.03
    via msort: 56.05 56.10 56.19 56.21 56.86

Sorting on field 'Profile' -- 6635 records
    via  sort: 58.66 58.71 58.84 59.02 59.50
    via msort:  8.74  8.81  8.98  8.99  8.99

Sorting on field 'Symbol' -- 6635 records
    via  sort: 39.92 40.11 40.19 40.38 40.62
    via msort:  6.49  6.52  6.53  6.72  6.73

Sorting on field 'Web' -- 6632 records
    via  sort: 47.23 47.29 47.36 47.45 47.45
    via msort: 37.12 37.27 37.33 37.42 37.89

So the hopes are realized: msort gets a huge benefit from the nearly-sorted Symbol field, an equally huge benefit from the correlated Profile field, and a highly significant benefit from the weakly correlated Company and Web fields. K00L!

The Exchange field sort is so bloody slow because there are few distinct Exchange values, and whenever there's a tie on those the tuple comparison routine tries to break it by comparing the dicts. Note that I warned about this kind of thing a week or two ago, in the context of trying to implement priority queues by storing and comparing (priority, object) tuples -- it can be a timing disaster if priorities are ever equal. A sketch of the standard escape follows below.

The other fields (Phone, etc.) are in essentially random order, and msort is systematically a bit slower on all of those. Note that these are all string comparisons. I don't think it's a coincidence that msort went from a major speedup on the Python-Dev task to a significant slowdown when I removed all duplicate lines and shuffled the corpus first. Only part of this can be accounted for by # of comparisons.
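As an aside, here's the escape from that tie-comparison trap: decorate each record with its list index too, so key ties are settled by a cheap integer compare and the tuple comparison never falls through to the dicts. This is a minimal sketch (not what the runs above did), mirroring the DSU line above:

    # The index i makes every tuple's (key, i) prefix unique, so the
    # dicts in the last position are never compared.
    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]
    values.sort()
    result = [x for _, _, x in values]

The same decoration rescues the (priority, object) priority-queue scheme: store (priority, count, object) with an ever-increasing count, and equal priorities never trigger a compare of the objects. As a bonus, the index decoration also makes the sort stable.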
Back to comparison counts: on a given random input, msort() may do fewer or more comparisons than sort(), but doing many trials suggests that sort() has a small edge in # of compares on random data, on the order of 1 or 2%. This is easy to believe, since msort does a few things it *knows* won't repay the cost if the order happens to be random. These are well worth it, since they're what allow msort to get huge wins when the data isn't random.

But that's not enough to account for things like the >10% higher runtime in the NumberOfEmployees sort. I can't reproduce this magnitude of systematic slowdown when doing random sorts on floats or ints, so I conclude it has something to do with string compares. Unlike int and float compares, a string compare takes variable time, depending on how long the common prefix is. I'm not aware of specific research on this topic, but it's plausible to me that partitioning may be more effective than merging at reducing the number of comparisons specifically involving "nearly equal" elements. Indeed, the fastest string-sorting methods I know of move heaven and earth to avoid redundant prefix comparisons, and do so by partitioning.

Turns out <wink> that doesn't account for it, though. Here are the total number of comparisons (first number on each line) done for each sort, and the sum across all string compares of the number of common prefix characters (second number on each line):

Sorting on field 'Address' -- 6589 records
    via  sort: 76188 132328
    via msort: 76736 131081

Sorting on field 'Company' -- 6635 records
    via  sort: 76288 113270
    via msort: 56013 113270

Sorting on field 'Exchange' -- 6579 records
    via  sort: 34851 207185
    via msort: 37457 168402

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 76167 116322
    via msort: 76554 112436

Sorting on field 'Phone' -- 6589 records
    via  sort: 75972 278188
    via msort: 76741 276012

Sorting on field 'Profile' -- 6635 records
    via  sort: 76049 1922016
    via msort:  8750  233452

Sorting on field 'Symbol' -- 6635 records
    via  sort: 76073 73243
    via msort:  8724 16424

Sorting on field 'Web' -- 6632 records
    via  sort: 76207 863837
    via msort: 58811 666852

Contrary to prediction, msort always got the smaller "# of equal prefix characters" total, even in the Exchange case, where it did nearly 10% more total comparisons. Python's string compare goes fastest if the first two characters aren't the same, so maybe sort() gets a systematic advantage there? Nope. Turns out msort() gets out early 17577 times on that basis when doing NumberOfEmployees, but sort() only gets out early 15984 times.

I conclude that msort is at worst only a tiny bit slower when doing NumberOfEmployees, and possibly a little faster. The only measure that doesn't agree with that conclusion is time.clock() -- but time is such a subjective thing I won't let that bother me <wink>.
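For the curious: the counts above were gathered by instrumenting the sorts themselves, but you can approximate the same bookkeeping at the Python level by wrapping each key in a little class whose __lt__ tallies the comparison and the common-prefix length before delegating to the plain string compare. A minimal sketch -- the names Counted and stats are illustrative, not from my instrumentation, and records missing the field are simply skipped (which is presumably why the record counts differ from field to field above):

    stats = {'compares': 0, 'prefix_chars': 0, 'early_outs': 0}

    class Counted:
        """Wrap a string key so the sort's compares can be tallied."""
        def __init__(self, s):
            self.s = s
        def __lt__(self, other):
            a, b = self.s, other.s
            stats['compares'] += 1
            # Count how many leading characters the two strings share.
            k = 0
            n = min(len(a), len(b))
            while k < n and a[k] == b[k]:
                k += 1
            stats['prefix_chars'] += k
            if k == 0:
                # First characters already differ: string compare's
                # cheapest case.
                stats['early_outs'] += 1
            return a < b

    keys = [Counted(x[fieldname]) for x in data if fieldname in x]
    keys.sort()

Wrapping the keys adds plenty of overhead of its own, of course, so this is only good for counting compares, not for timing them.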