RE: companies data for sorting comparisons
Kevin Altis kindly forwarded a 1.5MB XML database with about 6600 company records. A record looks like this after running his script to turn them into Python dicts:

{'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
 'Company': 'Agilent Technologies Inc.',
 'Exchange': 'NYSE',
 'NumberOfEmployees': '41,000',
 'Phone': '(650) 752-5000',
 'Profile': 'http://biz.yahoo.com/p/a/a.html',
 'Symbol': 'A',
 'Web': 'http://www.agilent.com'}

It appears to me that the XML file is maintained by hand, in order of ticker symbol. But people make mistakes when alphabetizing by hand, and there are 37 indices i such that

    data[i]['Symbol'] > data[i+1]['Symbol']

So it's "almost sorted" by that measure, with a few dozen glitches -- and msort should be able to exploit this! I think this is an important case of real-life behavior. The proper order of Yahoo profile URLs is also strongly correlated with ticker symbol, while both the company name and web address look weakly correlated, so there's hope that msort can get some benefit on those too.

Here are runs sorting on all field names, building a DSU tuple list to sort via

    values = [(x.get(fieldname), x) for x in data]

(a rough sketch of this kind of harness follows at the end of this message). Each field sort was run 5 times under sort and 5 times under msort, so 5 times are reported for each method, in milliseconds, listed from quickest to slowest:

Sorting on field 'Address' -- 6589 records
    via  sort: 43.03 43.35 43.37 43.54 44.14
    via msort: 45.15 45.16 45.25 45.26 45.30

Sorting on field 'Company' -- 6635 records
    via  sort: 40.41 40.55 40.61 42.36 42.63
    via msort: 30.68 30.80 30.87 30.99 31.10

Sorting on field 'Exchange' -- 6579 records
    via  sort: 565.28 565.49 566.70 567.12 567.45
    via msort: 573.29 573.61 574.55 575.34 576.46

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 120.15 120.24 120.26 120.31 122.58
    via msort: 134.25 134.29 134.50 134.74 135.09

Sorting on field 'Phone' -- 6589 records
    via  sort: 53.76 53.80 53.81 53.82 56.03
    via msort: 56.05 56.10 56.19 56.21 56.86

Sorting on field 'Profile' -- 6635 records
    via  sort: 58.66 58.71 58.84 59.02 59.50
    via msort:  8.74  8.81  8.98  8.99  8.99

Sorting on field 'Symbol' -- 6635 records
    via  sort: 39.92 40.11 40.19 40.38 40.62
    via msort:  6.49  6.52  6.53  6.72  6.73

Sorting on field 'Web' -- 6632 records
    via  sort: 47.23 47.29 47.36 47.45 47.45
    via msort: 37.12 37.27 37.33 37.42 37.89

So the hopes are realized: msort gets huge benefit from the nearly-sorted Symbol field, also huge benefit from the correlated Profile field, and highly significant benefit from the weakly correlated Company and Web fields. K00L!

The Exchange field sort is so bloody slow because there are few distinct Exchange values, and whenever there's a tie on those the tuple comparison routine tries to break it by comparing the dicts. Note that I warned about this kind of thing a week or two ago, in the context of trying to implement priority queues by storing and comparing (priority, object) tuples -- it can be a timing disaster if priorities are ever equal.

The other fields (Phone, etc.) are in essentially random order, and msort is systematically a bit slower on all of those. Note that these are all string comparisons. I don't think it's a coincidence that msort went from a major speedup on the Python-Dev task to a significant slowdown when I removed all duplicate lines and shuffled the corpus first. Only part of this can be accounted for by # of comparisons.
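Purely as an illustration, here is a minimal sketch of the kind of DSU timing harness described above -- my own reconstruction, not Kevin's script or the instrumentation Tim used; the function name, trial count, and use of time.clock() are assumptions, and it relies on Python 2.x comparison rules:

    import time

    def time_field_sort(data, fieldname, trials=5):
        # Decorate each record with the field value and sort the
        # (key, record) tuples.  Ties on the key fall through to comparing
        # the dicts themselves -- exactly the expense discussed later on.
        results = []
        for trial in range(trials):
            values = [(x.get(fieldname), x) for x in data]
            start = time.clock()
            values.sort()
            results.append((time.clock() - start) * 1000.0)  # milliseconds
        results.sort()
        return results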
On a given random input, msort() may do fewer or more comparisons than sort(), but doing many trials suggests that sort() has a small edge in # of compares on random data, on the order of 1 or 2%. This is easy to believe, since msort does a few things it *knows* won't repay the cost if the order happens to be random. These are well worth it, since they're what allow msort to get huge wins when the data isn't random.

But that's not enough to account for things like the >10% higher runtime in the NumberOfEmployees sort. I can't reproduce this magnitude of systematic slowdown when doing random sorts on floats or ints, so I conclude it has something to do with string compares. Unlike int and float compares, a string compare takes variable time, depending on how long the common prefix is. I'm not aware of specific research on this topic, but it's plausible to me that partitioning may be more effective than merging at reducing the number of comparisons specifically involving "nearly equal" elements. Indeed, the fastest string-sorting methods I know of move heaven and earth to avoid redundant prefix comparisons, and do so by partitioning.

Turns out <wink> that doesn't account for it, though. Here are the total number of comparisons (first number on each line) done for each sort, and the sum across all string compares of the number of common prefix characters (second number on each line); a rough Python-level way to gather such counts is sketched after this message:

Sorting on field 'Address' -- 6589 records
    via  sort: 76188 132328
    via msort: 76736 131081

Sorting on field 'Company' -- 6635 records
    via  sort: 76288 113270
    via msort: 56013 113270

Sorting on field 'Exchange' -- 6579 records
    via  sort: 34851 207185
    via msort: 37457 168402

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 76167 116322
    via msort: 76554 112436

Sorting on field 'Phone' -- 6589 records
    via  sort: 75972 278188
    via msort: 76741 276012

Sorting on field 'Profile' -- 6635 records
    via  sort: 76049 1922016
    via msort:  8750  233452

Sorting on field 'Symbol' -- 6635 records
    via  sort: 76073 73243
    via msort:  8724 16424

Sorting on field 'Web' -- 6632 records
    via  sort: 76207 863837
    via msort: 58811 666852

Contrary to prediction, msort always got the smaller "# of equal prefix characters" total, even in the Exchange case, where it did nearly 10% more total comparisons.

Python's string compare goes fastest if the first two characters aren't the same, so maybe sort() gets a systematic advantage there? Nope. Turns out msort() gets out early 17577 times on that basis when doing NumberOfEmployees, but sort() only gets out early 15984 times.

I conclude that msort is at worst only a tiny bit slower when doing NumberOfEmployees, and possibly a little faster. The only measure that doesn't agree with that conclusion is time.clock() -- but time is such a subjective thing I won't let that bother me <wink>.
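Just to make the two numbers concrete, here is a rough Python 2-style sketch of one way to gather them for a list of string keys (not how the counts above were measured; the names and the comparison-function approach are my own, and Python-level counting slows the sort down a great deal):

    stats = {'compares': 0, 'prefix_chars': 0}

    def counting_cmp(a, b):
        # Count one comparison, plus the length of the common prefix when
        # both operands are strings -- the work a string compare must do
        # before it can find a differing character.
        stats['compares'] += 1
        if isinstance(a, str) and isinstance(b, str):
            n = 0
            limit = min(len(a), len(b))
            while n < limit and a[n] == b[n]:
                n += 1
            stats['prefix_chars'] += n
        return cmp(a, b)

    # keys = [x.get(fieldname) for x in data]
    # keys.sort(counting_cmp)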
[Tim]
...

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 120.15 120.24 120.26 120.31 122.58
    via msort: 134.25 134.29 134.50 134.74 135.09

... [where the # of comparisons done is]

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 76167 ...
    via msort: 76554 ...

... [and various hypotheses for why it's >10% slower anyway don't pan out] ...

I conclude that msort is at worst only a tiny bit slower when doing NumberOfEmployees, and possibly a little faster. The only measure that doesn't agree with that conclusion is time.clock() -- but time is such a subjective thing I won't let that bother me <wink>.
It's the dicts again. NumberOfEmployees isn't always unique, and in particular it's missing in 6635 - 6531 = 104 records, so that

    values = [(x.get(fieldname), x) for x in data]

includes 104 tuples with a None first element. Comparing a pair of those gets resolved by comparing the dicts, and dict comparison ain't cheap. Building the DSU tuple-list via

    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]

instead (spelled out in the sketch after this message) leads to

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 47.47 47.50 47.54 47.66 47.75
    via msort: 48.21 48.23 48.43 48.81 48.85

which gives both methods a huge speed boost, and cuts .sort's speed advantage much closer to its small advantage in total # of comparisons. I expect it's just luck of the draw as to which method is going to end up comparing tuples with equal first elements more often, and msort apparently does in this case (and those comparisons are more expensive, because they have to go on to invoke int compare too).

A larger lesson: even if Python gets a stable sort and advertises stability (we don't have to guarantee it even if it's there), there may *still* be strong "go fast" reasons to include an object's index in its DSU tuple.

tickledly y'rs  - tim
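Spelled out, the index-in-the-tuple flavor of DSU looks something like this (a sketch with made-up names, relying on 2.x comparison rules so that records with a missing field -- key None -- sort before the rest):

    def dsu_sort_by_field(data, fieldname):
        # Decorate: (key, original index, record).  The index breaks any tie
        # on equal or missing keys, so comparison never reaches the dicts,
        # and as a bonus the result is stable whatever the algorithm does.
        values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]
        values.sort()
        # Undecorate: recover just the records, now in sorted order.
        return [x for key, i, x in values]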
tim wrote:
A larger lesson: even if Python gets a stable sort and advertises stability (we don't have to guarantee it even if it's there)
if we guarantee it, all python implementors must provide one. how hard is it to implement a reasonably good stable sort from scratch? I can think of lots of really stupid ways to do it on top of existing sort code, which might be a reason to provide two different sort methods: sort (fast) and stablesort (guaranteed, but maybe not as fast as sort). in CPython, both names can map to timsort. (shouldn't you be writing a paper on this, btw? or start a sort blog ;-) </F>
[Tim]
A larger lesson: even if Python gets a stable sort and advertises stability (we don't have to guarantee it even if it's there)
[/F]
if we guarantee it, all python implementors must provide one.
Or a middle ground, akin to CPython's semi-reluctant guarantees of refcount semantics for "timely" finalization. A great many CPython users appear quite happy to rely on this despite that the language doesn't guarantee it.
how hard is it to implement a reasonably good stable sort from scratch?
A straightforward mergesort using a temp vector of size N is dead easy, and reasonably good (O(N log N) worst case). There aren't any other major N log N sorts that are naturally stable, nor even any I know of (and I know of a lot <wink>) that can be made stable without materializing list indices (or a moral equivalent). Insertion sort is naturally stable, but is O(N**2) expected case, so is DOA.
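For what it's worth, a bare-bones version of that straightforward stable mergesort could look like this (illustrative only -- it allocates temp lists per merge rather than reusing one size-N buffer, and it is not the msort implementation):

    def stable_mergesort(a):
        n = len(a)
        if n <= 1:
            return a[:]
        left = stable_mergesort(a[:n // 2])
        right = stable_mergesort(a[n // 2:])
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            # "<=" is what makes the merge stable: on a tie, the element
            # from the left half (earlier in the original list) goes first.
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged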
I can think of lots of really stupid ways to do it on top of existing sort code, which might be a reason to provide two different sort methods: sort (fast) and stablesort (guaranteed, but maybe not as fast as sort). in CPython, both names can map to timsort.
I don't want to see two sort methods on the list object, for reasons explained before. You've always been able to *get* a stable sort in Python via materializing the list indices in a 2-tuple, as in Alex's "stable sort" DSU recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52234 People overly <wink> concerned about portability can stick to that.
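The gist of that recipe, roughly (my paraphrase, not the recipe's exact code): decorate with the original position so elements that compare equal keep their original relative order no matter what the underlying sort does:

    def stable_sorted(seq):
        # On equal items the index decides, which both breaks ties cheaply
        # and guarantees stability of the overall result.
        decorated = [(item, i) for i, item in enumerate(seq)]
        decorated.sort()
        return [item for item, i in decorated]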
(shouldn't you be writing a paper on this, btw?
I don't think there's anything truly new here, although the combination of gimmicks may be unique. timsort.txt is close enough to a paper anyway, but better in that it only tells you useful things; the McIlroy paper covers all the rest <wink>.
or start a sort blog ;-)
That must be some sort of web thing, hence beyond my limited abilities.
Turns out there was one comparison per merge step I wasn't extracting maximum value from. Changing the code to suck all I can out of it doesn't make a measurable difference on sortperf results, except for a tiny improvement on ~sort on my box, but it makes a difference on the Exchange case of Kevin's data. Here I'm using

    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]

as the list to sort, and times are again in milliseconds:

Sorting on field 'Address' -- 6589 records
    via  sort: 41.24 41.39 41.41 41.42 86.71
    via msort: 42.90 43.01 43.07 43.15 43.75

Sorting on field 'Company' -- 6635 records
    via  sort: 40.24 40.34 40.42 40.43 42.58
    via msort: 30.42 30.45 30.58 30.66 30.66

Sorting on field 'Exchange' -- 6579 records
    via  sort: 59.64 59.70 59.71 59.72 59.81
    via msort: 27.06 27.11 27.19 27.29 27.54

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 47.61 47.65 47.73 47.75 47.76
    via msort: 48.55 48.57 48.61 48.73 48.92

Sorting on field 'Phone' -- 6589 records
    via  sort: 48.00 48.03 48.32 48.32 48.39
    via msort: 49.60 49.64 49.68 49.79 49.85

Sorting on field 'Profile' -- 6635 records
    via  sort: 58.63 58.70 58.80 58.85 58.92
    via msort:  8.47  8.48  8.51  8.59  8.68

Sorting on field 'Symbol' -- 6635 records
    via  sort: 39.93 40.13 40.16 40.28 41.37
    via msort:  6.20  6.23  6.23  6.43  6.98

Sorting on field 'Web' -- 6632 records
    via  sort: 46.75 46.77 46.86 46.87 47.05
    via msort: 36.44 36.66 36.69 36.69 36.96

'Profile' is slower than the rest for samplesort because the strings it's comparing are Yahoo URLs with a long common prefix -- the compares just take longer in that case. I'm not sure why 'Exchange' takes so long for samplesort (it's a case with lots of duplicate primary keys, but the distribution is highly skewed, not uniform as in ~sort).

In all cases now, msort is a major-to-killer win, or a small (but real) loss. I'll upload a new patch and new timsort.txt next. Then I'm taking a week off! No, I wish it were for fun <wink/sigh>.
Update: With the last batch of checkins, all sorts on Kevin's company database are faster (a little to a killer lot) under 2.3a0 than under 2.2.1. A reminder of what this looks like:
A record looks like this after running his script to turn them into Python dicts:
{'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
 'Company': 'Agilent Technologies Inc.',
 'Exchange': 'NYSE',
 'NumberOfEmployees': '41,000',
 'Phone': '(650) 752-5000',
 'Profile': 'http://biz.yahoo.com/p/a/a.html',
 'Symbol': 'A',
 'Web': 'http://www.agilent.com'}
It appears to me that the XML file is maintained by hand, in order of ticker symbol. But people make mistakes when alphabetizing by hand, and there are 37 indices i such that
data[i]['Symbol'] > data[i+1]['Symbol']
So it's "almost sorted" by that measure ... The proper order of Yahoo profile URLs is also strongly correlated with ticker symbol, while both the company name and web address look weakly correlated [and Address, NumberOfEmployees, and Phone are essentially randomly ordered]
Here are the latest (and I expect the last) timings, in milliseconds per sort, on the list of (key, index, record) tuples

    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]

[I wrote a little generator to simulate 2.3's enumerate() in 2.2.1 -- see the sketch after this message.]

There are 6635 companies in the database, but not all fields are present in all records; .get() plugs in a key of None for those cases, and the index is there to prevent equal-key cases from falling back on breaking the tie via expensive dict comparison (each record x is a dict!):

Sorting on field 'Address'
    2.2.1: 41.57
    2.3a0: 40.96

Sorting on field 'Company'
    2.2.1: 40.14
    2.3a0: 29.79

Sorting on field 'Exchange'
    2.2.1: 53.83
    2.3a0: 24.79

Sorting on field 'NumberOfEmployees'
    2.2.1: 47.89
    2.3a0: 45.74

Sorting on field 'Phone'
    2.2.1: 48.09
    2.3a0: 47.15

Sorting on field 'Profile'
    2.2.1: 58.41
    2.3a0:  8.77

Sorting on field 'Symbol'
    2.2.1: 40.78
    2.3a0:  6.30

Sorting on field 'Web'
    2.2.1: 46.79
    2.3a0: 35.64

This may have been sorted more times by now than any other database on Earth <wink>.
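A sketch of one such generator (my guess at it, not Tim's actual code); under 2.2 it needs the generators future-import, and it shadows nothing since 2.2 has no builtin enumerate():

    from __future__ import generators

    def enumerate(seq):
        # Minimal stand-in for 2.3's builtin: yield (index, item) pairs.
        i = 0
        for item in seq:
            yield i, item
            i += 1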