[Python-Dev] RE: companies data for sorting comparisons

Tim Peters tim.one@comcast.net
Sun, 28 Jul 2002 01:48:26 -0400

Kevin Altis kindly forwarded a 1.5MB XML database with about 6600 company
records.  A record looks like this after running his script to turn them
into Python dicts:

  {'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
   'Company': 'Agilent Technologies Inc.',
   'Exchange': 'NYSE',
   'NumberOfEmployees': '41,000',
   'Phone': '(650) 752-5000',
   'Profile': 'http://biz.yahoo.com/p/a/a.html',
   'Symbol': 'A',
   'Web': 'http://www.agilent.com'}

It appears to me that the XML file is maintained by hand, in order of ticker
symbol.  But people make mistakes when alphabetizing by hand, and there are
37 indices i such that

    data[i]['Symbol'] > data[i+1]['Symbol']

So it's "almost sorted" by that measure, with a few dozen glitches -- and
msort should be able to exploit this!  I think this is an important case of
real-life behavior.  The proper order of Yahoo profile URLs is also strongly
correlated with ticker symbol, while both the company name and web address
look weakly correlated, so there's hope that msort can get some benefit on
those too.

Here are runs sorting on all field names, building a DSU tuple list to sort

    values = [(x.get(fieldname), x) for x in data]

Each field sort was run 5 times under sort, and under msort.  So 5 times are
reported for each sort, reported in milliseconds, and listed from quickest
to slowest:

Sorting on field 'Address' -- 6589 records
    via  sort:  43.03  43.35  43.37  43.54  44.14
    via msort:  45.15  45.16  45.25  45.26  45.30

Sorting on field 'Company' -- 6635 records
    via  sort:  40.41  40.55  40.61  42.36  42.63
    via msort:  30.68  30.80  30.87  30.99  31.10

Sorting on field 'Exchange' -- 6579 records
    via  sort: 565.28 565.49 566.70 567.12 567.45
    via msort: 573.29 573.61 574.55 575.34 576.46

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 120.15 120.24 120.26 120.31 122.58
    via msort: 134.25 134.29 134.50 134.74 135.09

Sorting on field 'Phone' -- 6589 records
    via  sort:  53.76  53.80  53.81  53.82  56.03
    via msort:  56.05  56.10  56.19  56.21  56.86

Sorting on field 'Profile' -- 6635 records
    via  sort:  58.66  58.71  58.84  59.02  59.50
    via msort:   8.74   8.81   8.98   8.99   8.99

Sorting on field 'Symbol' -- 6635 records
    via  sort:  39.92  40.11  40.19  40.38  40.62
    via msort:   6.49   6.52   6.53   6.72   6.73

Sorting on field 'Web' -- 6632 records
    via  sort:  47.23  47.29  47.36  47.45  47.45
    via msort:  37.12  37.27  37.33  37.42  37.89

So the hopes are realized:  msort gets huge benefit from the nearly-sorted
Symbol field, also huge benefit from the correlated Profile field, and
highly significant benefit from the weakly correlated Company and Web
fields.  K00L!

The Exchange field sort is so bloody slow because there are few distinct
Exchange values, and whenever there's a tie on those the tuple comparison
routine tries to break it by comparing the dicts.  Note that I warned about
this kind of thing a week or two ago, in the context of trying to implement
priority queues by storing and comparing (priority, object) tuples -- it can
be a timing disaster if priorities are ever equal.

The other fields (Phone, etc) are in essentially random order, and msort is
systematically a bit slower on all of those.  Note that these are all string
comparisons.  I don't think it's a coincidence that msort went from a major
speedup on the Python-Dev task, to a significant slowdown, when I removed
all duplicate lines and shuffled the corpus first.

Only part of this can be accounted for by # of comparisons.  On a given
random input, msort() may do fewer or more comparisons than sort(), but
doing many trials suggests that sort() has a small edge in # of compares on
random data, on the order of 1 or 2%  This is easy to believe, since msort
does a few things it *knows* won't repay the cost if the order happens to be
random.  These are well worth it, since they're what allow msort to get huge
wins when the data isn't random.

But that's not enough to account for things like the >10% higher runtime in
the NumberOfEmployees sort.  I can't reproduce this magnitude of systematic
slowdown when doing random sorts on floats or ints, so I conclude it has
something to do with string compares.  Unlike int and float compares, a
string compare takes variable time, depending on how long the common prefix
is.  I'm not aware of specific research on this topic, but it's plausible to
me that partitioning may be more effective than merging at reducing the
number of comparisons specifically involving "nearly equal" elements.
Indeed, the fastest string-sorting methods I know of move heaven and earth
to avoid redundant prefix comparisons, and do so by partitioning.

Turns out <wink> that doesn't account for it, though.  Here are the total
number of comparisons (first number on each line) done for each sort, and
the sum across all string compares of the number of common prefix characters
(second number on each line):

Sorting on field Address' -- 6589 records
    via  sort: 76188 132328
    via msort: 76736 131081

Sorting on field 'Company' -- 6635 records
    via  sort: 76288 113270
    via msort: 56013 113270

Sorting on field 'Exchange' -- 6579 records
    via  sort: 34851 207185
    via msort: 37457 168402

Sorting on field 'NumberOfEmployees' -- 6531 records
    via  sort: 76167 116322
    via msort: 76554 112436

Sorting on field 'Phone' -- 6589 records
    via  sort: 75972 278188
    via msort: 76741 276012

Sorting on field 'Profile' -- 6635 records
    via  sort: 76049 1922016
    via msort:  8750  233452

Sorting on field 'Symbol' -- 6635 records
    via  sort: 76073 73243
    via msort:  8724 16424

Sorting on field 'Web' -- 6632 records
    via  sort: 76207 863837
    via msort: 58811 666852

Contrary to prediction, msort always got the smaller "# of equal prefix
characters" total, even in the Exchange case, where it did nearly 10% more
total comparisons.

Python's string compare goes fastest if the first two characters aren't the
same, so maybe sort() gets a systematic advantage there?  Nope.  Turns out
msort() gets out early 17577 times on that basis when doing
NumberOfEmployees, but sort() only gets out early 15984 times.

I conclude that msort is at worst only a tiny bit slower when doing
NumberOfEmployees, and possibly a little faster.  The only measure that
doesn't agree with that conclusion is time.clock() -- but time is such a
subjective thing I won't let that bother me <wink>.