[Python-Dev] RE: companies data for sorting comparisons
Tim Peters
tim.one@comcast.net
Sun, 28 Jul 2002 04:07:12 -0400
[Tim]
> ...
> Sorting on field 'NumberOfEmployees' -- 6531 records
> via sort: 120.15 120.24 120.26 120.31 122.58
> via msort: 134.25 134.29 134.50 134.74 135.09
> ...
> [where the # of comparisons done is]
> Sorting on field 'NumberOfEmployees' -- 6531 records
> via sort: 76167 ...
> via msort: 76554 ...
> ...
> [and various hypotheses for why it's >10% slower anyway don't pan out]
> ...
> I conclude that msort is at worst only a tiny bit slower when doing
> NumberOfEmployees, and possibly a little faster. The only measure that
> doesn't agree with that conclusion is time.clock() -- but time is such a
> subjective thing I won't let that bother me <wink>.
It's the dicts again. NumberOfEmployees isn't always unique, and in
particular it's missing in 6635-6531 = 104 records, so that
values = [(x.get(fieldname), x) for x in data]
includes 104 tuples with a None first element. Comparing a pair of those
gets resolved by comparing the dicts, and dict comparison ain't cheap.
Building the DSU tuple-list via
values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]
instead leads to
Sorting on field 'NumberOfEmployees' -- 6531 records
via sort: 47.47 47.50 47.54 47.66 47.75
via msort: 48.21 48.23 48.43 48.81 48.85
which gives both methods a huge speed boost, and cuts .sort's speed
advantage much closer to its small advantage in total # of comparisons. I
expect it's just luck of the draw as to which method is going to end up
comparing tuples with equal first elements more often, and msort apparently
does in this case (and those comparisons are more expensive, because they
have to go on to invoke int compare too).
A larger lesson: even if Python gets a stable sort and advertises stability
(we don't have to guarantee it even if it's there), there may *still* be
strong "go fast" reasons to include an object's index in its DSU tuple.
tickledly y'rs - tim