tim.one at comcast.net
Tue Oct 14 10:58:10 EDT 2003
> After reading this exchange, I'm not sure I agree with Tim about the
> importance of avoiding comparing the full records. Certainly the
> *cost* of that comparison doesn't bother me: I expect it's usually
> going to be a tuple or list of simple types like ints and strings, and
> comparing those is pretty efficient.
I made the remark about cost in reference to database sorts specifically,
where falling back to whole-record comparison can very easily double the
cost of a sort -- or much worse than that. We saw that in a sample
database-sort application Kevin Altis sent when the 2.3 sort was being
developed, where the results were grossly distorted at first because the
driver *didn't* arrange to break key ties with cheap compares. Sort a
database of companies by (as that example did, and among other things) the
stock exchange each is listed on, and you're guaranteed that a great many
duplicate keys exist (there are more companies than stock exchanges).
Comparing two ints is then vastly cheaper than firing up a written-in-Python
general database-record long-winded __cmp__ function.
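A minimal sketch of what "breaking key ties with cheap compares" means here, using decorate-sort-undecorate (the records and field names are hypothetical):

```python
# Decorate each record with (cheap_key, index, record).  Ties on the
# cheap key are broken by the int index, so the comparison never falls
# through to the expensive whole-record compare.
records = [
    {"name": "Acme", "exchange": "NYSE"},
    {"name": "Beta", "exchange": "NASDAQ"},
    {"name": "Gamma", "exchange": "NYSE"},
]
decorated = [(rec["exchange"], i, rec) for i, rec in enumerate(records)]
decorated.sort()  # equal exchanges fall back to comparing two ints
sorted_records = [rec for _, _, rec in decorated]
```

With many companies per exchange, almost every comparison is decided by the exchange string or the int index, never by the records themselves.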
> Sure, there's a pattern that requires records with equal to remain in
> the original order, but that seems an artefact of a pattern used only
> for external sorts (where the sorted data is too large to fit in
> memory -- Knuth Vol. III is full of algorithms for this, but they seem
> mostly of historical importance in this age of Gigabyte internal
It's not externality, it's decomposability: stability is what allows an
N-key primary-secondary-etc sort to be done one key at a time instead, in N
passes, and get the same result either way. Almost all sorts you're likely
to use in real life are stable in order to support this, whether it's
clicking on an email-metadata column in Outlook, or sorting an array of data
by a contained column in Excel. These are in-memory sorts, but interactive,
where the user refines sort criteria on the fly and the app has no memory of
what steps were taken before the current sort. Then stability is essential
to getting the right result -- or the user has to fill out a complex
multi-key sort dialog each time.
> The pattern is that if you want records sorted by zipcode and within
> each zipcode sorted by name, you first sort by name and then do a
> stable sort by zipcode. This was common in the days of external
I wager it's much more common now (not externality, but sorting first by one
key, then by another -- it's interactivity that drives this now).
> But using Raymond's proposal, you can do that by specifying a tuple
> consisting of zipcode and name as the key, as follows:
> myTable.sort(key = lambda rec: (rec.zipcode, rec.name))
> This costs an extra tuple, but the values in the tuple are not new, so
> it costs less space than adding the index int (not even counting the
> effect of the immortality of ints).
A hidden cost is that apps supporting interactive (or any other form of
multi-stage) sort refinements have to keep track of the full set of sort
keys ever applied.
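A sketch of why stability avoids that bookkeeping in the interactive case (the column-click model is hypothetical):

```python
# With a stable sort, each column click is a fresh sort on that column
# alone; the previous ordering survives as the tie-breaker, so the app
# never needs to remember which keys were applied earlier.
rows = [
    ("Acme", "NYSE", 2001),
    ("Beta", "NASDAQ", 1999),
    ("Gamma", "NYSE", 1999),
]

def click_column(rows, col):
    # One stable sort on just the clicked column.
    return sorted(rows, key=lambda r: r[col])

rows = click_column(rows, 2)  # user sorts by year...
rows = click_column(rows, 1)  # ...then by exchange; year order persists in ties
```

With the tuple-key approach instead, the app would have to rebuild a `(exchange, year)` key from the whole click history on every refinement.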
> And tuples aren't immortal. (To be honest, for tuples of length < 20,
> the first 2000 of a given size *are* immortal, but that's a strictly
> fixed amount of wasted memory.)
I don't care about that either.
> Given that this solution isn't exactly rocket science, I think the
> default behavior of Raymond's original proposal (making the whole
> record the second sort key) is fine.
It does approach rocket science for an end user to understand why their
database sort is so slow in the presence of many equal keys, and the absence
of a cheap tie-breaker. It's something I didn't appreciate either until
digging into why Kevin's sample app was so bloody slow in some cases (but
not all! it was the sorts with many equal keys that were pig slow, and
because -- as the app was coded -- they fell back to whole-record comparison).