Guido van Rossum
guido at python.org
Tue Oct 14 12:31:47 EDT 2003
> > After reading this exchange, I'm not sure I agree with Tim about the
> > importance of avoiding to compare the full records. Certainly the
> > *cost* of that comparison doesn't bother me: I expect it's usually
> > going to be a tuple or list of simple types like ints and strings, and
> > comparing those is pretty efficient.
> I made the remark about cost in reference to database sorts specifically,
> where falling back to whole-record comparison can very easily double the
> cost of a sort -- or much worse than that.
When exactly do you consider a sort a "database sort"?
> We saw that in a sample
> database-sort application Kevin Altis sent when the 2.3 sort was being
> developing, where the results were grossly distorted at first because the
> driver *didn't* arrange to break key ties with cheap compares. Sort a
> database of companies by (as that example did, and among other things) the
> stock exchange each is listed on, and you're guaranteed that a great many
> duplicate keys exist (there are more companies than stock exchanges).
> Comparing two ints is then vastly cheaper than firing up a written-in-Python
> general database-record long-winded __cmp__ function.
No argument there.
> > Sure, there's a pattern that requires records with equal to remain in
> > the original order, but that seems an artefact of a pattern used only
> > for external sorts (where the sorted data is too large to fit in
> > memory -- Knuth Vol. III is full of algorithms for this, but they seem
> > mostly of historical importance in this age of Gigabyte internal
> > memory).
> It's not externality, it's decomposability: stability is what allows an
> N-key primary-secondary-etc sort to be done one key at a time instead, in N
> passes, and get the same result either way. Almost all sorts you're likely
> to use in real life are stable in order to support this, whether it's
> clicking on an email-metadata column in Outlook, or sorting an array of data
> by a contained column in Excel.
I experimented a bit with the version of Outlook I have, and it seems
to always use the delivery date/time as the second key, and always in
> These are in-memory sorts, but interactive,
> where the user refines sort criteria on the fly and the app has no memory of
> what steps were taken before the current sort. Then stability is essential
> to getting the right result -- or the user has to fill out a complex
> multi-key sort dialog each time.
I'm not sure that this helps us decide the default behavior of sorts
in Python, which are rarely interactive in this sense. (If someone
writes an Ourlook substitute, they can pretty well code the sort to do
whatever they want.)
> > The pattern is that if you want records sorted by zipcode and within
> > each zipcode sorted by name, you first sort by name and then do a
> > stable sort by zipcode. This was common in the days of external
> > sorts.
> I wager it's much more common now (not externality, but sorting first by one
> key, then by another -- it's interactivity that drives this now).
But I don't see the interactivity in Python apps, and that's what
> > ...
> > But using Raymond's proposal, you can do that by specifying a tuple
> > consisting of zipcode and name as the key, as follows:
> > myTable.sort(key = lambda rec: (rec.zipcode, rec.name))
> > This costs an extra tuple, but the values in the tuple are not new, so
> > it costs less space than adding the index int (not even counting the
> > effect of the immortality of ints).
> A hidden cost is that apps supporting interactive (or any other form of
> multi-stage) sort refinements have to keep track of the full set of sort
> keys ever applied.
A small cost compared to the cost of writing an interactive app, *if*
you really want this behavior (I doubt it matters to most people using
> > And tuples aren't immortal. (To be honest, for tuples of length < 20,
> > the first 2000 of a given size *are* immortal, but that's a strictly
> > fixed amount of wasted memory.)
> I don't care about that either.
> > Given that this solution isn't exactly rocket science, I think the
> > default behavior of Raymond's original proposal (making the whole
> > record the second sort key) is fine.
> It does approach rocket science for an end user to understand why their
> database sort is so slow in the presence of many equal keys, and the absence
> of a cheap tie-breaker. It's something I didn't appreciate either until
> digging into why Kevin's sample app was so bloody slow in some cases (but
> not all! it was the sorts with many equal keys that were pig slow, and
> because-- as the app was coded --they fell back to whole-record comparison).
Yeah, understanding performance anomalies is hard.
To cut all this short, I propose that we offer using the index as a
second sort key as an option, on by default, whose name can be (barely
misleading) "stable". On by default nicely matches the behavior of
the 2.3 sort without any options.
--Guido van Rossum (home page: http://www.python.org/~guido/)
More information about the Python-Dev