[Python-Dev] decorate-sort-undecorate

Mon Oct 13 23:31:58 EDT 2003

> [Phillip J. Eby]
> > What expense?  The extra memory overhead for the index?  I suppose
> > so.
> 
> Yes, that is an expense.  Partly because of the extra memory space in
> len(list) temp tuples, but mostly because space allocated for integer
> objects is immortal.  That is,
> 
>     range(10000000)
> 
> grabs space for 1000000 distinct integer objects that's never reused for any
> other kind of object, and so does stuffing a million distinct int objects
> into a temp DSU list.  Note that this is very different from doing
> 
>     for i in xrange(1000000):
> 
> which allocates space for only three integer objects (100000, the current
> value of i, and preceding value of i), and keeps reusing it.
> 
> A cleverer implementation might be able to avoid permanently ratcheting the
> space devoted to int objects.
> 
> > But if you *don't* want that behavior, you can still DSU manually, no?
> 
> I hope so <wink>.

After reading this exchange, I'm not sure I agree with Tim about the
importance of avoiding to compare the full records.  Certainly the
*cost* of that comparison doesn't bother me: I expect it's usually
going to be a tuple or list of simple types like ints and strings, and
comparing those is pretty efficient.

Sure, there's a pattern that requires records with equal to remain in
the original order, but that seems an artefact of a pattern used only
for external sorts (where the sorted data is too large to fit in
memory -- Knuth Vol. III is full of algorithms for this, but they seem
mostly of historical importance in this age of Gigabyte internal
memory).

The pattern is that if you want records sorted by zipcode and within
each zipcode sorted by name, you first sort by name and then do a
stable sort by zipcode.  This was common in the days of external
sorts.  (External sorts are still common in some application domains,
of course, but I doubt that you'd find much Python being used there
for the main sort.)

But using Raymond's proposal, you can do that by specifying a tuple
consisting of zipcode and name as the key, as follows:

  myTable.sort(key = lambda rec: (rec.zipcode, rec.name))

This costs an extra tuple, but the values in the tuple are not new, so
it costs less space than adding the index int (not even counting the
effect of the immortality of ints).  And tuples aren't immortal.  (To
be honest, for tuples of length < 20, the first 2000 of a given size
*are* immortal, but that's a strictly fixed amount of wasted memory.)

Given that this solution isn't exactly rocket science, I think the
default behavior of Raymond's original proposal (making the whole
record the second sort key) is fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)