sorting many arrays from one...

Alex Martelli aleax at aleax.it
Tue Jul 9 13:30:37 EDT 2002


Jerzy Karczmarczuk wrote:
        ...
>> the same cmp that would be used anyway, you pay well more than an
>> order of magnitude for the privilege of passing a compare function
>> to the sort method!!!
> 
> This was one of my principal griefs when I began to learn Python. I still
> don't know why is it so costly. One may suppose that an interpreted user
> function f which calls cmp may be the main reason, but xx.sort(cmp) is
> also slow [in my rudimentary test about twice as fast as sort(f)].
> 
> Somebody cares about explaining this behaviour? Dynamic dispatching,
> while sort() does not any dereferencing? Or what?

Basically, sort needs to do O(N log N) comparisons -- if each comparison
requires a function call, the time for those calls dominates the sorting
time even if each call does little work: the per-call overhead is there
regardless.  If you pass nothing, sort knows to use the built-in
comparison, and its wonderful optimizations take hold.

An interpreted function increases the overhead, of course, but the
function-call mechanism itself is costly.  Indeed, passing cmp directly
in my test gives a slowdown of about 5 times, while an f that merely
calls cmp gives about 10 times -- see, it IS all about function-call time.


> For people coming from the Functional Programming Realm such slowdown
> of parameterized functionals is hard to accept...

Me, I'm not particularly perturbed by the fact that function calls
have a cost.  Sorting a randomly shuffled sequence of 300,000 items
requires a bit more than 5 million comparisons (just checked this
out with a suitable comparison function incrementing a global
counter as a side effect:-).  A slowdown of about 3.9 CPU seconds
on my machine for that many calls works out at about 0.76 microseconds
per call -- preparing the argument tuple, unpacking it in the
called function, etc.  Doesn't feel wildly out of control to me --
it's just a LOT of function calls, using mechanisms designed for an
interpreted, dynamic language's need for flexibility, not for
superb performance.

What makes it FEEL huge is sort's own superb performance -- 0.16
microseconds per comparison *everything included* when using the
built-in comparison.
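(The counting function I mentioned is nothing fancier than this, e.g.:)

import random

ncmp = 0
def counting_cmp(a, b):
    # increment a global counter as a side effect, then compare as usual
    global ncmp
    ncmp += 1
    return cmp(a, b)

data = range(300000)
random.shuffle(data)
data.sort(counting_cmp)
print ncmp    # a bit more than 5 million on a well-shuffled list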

So, it's all Tim Peters' fault, for making sort so incredibly fast --
let's all gang up against him!!!

Oh, and, a warning when you try to do such measurements: if you
mistakenly call sort on an already-sorted list, it only does
299999 comparisons (for a 300000-item list) and finishes in
0.04 seconds -- 0.78 with a comparison function, so the ratio
gets even worse.  If you have just one item out of order at
the end, it's 300019 comparisons, and the same performance.  This
is truly intolerable -- we HAVE to slow this darn thing down!!!
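(Continuing the counting_cmp sketch above, the pitfall is easy to see:)

already = range(300000)     # already in order -- oops
ncmp = 0
already.sort(counting_cmp)
print ncmp                  # just 299999: N-1 comparisons, not ~N log N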


>> Don't do it.  Use D-S-U, as AMPLY explained in the searching and
>> sorting chapter of the Python Cookbook
> 
> I don't want to depreciate your helpful attitude addressed to a person
> in need, nor say anything bad about something considered a "common idiom",
> but the idea of Decorate-Sort-Undecorate, which begins by constructing
> an additional list with fast-sortable keys is also something hard to
> digest.

May I suggest some Fernet-Branca or other Italian amaro to your liking?
They do wonders to help you digest rich, too-abundant meals.

Say I have to sort N items.  Transforming each into something that's
most rapidly comparable, or performing the actual comparison, will
typically have a similar CPU cost "X".  A comparison may often be
closer to a cost of 2X, actually (two items to compute on), but
let's forget that for now.

Now, by preparing a list of N fastsortable items, I do N*X work.

To sort a list of N non-preprocessed items, I do N*logN*X work.

Thus, preparing the list is logN times faster than not preparing it.
Infinitely faster, as N grows...:-).

Not quite, of course, because I STILL have some N*logN work to
do even with DSU -- say N*logN*eps for some small eps < X.  Even
though eps is small, for large enough N the N*X term becomes
irrelevant and the performance ratio approaches X/eps.

So, if preparing fastsortable items takes about 10 times longer
than the fastest possible comparison (and pointer swap), then
DSU speeds things up by almost 10 times for substantial N.

What's at all surprising about this...?
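To put some code behind the arithmetic, here's a rough sketch of the
two approaches -- expensive_key is just a made-up stand-in for whatever
costly per-item computation your real task needs:

import math

def expensive_key(item):
    # made-up stand-in for a costly per-item computation (the "X" above)
    return math.sqrt(item) * math.log(item + 1)

def expensive_cmp(a, b):
    # the same work, done twice per comparison, N log N times
    return cmp(expensive_key(a), expensive_key(b))

def dsu_sort(items):
    # Decorate: compute each key exactly once -- N calls, O(N) work
    decorated = zip(map(expensive_key, items), range(len(items)), items)
    # Sort: built-in tuple comparison, no Python-level calls; the index
    # in the middle slot keeps key ties from ever comparing the items
    decorated.sort()
    # Undecorate: recover just the items, now in key order
    return [item for (key, i, item) in decorated]

# the slow way: data.sort(expensive_cmp)   -- about 2*N*logN key computations
# the DSU way:  data = dsu_sort(data)      -- N key computations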


Suppose there were no call overhead at all, just a very
complicated comparison.  The effect would be exactly the same,
no?  Applying a complicated comparison function must happen
N logN times; applying an equally complicated transformation
function that makes the items rapidly comparable must happen
only N times, and thus becomes irrelevant for large N.

So, the X/eps ratio generalizes to just about any sorting
task that can be represented with either complicated comparisons
OR equally complicated transformations into a form that can
then be compared quickly.

The call-overhead only ensures that X is never less than
(say) 0.8 microseconds/call in Python (typically more, as
you don't JUST have the call...).  sort's excellent own
optimization only ensures that eps can be as low as 0.16
microseconds/comparison-and-swap on the same machine.  So,
a ratio of 5 can be achieved even with the simplest
comparison function, 10 with one not much more complicated,
and so on.

But the same effect will apply in ANY language for the
appropriate kind of sorting task if the comparison op
you need grows rich and complicated enough to matter.

Assuming the language's sort has perfect optimization
and the language makes it very easy to build, handle
and dismantle sequences of tuples, of course:-).  But
these are hardly DEFECTS, are they?-)


> The author of the original question had 300 000 items. I had
> once five times more, and I found it a bit ridiculous to pollute the
> memory with an ephemeric sequence just because there is something lousy
> with parametrized sort (and possibly other H. O. functions?)
> What do you think?

I think you're using loaded language, presumably reflecting
loaded and not-well-considered sentiments.  There's nothing
"lousy" with functions in Python -- they just take measurable
time to call, that's all.  Most often one doesn't care all
that much about performance.  Sorting large datasets is most
often an exception -- one DOES care, and there's a neat way
to minimize the amount of effort spent in comparison, including
function-call overhead.

DSU, and more generally the possibility of preprocessing data
(with linear cost O(N)) to speed up further processing to come
(with higher O() costs), is an excellent technique to keep in
mind in any language, and Python makes it very easy to use.
That's not "polluting the memory with an ephemeric sequence" --
it's investing some memory (not much -- O(N) memory, peanuts!)
to reduce computation.  A hallowed, time-honored approach, you know.


Besides, in case you hadn't noticed, the OP needed DSU as the
most effective way to sort *ANYWAY* -- he's sorting several
"parallel" arrays based on the values in just one of them.  If
the "basing" needs complicated functions to either compare or
preprocess the sort keys, that just adds one extra slot to each
of the tuples he'd need *anyway*....
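E.g., a tiny sketch with made-up parallel lists (slip whatever
key-computation you need into an extra leading slot while decorating):

keys   = [3, 1, 2]                     # the column that drives the sort
names  = ['carol', 'alice', 'bob']     # parallel data, made up for show
scores = [30, 10, 20]

decorated = zip(keys, names, scores)   # one tuple per "row"
decorated.sort()                       # compares on keys first
keys, names, scores = zip(*decorated)  # back to parallel sequences (tuples)
# (if key ties matter, slip an index in right after the key)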


Alex



