sorting many arrays from one...

Alex Martelli aleax at aleax.it
Wed Jul 10 09:12:30 EDT 2002


Jerzy Karczmarczuk wrote:

> Alex Martelli comments my remarks on the slowdown of the sorting
> functional parameterized by a named comparison function:
> 
>> > Somebody cares about explaining this behaviour? Dynamic dispatching,
>> > while sort() does not any dereferencing? Or what?
> 
> 
>> Basically, sort needs to do O(N log N) comparisons -- if each comparison
>> requires a function call, the time for those calls dominates sorting
>> times even if each call does little -- the overhead is the same.
> 
> Sorry, but this does not explain anything. Each comparison requires a
> function call (perhaps inlined in some cases, but I don't believe that
> 'compare' may be universally inlined) whether this comparison is passed as

Not in Python terms.  The underlying engine (written in C, Java, C#, or
whatever) may arrange execution however it pleases, but it typically does
not need to perform a Python function call per comparison.

> a parameter, or called "because the sorting device <<knows>> what to
> call". So, thank you very much indeed for your nice N log N remainder, but
> the issue is not the cost of the *call* (execution), but of "call"
> (dereferencing/dispatching).

The cost of the call is not the cost of executing the function body:
it's the cost of the call machinery itself.  Preparing arguments,
putting them on-stack as a tuple, getting and unpacking that tuple
from the stack in the called function, &c.  I'm not sure what
dereferencing and what dispatching you have in mind, but argument
passing, which is neither dereferencing nor dispatching, IS part of
the cost of the call itself.
That's quite separate from the cost of the _comparison_ operations,
performed either in the function being called, or more directly in
a comparison-method.
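
To make this concrete, here is the kind of rough timing sketch (purely
illustrative, in the Python 2 idiom of the time; exact figures will of
course vary with machine and version) that shows the per-call overhead,
rather than the comparison itself, dominating:

    import random, time

    data = [random.random() for i in xrange(300000)]

    def timed_sort(seq, *args):
        copy = seq[:]                 # always sort a fresh, unsorted copy
        start = time.clock()
        copy.sort(*args)
        return time.clock() - start

    print 'built-in comparison    :', timed_sort(data)
    print 'cmp passed directly    :', timed_sort(data, cmp)
    print 'function that calls cmp:', timed_sort(data, lambda a, b: cmp(a, b))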


>> If you pass
>> nothing, sort knows to use built-in comparison methods, and its wonderful
>> optimizations take hold.
> 
> May I ask somebody WHO KNOWS, what are those "wonderful optimizations",

Why, of course -- Tim Peters, their author, explains them well in his 
preface to the Searching and Sorting chapter in the Python Cookbook that
O'Reilly publishes these days -- I would suggest you read that.  But
of course studying the code in Objects/listobject.c is the ultimate
way to learn about them -- samplesort, fallback to cleverly tweaked
insertion-sort, etc, etc.
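
Just to give the flavor of the insertion-sort part, here is a deliberately
simplified Python sketch of binary insertion sort -- emphatically NOT the
actual C code in Objects/listobject.c, only an illustration of growing a
sorted prefix with a binary search for each insertion point:

    import bisect

    def binary_insertion_sort(a):
        # The prefix a[:i] is always kept sorted; each new item is placed
        # into it at the position a binary search finds.
        for i in range(1, len(a)):
            item = a[i]
            pos = bisect.bisect(a, item, 0, i)
            del a[i]
            a.insert(pos, item)
        return a

    print binary_insertion_sort([3, 1, 4, 1, 5, 9, 2, 6])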

> and why are they so brutally deactivated with, say, .sort(cmp)?

Nothing is "deactivated", either brutally or otherwise: simply, when
you have a different major bottleneck, optimizations to other parts
that were crucial when THOSE parts were the bottleneck diminish in
their effectiveness on overall performance -- Amdahl's Law.  The sort
method of list objects still performs the minimum possible number of
comparisons, whether or not you call it with a comparison function
(assuming that function is perfectly correct, of course -- all bets
are off otherwise:-).  But how much time it costs to perform each
of these comparisons is totally out of the sort method's hands, of course.


>> Indeed passing cmp directly
>> in my test I see a slowdown of about 5 times, while f which calls cmp
>> gives 10 times -- see, it IS all about function-call times.
> 
> Thank you for confirming my test. And for confirming that all IS about
> the function call. I believe I made it clear that I am aware of that.
> I asked WHY? One of the important paradigms of modern programming is
> higher-order functions.

Sure.  Python supports callables of many kinds smoothly and fluently,
and higher-order functions are among those many kinds.  But each call
that is actually performed still takes a given amount of time, period.


>> > For people coming from the Functional Programming Realm such slowdown
>> > of parameterized functionals is hard to accept...
> 
>> A slowdown of about 3.9 CPU seconds
>> on my machine for that many calls works out at about 0.76 microseconds
>> per call -- preparing the arguments tuple, unpacking it in the
>> called function, etc.  Doesn't feel wildly out of control to me --
>> it's just a LOT of function calls, using mechanisms studied for an
>> interpreted, dynamic language's needs for flexibility, not to get
>> superb performance.
> 
> Aha, preparing argument tuples (you mean: making them physically, or
> just stacking arguments, or what?). Yes, it takes some time, but there
> are interpreted languages (e.g. ML) where it is very well optimized.
> No physical unpacking of any tuples inside a function.

I'm not sure what you mean by calling ML "an interpreted language" --
I have used compilers for several ML dialects that generated extremely
effective machine code (I'm thinking specifically of O'CAML, but
I don't think SML trails behind, or not by much -- and surely there's
nothing about the LANGUAGE that impedes it).  You may surely have
interpreted implementations of whatever language you want, but the
possibility of using interpreters for (e.g.) Motorola 68000 machine
language (as routinely done on modern Macintosh computers) doesn't
mean it's "an interpreted language"... it's still a machine language
no matter what implementation is used at a given time.

ML is designed to allow compile-time type inference, which among other
pluses conceptually allows all sorts of optimizations (whether a given
implementation performs them or not is obviously an issue of the
given implementation, not of the language itself).  Optimization of
function calls, argument passing, et al, is clearly among those.

Python has a very different language design, firmly rooted in run-time
type determination.  That might theoretically not preclude some level
of (maybe JIT) type inference -- Scheme's compiler 'Stalin' being a
good example of the best possibilities -- several applications of type
inference to Python have been proposed and/or attempted, one of the
most interesting currently active projects being Psyco, see:
http://homepages.ulb.ac.be/~arigo/psyco/

In practice, the standard Python implementation does almost NO
optimizations at all -- and definitely none requiring any substantial
complication whatsoever.  Just about all optimization effort has
historically gone to the C-coded internals, sort method and dict
behavior being two excellent examples.  The philosophy of putting
simplicity first and foremost has served Python well, yielding a stable,
industrial-strength, production-ready implementation of a language
whose strong point is maximizing programmer productivity in the
real world.  So, don't hold your breath waiting for advanced, highly
complicated optimization techniques to appear in the "production"
version of Python.  If some such technique is your hobby-horse,
joining the Python development community is probably the only way
that offers any concrete hope of ever seeing it widely deployed.


>> What makes it FEEL huge is sort's own superb performance -- 0.16
>> microseconds per comparison *everything included* when using the
>> built-in comparison.
>> 
>> So, it's all Tim Peters' fault, for making sort so incredibly fast --
>> let's all gang up against him!!!
> 
> Now, put 4 more (!) marks and it will be even more funny... But, seriously
> again a question for somebody who *knows* the details: concretely WHAT is
> optimized, and WHY it cannot be done with parameterized sort.

See above.  It's pretty useless to repeat questions within a single
post, isn't it?


>> Oh, and, a warning when you try to do such measurements: if you
>> mistakenly call sort on an already-sorted list, it only does
>> 299999 comparisons (for a 300000 items list) and finishes in
>> 0.04 seconds -- 0.78 with a comparison function, so the ratio
>> gets even worse.  If you have just one item out of order at
>> the end, 300019 comparisons, and the same performances.  This
>> is truly intolerable -- we HAVE to slow this darn thing down!!!
> 
> I happen to know a little bit about serious sorting, so the warning
> is a bit redundant, but thank you anyway, it suggests at least that
> the sorting procedure does not use the naive quicksort. Unfortunately,
> apart from an attempt to be funny, which I appreciate enormously,
> it does not address my questions...

Again as said previously: nothing can answer them better than
the freely available sources; Tim's essay in the Cookbook can
help, but maybe not if you know _more_ than "a little bit about
serious sorting".  It so happens that Python list's sort method
is the best-performing sort I've ever seen deployed in production
code for many special cases of enormous practical importance
(lists that are close to being ordered or just out of order,
lists with many duplicate elements, ...), but I'm sure all of
the techniques ARE mentioned somewhere (probably in MIXAL code)
in the "Sorting and Searching" volume of Knuth's AoCP.


>> > ... the idea of Decorate-Sort-Undecorate, which begins by constructing
>> > an additional list with fast-sortable keys is also something hard to
>> > digest.
>> 
>> May I suggest some Fernet-Branca or other Italian amaro to your liking?
>> They do wonders to help you digest rich, too-abundant meals.
> 
> My <<milieu>> has perhaps some different views on what is a too-abundant
> meal, and also some different view of what is an elegant answer to a
> technical question.

Perhaps even on what IS "a technical question" -- "hard to digest", in
the milieux I'm familiar with, is only acceptable as such when discussing
gastronomy or gastroenterology; otherwise, it doesn't qualify.  That
you're unable to understand the performance characteristics of pieces
of freely available source code that you choose not to examine is hardly
"a technical question", after all.


> Further on, you suggest (in a long and explicit way)
> that it is a good idea, because the speed of comparison makes you gain
> alpha * NlogN, while the cost of decorating/undecorating is linear.
> That's OK, I never claimed that it wouldn't work. I used it myself, and
> I submitted my query because I am still convinced that it shouldn't be
> done. For some people wasting space *is* harmful independently of the
> asymptotic complexity issues.

Wasting != investing.  To waste means to consume without good, useful
purpose.  To invest a resource is to direct it to a useful desired end.

Decorate-Sort-Undecorate is a pattern, or idiom, that shows how to
*invest* space (memory -- O(N) amount thereof) for the purpose of
speeding up a sort (which remains O(N log N), but gains a better
multiplicative constant).  DSU is one of a class of patterns, or
idioms, regarding time-space tradeoffs in performance optimization.
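
In concrete terms, a minimal DSU sketch (the records here are purely
illustrative): decorate each item with a fast-comparing key plus its
original index, sort with no comparison function at all, then strip
the keys again:

    import random

    # (name, score) records, to be sorted by score.
    records = [('item%d' % i, random.random()) for i in range(1000)]

    # Decorate: build an auxiliary list whose items compare fast.  The
    # original index breaks ties without ever comparing the records.
    decorated = []
    for i in range(len(records)):
        decorated.append((records[i][1], i, records[i]))

    decorated.sort()          # plain sort, built-in comparisons only

    # Undecorate: strip the keys, keeping the records in their new order.
    records = [item[-1] for item in decorated]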

Why shouldn't space-time tradeoffs be appropriately performed to
obtain desired ends?!  If "for some people" such tradeoffs are
unacceptable under any circumstances, then those people should not
study optimization techniques of this kind, and should maybe see
an analyst about their deep-seated psychological problems (but
that's up to them, of course).

For the rest of us, the pragmatists that make up this happy band
of brothers, "shouldn't be done" is an absurd assertion to make
about any programming technique which yields desired results.  If
it works (and you "never claimed that it wouldn't work"), then *WHY*
shouldn't it be done?!  Such intolerance smacks of religious sects
more than of technical debates, particularly in such a pragmatic
field as that of programming.


>> What's surprising, at all, about this...?
> 
> Shall I repeat it once more? I think that the call overhead should be
> perhaps better optimised, and if it seems difficult, I would like to
> know why. What's surprising in my question? I am not a flea on your
> nose, Maestro.

If you want to understand the Python function call overhead, and
perhaps try your hand at optimizing it, there's really no possible
substitute for studying the freely available Python sources.  If you
manage to reliably shave 10% off function call overheads, without
bloating the Python interpreter with huge complexity, your patches
will no doubt be warmly accepted.

And they'll still leave DSU a MOST desirable technique -- remember
the speedup with DSU is a factor of 5 or 10, that is, 500% to 1000%,
so optimizing function calls by 10% or even 20% will barely touch it...

  
>> Suppose there was no call overhead at all, just a very
>> complicated comparison.  The effect would be exactly the same,
>> no?  Applying a complicated comparison function must happen
>> N log N times; applying an equally complicated transformation
>> function that makes the items rapidly comparable must happen
>> only N times, and thus becomes irrelevant for large N.
> 
> Nothing to do with my questions.
> Moreover, in  complicated, interesting cases you won't be
> able to find this transformation producing fast comparisons in
> linear time anyway, so this speculation is not so interesting.

It's not speculation AT ALL -- it's a frequent practical case!

Consider the case of strcoll vs strxfrm, for example -- as per the
ANSI/ISO 9899-1990 C standard.  Who ever said we have to "find
this transformation" at runtime?!  In many, frequent cases of
huge practical importance, the transformation is given.  DSU
is about how to use the transformation (in a time-space tradeoff)
rather than the comparison.  It also works fine in functional
programming, BTW, even though I don't think Rabhi and Lapalme
mention it (but then, their excellent book IS quite a thin one,
not an exhaustive tome of "all there's to possibly know" the
way Knuth's thick volumes aim to be) -- there's nothing inherently
procedural about it.
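
As a hedged sketch of that very point (Python 2 idiom of the time, with
purely illustrative sample words): Python's locale module exposes both
halves of the C-standard pair, so DSU can use the transformation instead
of the comparison:

    import locale

    locale.setlocale(locale.LC_COLLATE, '')   # use the user's current locale

    words = ['peach', 'Peche', 'apple', 'banana']

    # Slow way: a comparison function, called O(N log N) times.
    by_strcoll = words[:]
    by_strcoll.sort(locale.strcoll)

    # DSU way: transform each item once (O(N)) with strxfrm, whose results
    # compare, with plain built-in string comparison, the way strcoll does.
    decorated = [(locale.strxfrm(w), w) for w in words]
    decorated.sort()
    by_strxfrm = [w for (key, w) in decorated]

    print by_strcoll
    print by_strxfrm      # the two orderings should agree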

Since you seem to have something _against_ DSU (claiming that
it shouldn't be used, that time/space tradeoffs for optimization
are "wastes", etc), I think it's very relevant to defend this
technique, no matter how loud you keep shouting it has nothing
to do with your "questions" (they're more assertions than questions,
of course).


> The following paragraph neither, I am still asking about the
> function call overhead...

The subject mentions sorting.  If you want to understand
Python's call overhead, and possibly see what might be
done to reduce it, you're best advised to open a separate
specific thread.  You're ALSO best advised to study the
sources -- and you'll find plenty of willing helpers if
you show that you're equally willing to do your homework,
rather than being here for the purpose of flamewars.


>> But the same effect will apply in ANY language for the
>> appropriate kind of sorting task if the comparison op
>> you need grows rich and complicated enough to matter.
>> 
>> Assuming the language's sort has perfect optimization
>> and the language makes it very easy to build, handle
>> and dismantle sequences of tuples, of course:-).  But
>> these are hardly DEFECTS, are they?-)
> 
> What does it mean: "language's sort has perfect optimization"?

It means: it comprises absolutely no wasted effort, but only
the minimal, statistically irreducible amount.
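
As a back-of-the-envelope illustration of that "statistically irreducible
amount": a comparison sort of N distinct items needs at least log2(N!)
comparisons on average, which Stirling's approximation puts at roughly
N*log2(N) - N*log2(e) -- about five million for N = 300000, though, as
noted above, far fewer suffice on nearly-ordered input:

    import math

    def log2(x):
        return math.log(x) / math.log(2)

    n = 300000.0
    print n * log2(n) - n * log2(math.e)    # roughly five million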

> It is no magic.

I'm not talking about magic (e.g., "oracles" in the algorithms
theory sense), just about perfectly real optimization issues
of very substantial practical importance.


>> > I found it a bit ridiculous to pollute the
>> > memory with an ephemeric sequence just because there is something lousy
>> > with parametrized sort
> 
>> I think you're using loaded language, presumably reflecting
>> loaded and not-well-considered sentiments.
> 
> Thank you. I am always ready to learn how to behave correctly from
> serious, elegant people, full of sparkling humor, and who show invariably
> their full respect for all people in this newsgroup they address to.

You're welcome.  I'm always careful to show respect in exact
proportion to what is deserved by the other person's behavior
on the group, and it's heartwarming to know I've succeeded
so well in this case.


>> There's nothing "lousy" with functions in Python -- they just take
>> measurable time to call, that's all.  Most often one doesn't care all
>> that much about performance.
> 
> In the name of *who* you are claiming this?

In the name of typical users of Python.  Programmers in general are
obsessed with micro-optimization issues, but Pythonistas seem, again
in general, to have reached a more highly enlightened state, and
learned not to "sweat the small stuff".  Elegance, maintainability,
clarity, programmer productivity, are all considerations that by far
dominate issues of performance in, say, at least 9 cases out of 10
(but 8 out of 10 would be quite sufficient to claim "most often").

Sorting large sets of data is a good candidate for an exception, as
so often such a sort can become an application's bottleneck, unless
some attention is paid to the matter.


> Almost all creators of almost all programming languages care a lot about
> efficiency, about removing redundancies and optimizing frequent
> operations. 

Creators of *languages*, as opposed to creators of *implementations*,
would be rather ill-advised if they gave excessive weight to performance
issues.  Some languages have been designed (or maybe, have "grown") in
just such a mindset -- the result, almost inevitably, is the complexity
and occasional inelegance of C++, where implementation issues have
often been allowed to dominate language-design decisions.

The *implementers* of a language will weigh efficiency (of generated,
or interpreted, code) against many other issues, of course -- or else
you wouldn't see (e.g.) interpreters for languages that can quite
well be compiled down to machine-code, as previously mentioned.

> Higher-order functions, and parameterized functionals are
> absolutely fundamental in functional programming, and the compiler
> optimisations are pushed into extreme there. In OO languages all the
> caching mechanisms and some other ways of accelerating the dispatch of
> virtual methods are also quite important. Python is no exception. It is
> already very good, sure, and still improving, so I see no reason not to
> point out a place which I think is fragile (if you prefer that to the
> word "lousy").

"fragile" is just totally inappropriate when referring to performance -- I
suspect you're not using your mother tongue here (but then again, neither
am I, so, we're even).

Python is indeed an exception in always having put simplicity very, VERY
high up on the scale of values.  I like the results a lot.  Sure, I don't
mind if and when some operation gets optimized, as long as it's done
with the prudence that has always characterized Python's development.

Do study the Python internals, and find out where the moderate amount
of complexity in them is invested.  Parsing (for reasons that I don't
really understand, to be honest -- I don't know why the parsing, which
is hardly a key bottleneck, isn't done in the simplest possible way), 
dictionaries, sorting, and a very few other major hotspots -- small
areas of code that have disproportionate impact on overall performance,
just as could be predicted.

Do, in particular, look for the complications used to "push into
extreme ... the optimizations" for any of the cases you mention.  Then,
when you're done looking, and I suspect having come up with scant
findings, feel free to experiment in inserting whatever complications
you believe would make a major difference.  I'm pretty confident
that nothing will even begin to _scratch_ the 1000% advantages so
easily measured for DSU, of course:-).



> I appreciate your contribution to the Cookbook and I understand that
> you will defend DSU.
> 
>> That's not "polluting the memory with an ephemeric sequence" --
>> it's investing some memory (not much -- O(N) memory, peanuts!)
>> to reduce computation.  A hallowed, time-honored approach, you know.
> 
> It is your philosophy but there are others. For some people space matters,
> you know.

It's not a matter of "for some people" -- the way you put it, you do
make it seem a psychopathological issue, you know.  If you don't have
the space to invest, then you don't -- that doesn't depend on "people"
as much as on what size of program is running on what size of machine.

In a vast majority of cases, O(N) -- typically with a reasonably
small multiplicative constant -- is a reasonable and available amount
of space to invest to speed up an O(N log N) sort by (say) 10 times.
The space is taken only during the sort itself, after all: if it's
there at all, it's a good way to use it.

When you sort data sets that just push the boundaries of your currently
available space, there may be special cases where the sort per se
would work -- there is JUST enough space for that -- but the extra
O(N) memory is the straw that breaks the camel's back.

They're quite special cases because when you need to sort just a bit
more data than that again, then you can't fit the data themselves
in memory any longer, together with program and working space.  Then,
you use other sorting techniques -- see my example of external mixed
mergesort using generators in the "iterators and generators" tutorial
I gave at Europython, 
http://europython.zope.nl/sessions/presentations/PythonTutorials/EPC2002-Martelli-IteratorsGenerators.pdf
(one single long URL -- rejoin if split by news / mail software).
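
For the flavor of the generator-based approach only (this is NOT the code
from those slides, just a toy lazy two-way merge of already-sorted streams,
written for the Python 2.2 of the time):

    from __future__ import generators   # generators need this in 2.2

    def first(it):
        # Return (1, value) for the iterator's next item, (0, None) when
        # the iterator is exhausted.
        try:
            return 1, it.next()
        except StopIteration:
            return 0, None

    def merge(left, right):
        # Lazily merge two already-sorted streams into one sorted stream.
        left, right = iter(left), iter(right)
        have_a, a = first(left)
        have_b, b = first(right)
        while have_a and have_b:
            if b < a:
                yield b
                have_b, b = first(right)
            else:
                yield a
                have_a, a = first(left)
        while have_a:
            yield a
            have_a, a = first(left)
        while have_b:
            yield b
            have_b, b = first(right)

    print list(merge([1, 3, 5, 7], [2, 3, 6]))   # [1, 2, 3, 3, 5, 6, 7]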

I have no special interest in that narrow zone where the data fit
but the extra O(N) space isn't available -- if I can't be confident
of having the space for a normal DSU, I'll jump right over to some
sort approach where most of the data can reside on-disk instead.
If for some strange and wondrous reason you need to target exactly
that narrow zone, OK, but you can hardly expect special effort from
the point of view of a general-purpose language and library to help
you target that very special case.


Alex



