sorting many arrays from one...

Tim Peters tim.one at comcast.net
Mon Jul 15 03:47:19 EDT 2002


[Alex Martelli, on list.sort() special cases]
> Right.  The only extra case I've often wished it handled (defining
> "often" as "more than once":-) is when the FIRST element is out of
> order and all the rest are OK.  That happens in mergesorting huge
> streams,

Is the number of streams also large?  Just curious.

> in the core merge loop:
>
>     while streams:
>         yield streams[0][0]
>         try:
>             streams[0][0] = streams[0][2]()
>         except StopIteration:
>             del streams[0]
>         else:
>             streams.sort()  # this is the first-maybe-OOO sort

Understood, and that's reasonable.  I expect the best a general sort method
can do with this is worst-case N+log2(N) comparisons, give or take 1 or 2,
where N=len(streams), the bulk of that because it has to establish that only
one is out of order (it takes N-1 comparisons just to establish that an
already-sorted list is already sorted).  I have in mind a simple way to
achieve that, so hold your breath <wink>.

By arranging the streams in a min-heap yourself, you can guarantee
worst-case log2(N) comparisons.  If the number of streams can be large,
that's a monster-big improvement; it's already a factor-of-3 improvement
over N+log2(N) for N as small as 7.

> The "streams" auxiliary list is originally built from the sequence X
> of already-sorted streams as:
>
> streams = [ [X[i].next(), i, X[i].next] for i in range(len(X)) ]

Ooh!  In 2.3 you'll be able to say

    streams = [[s.next(), i, s.next] for (i, s) in enumerate(X)]

It grows on you.

> Normally N==len(streams) isn't all that big, and I COULD make each
> leg of the "while streams:" loop O(N) anyway by using knowledge
> about _which_ special cases method sort is so good at -- changing
> the else clause into:
>
>         else:
>             streams.append(streams.pop(0))
>             streams.sort() # the maybe-OOO item is now *LAST*...
>
> but that's not all that pleasant -- I'd do it only in emergencies...

That would indeed get you out with N+log2(N) comparisons, N of which you
know aren't needed, but which list.sort() can't know aren't needed.  If it
is an emergency someday <wink>, I'd try

        else:
            s = streams.pop(0)
            bisect.insort(streams, s)

instead.  The smallest N for which that pays depends on how expensive
element comparisons are, but the crossover point is generally small.  The
main loop can also be recoded naturally with less fiddly chained indexing
then:

     insert = bisect.insort
     while streams:
         value, i, fetch = streams.pop(0)
         yield value
         try:
             value = fetch()
         except StopIteration:
             pass
         else:
             insert(streams, [value, i, fetch])

This kind of thing goes faster if streams is made a list of 3-tuples instead
of 3-lists, if you're concerned about micro-optimization here.






More information about the Python-list mailing list