Merging sorted lists/iterators/generators into one stream of values...

Alex Martelli aleax at mail.comcast.net
Mon Oct 10 10:34:50 EDT 2005


Mike C. Fletcher <mcfletch at rogers.com> wrote:
   ...
> One thing to keep in mind (if you care about performance) is that you
> one could use bisect, instead of sort, as the sorted list of streams is
> already in order save for the one element you are processing.  Btw, nice
> trick with reverse to reduce memory copying, when did that get 
> introduced?

Python 2.4

>  Wonder if bisect can deal with reverse-sorted elements.  

Not AFAIK.

> Anyway, should reduce the O-complexity of that part of the operation,
> though you'll still have to do a memcpy to shift the rest of the source
> list's array, and if it can't deal with reverse-sorted lists it would
> move you back to front-of-list popping.

Yeah, if you have to pop the front you're O(N) anyway, which is
basically what timsort does with nearly-ordered lists (though timsort
uses O(N) _comparisons_ while bisecting would use O(N) _moves_).


> Oh, we're still missing use of a comparison function in both versions.

Trivial, just have the functions accept a key= and use that to prepare
the auxiliary list they're using anyway.

> I'd think you'd want that functionality *available* if you're going to
> make this a general tool.  You'll also need to check for StopIteration
> on creation of sources for null sequences.

True, the functions as prepared don't accept empty streams (exactly
because they don't specialcase StopIteration on the first calls to
next).  Pretty trivial to remedy, of course.

> Finally, why  the 'i' 
> element?  It's never used AFAICS.

It's used by the list's lexicographic comparison when the first elements
of two lists being compared are equal (which can happen, of course) to
avoid comparing the last elements (which are methods... no big deal if
they get compared, but it makes no logical sense).

So, an example enhanced merge_by_sort:

 
def merge_by_sort(streams, key=None):

  if not key: key = lambda x: x
  sources = []
  for i, s in enumerate(streams):
    try: first_item = s.next()
    except StopIteration: pass
    else: sources.append((key(item), i, item, s.next))

  while sources:
    sources.sort(reverse=True)
    best_source = sources[-1]
    yield best_source[2]
    try: best_source[2] = best_source[-1]()
    except StopIteration: sources.pop()
    else: best_source[0] = key(best_source[2])


Of course, since the sort method DOES accept a key= parameter, this
could be simplified, but I'll leave it like this to make it trivial to
see how to recode the merging by heap as well (in 2.4)...


Alex




More information about the Python-list mailing list