[Tutor] Question about nested FOR looping [an implementation of a 'mergeiter' merging iterator function]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Wed Apr 30 17:45:02 2003


On Wed, 30 Apr 2003, Michael Janssen wrote:

> On Wed, 30 Apr 2003 stuart_clemons@us.ibm.com wrote:
>
> > So, as a follow-up out of curiosity, which is the most efficient:
> >
> > 1) Read the two files into lists and then process the lists, or
>
> with (very) large files this is forbidden because of memory consumption.
> With reasonably small files it can be handy, especially if you need random
> access (not looping line-by-line).
>
> > 2) Process directly from the file reads, using seek in the nested FOR, or
>
> *can* be good for memory.  NB: "for line in fo.readlines():" builds a
> (possibly) huge list of lines in memory.  fo.xreadlines() is better, or,
> in recent versions of Python, "for line in fo:".

Hi everyone,


But just to mention: using seek() is not necessary.  If we're doing a
merge between two sequences, and if the two sequences are themselves
already sorted, we can do a simultaneous linear pass through both
sequences just once, and avoid multiple scans.
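To make the idea concrete before the full generator version, here's a
minimal sketch of that single linear pass over two plain sorted lists.
(The helper name `merge_sorted` is just for illustration; it isn't part
of the module that follows.)

```python
def merge_sorted(xs, ys):
    """Hypothetical helper: merge two already-sorted lists in one
    simultaneous linear pass, never rescanning either input."""
    result = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            result.append(xs[i])
            i += 1
        else:
            result.append(ys[j])
            j += 1
    # At most one of these tails is non-empty.
    result.extend(xs[i:])
    result.extend(ys[j:])
    return result

print(merge_sorted([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]))
# [1, 2, 3, 4, 4, 6, 7, 8, 9, 10]
```

Each element of either input is looked at exactly once, which is where
the linear running time comes from.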


Since you already have a good dictionary solution to your problem, I feel
it should be ok to post an implementation of merging for you.  Here's a
module that merges two iterable objects, using that left-hand/right-hand
process we talked about earlier:


###
#!/usr/bin/env python

"""mergeiter.py: An extended example of generators in action.  Provides a
function called mergeiter that merges two iterators together.

Danny Yoo (dyoo@hkn.eecs.berkeley.edu)
"""


from __future__ import generators
import sys


def mergeiter(i1, i2, cmp=cmp):
    """Returns the "merge" of i1 and i2.  i1 and i2 must be iterable
    objects, and we assume that i1 and i2 are both individually sorted.
    """
    left, right = ExtendedIter(i1), ExtendedIter(i2)
    while 1:
        if not left.has_next():
            ## Left side is exhausted: drain the right side.
            while right.has_next():
                yield ('r', right.next())
            return
        if not right.has_next():
            ## Right side is exhausted: drain the left side.
            while left.has_next():
                yield ('l', left.next())
            return
        comparison = cmp(left.peek(), right.peek())
        if comparison < 0:
            yield ('l', left.next())
        elif comparison == 0:
            right.next()
            yield ('=', left.next())
        else:
            yield ('r', right.next())


class ExtendedIter:
    """An extended iterator that wraps around an existing iterators.
    It provides extra methods:

        has_next(): checks if we can still yield items.

        peek(): returns the next element of our iterator, but doesn't
                pass by it."""
    def __init__(self, i):
        self._myiter = iter(i)
        self._next_element = None
        self._has_next = 0
        self._prime()


    def has_next(self):
        """Returns true if we can call next() without raising a
        StopException."""
        return self._has_next


    def peek(self):
        """Nonexhaustively returns the next element in our iterator."""
        assert self.has_next()
        return self._next_element


    def next(self):
        """Returns the next element in our iterator."""
        if not self._has_next:
            raise StopIteration
        result = self._next_element
        self._prime()
        return result


    def _prime(self):
        """Private function to initialize the states of
        self._next_element and self._has_next.  We poke our
        self._myiter to see if it's still alive and kicking."""
        try:
            self._next_element = self._myiter.next()
            self._has_next = 1
        except StopIteration:
            self._next_element = None
            self._has_next = 0




def _test():
    for item in mergeiter([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]):
        print item


if __name__ == '__main__':
    ##  _test()
    f1, f2 = open(sys.argv[1]), open(sys.argv[2])
    for item in mergeiter(f1, f2):
        print item
###


Here's a sample of it in action:

###
>>> import mergeiter
>>> mergeiter._test()
('r', 1)
('l', 2)
('r', 3)
('=', 4)
('l', 6)
('r', 7)
('l', 8)
('r', 9)
('r', 10)
###


The program uses a constant amount of memory, regardless of the size of
the files, and should run in linear time since the code does a single pass
through both files.
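As a usage sketch: the ('l', 'r', '=') tags make it easy to do set-like
comparisons between two sorted streams, e.g. picking out the items common
to both.  Here's a compact stand-alone version over plain lists (the
names `tagged_merge` and `common` are just for illustration; this is the
same single-pass idea, not the module above):

```python
def tagged_merge(xs, ys):
    """Illustrative helper: single-pass merge of two sorted lists,
    tagging each item 'l' (left only), 'r' (right only), or '='
    (present in both)."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            out.append(('l', xs[i]))
            i += 1
        elif xs[i] > ys[j]:
            out.append(('r', ys[j]))
            j += 1
        else:
            out.append(('=', xs[i]))
            i += 1
            j += 1
    out.extend([('l', x) for x in xs[i:]])
    out.extend([('r', y) for y in ys[j:]])
    return out

## Items present in both sequences:
common = [v for (tag, v) in tagged_merge([2, 4, 6, 8], [1, 3, 4, 7, 9, 10])
          if tag == '=']
print(common)
# [4]
```

The same filtering works on the generator version, line by line, without
ever holding a whole file in memory.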

The code might seem a little mysterious because of the generators stuff.
Generators are a new feature in Python 2.2, and I've been having way too
much fun with them.  *grin* You can find out more about generators here:

    http://www.python.org/peps/pep-0255.html



If this mergeiter() function is useful enough, and if speed is a real
concern, we can show how one can recode this as a C extension.  *grin*



I hope this helps!