[Tutor] Question about nested FOR looping [an implementation
of a 'mergeiter' merging iterator function]
Wed Apr 30 17:45:02 2003
On Wed, 30 Apr 2003, Michael Janssen wrote:
> On Wed, 30 Apr 2003 firstname.lastname@example.org wrote:
> > So, as a follow-up out of curiosity, which is the most efficient:
> > 1) Read the two files into lists and then process the lists, or
> with (very) large files this is forbidden because of memory consumption.
> With reasonably small files it can be handy, especially if you need
> random access (not looping line-by-line).
> > 2) Process directly from the file reads, using seek in the nested FOR, or
> *can* be good for memory. NB: "for line in fo.readlines():" builds a
> (possibly) huge list of lines in memory. fo.xreadlines() is better, or,
> in recent versions of Python, "for line in fo:".
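To make the quoted advice concrete, here is a small sketch of the two styles (in modern Python, where the "for line in fo:" form won out); it uses an in-memory io.StringIO file so it is self-contained:

```python
import io

data = "alpha\nbeta\ngamma\n"

# 1) readlines() materializes every line as one big list up front:
fo = io.StringIO(data)
all_lines = fo.readlines()               # whole file in memory at once

# 2) iterating over the file object reads one line at a time:
fo = io.StringIO(data)
streamed = [line.strip() for line in fo]  # only one line held per step

print(all_lines)   # -> ['alpha\n', 'beta\n', 'gamma\n']
print(streamed)    # -> ['alpha', 'beta', 'gamma']
```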
But just to mention: using seek() is not necessary. If we're doing a
merge between two sequences, and if the two sequences are themselves
already sorted, we can do a simultaneous linear pass through both
sequences just once, and avoid the multiple scans.
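That single-pass idea, sketched on plain lists (merge_sorted is my own illustrative name, not part of the module below): keep one index into each sorted sequence and always take the smaller head, so neither sequence is ever rescanned.

```python
def merge_sorted(a, b):
    """Merge two already-sorted lists in one simultaneous pass."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    # One of the two is exhausted; append the leftovers of the other.
    return result + a[i:] + b[j:]

print(merge_sorted([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]))
# -> [1, 2, 3, 4, 4, 6, 7, 8, 9, 10]
```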
Since you already have a good dictionary solution to your problem, I feel
it should be ok to post an implementation of merging for you. Here's a
module that performs a merge between two iterable objects, using that
left-hand/right-hand process we talked about earlier:
"""mergeiter.py: An extended example of generators in action. Provides a
function called mergeiter that merges two iterators together.
Danny Yoo (email@example.com)
from __future__ import generators
def mergeiter(i1, i2, cmp=cmp):
"""Returns the "merge" of i1 and i2. i1 and i2 must be iteratable
objects, and we assume that i1 and i2 are both individually sorted.
left, right = ExtendedIter(i1), ExtendedIter(i2)
if not left.has_next():
while 1: yield ('r', right.next())
if not right.has_next():
while 1: yield ('l', left.next())
comparison = cmp(left.peek(), right.peek())
if comparison < 0:
yield ('l', left.next())
elif comparison == 0:
right.next() ; yield ('=', left.next())
yield ('r', right.next())
"""An extended iterator that wraps around an existing iterators.
It provides extra methods:
has_next(): checks if we can still yield items.
peek(): returns the next element of our iterator, but doesn't
pass by it."""
def __init__(self, i):
self._myiter = iter(i)
self._next_element = None
self._has_next = 0
"""Returns true if we can call next() without raising a
"""Nonexhaustively returns the next element in our iterator."""
"""Returns the next element in our iterator."""
if not self._has_next:
result = self._next_element
"""Private function to initialize the states of
self._next_element and self._has_next. We poke our
self._myiter to see if it's still alive and kicking."""
self._next_element = self._myiter.next()
self._has_next = 1
self.next_element = None
self._has_next = 0
for item in mergeiter([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]):
if __name__ == '__main__':
f1, f2 = open(sys.argv, sys.argv)
for item in mergeiter(f1, f2):
Here's a sample of it in action:

>>> import mergeiter
>>> for item in mergeiter.mergeiter([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]):
...     print item
...
('r', 1)
('l', 2)
('r', 3)
('=', 4)
('l', 6)
('r', 7)
('l', 8)
('r', 9)
('r', 10)
The program uses a constant amount of memory, regardless of the size of
the files, and should run in linear time since the code does a single pass
through both files.
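As an aside, the ('l', 'r', '=') tags make the merge usable as a comm(1)-style comparison of two sorted inputs. Here's a hedged re-statement of the same idea in modern Python syntax (merge_tagged is my own name, not the module's code):

```python
def merge_tagged(i1, i2):
    """Single-pass merge of two sorted iterables, tagging each value
    'l' (left only), 'r' (right only), or '=' (in both)."""
    it1, it2 = iter(i1), iter(i2)
    sentinel = object()                      # marks an exhausted side
    a, b = next(it1, sentinel), next(it2, sentinel)
    while a is not sentinel and b is not sentinel:
        if a < b:
            yield ('l', a); a = next(it1, sentinel)
        elif a > b:
            yield ('r', b); b = next(it2, sentinel)
        else:
            yield ('=', a)
            a, b = next(it1, sentinel), next(it2, sentinel)
    # Drain whichever side is left over.
    while a is not sentinel:
        yield ('l', a); a = next(it1, sentinel)
    while b is not sentinel:
        yield ('r', b); b = next(it2, sentinel)

# e.g. items appearing only in the left input:
only_left = [x for tag, x in merge_tagged('abd', 'bce') if tag == 'l']
print(only_left)   # -> ['a', 'd']
```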
The code might seem a little mysterious because of the generator stuff.
Generators are a new feature in Python 2.2, and I've been having way too
much fun with them. *grin* You can find out more about generators here:
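As a tiny illustration of the feature (shown in modern syntax; under Python 2.2 you'd also need the "from __future__ import generators" line):

```python
# A minimal generator: each yield hands back one value and suspends
# the function until the caller asks for the next one.
def countdown(n):
    while n > 0:
        yield n
        n -= 1

print(list(countdown(3)))  # -> [3, 2, 1]
```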
If this mergeiter() function is useful enough, and if speed is a real
concern, we can show how one can recode this as a C extension. *grin*
I hope this helps!