[Tutor] Question about nested FOR looping [an implementation
of a 'mergeiter' merging iterator function]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Wed Apr 30 17:45:02 2003
On Wed, 30 Apr 2003, Michael Janssen wrote:
> On Wed, 30 Apr 2003 stuart_clemons@us.ibm.com wrote:
>
> > So, as a follow-up out of curiosity, which is the most efficient:
> >
> > 1) Read the two files into lists and then process the lists, or
>
> with (vary) large files this is forbidden because of memory-consumption.
> With reasonable small files it can be handy espessially if you need random
> access (not looping line-by-line).
>
> > 2) Process directly from the file reads, using seek in the nested FOR, or
>
> *can* be good for memory. NB: "for line in fo.readlines():" build a
> (possibly) huge list of lines within memory. fo.xreadlines() is better or
> recent-version-of-python "for line in fo:"
Hi everyone,
But just to mention: using seek() is not necessary. If we're doing a
merge between two sequences, and if the two sequences are themselves
already sorted, we can do a simulaneous linear pass through both sequences
just once, and avoid the multiple scan.
Since you already have a good dictionary solution to your problem, I feel
it should be ok to post an implementation of merging for you. Here's a
module that applies a merging between two iteratable objects, using that
left-hand/right-hand process we talked about earlier:
###
#!/usr/bin/env python
"""mergeiter.py: An extended example of generators in action. Provides a
function called mergeiter that merges two iterators together.
Danny Yoo (dyoo@hkn.eecs.berkeley.edu)
"""
from __future__ import generators
def mergeiter(i1, i2, cmp=cmp):
"""Returns the "merge" of i1 and i2. i1 and i2 must be iteratable
objects, and we assume that i1 and i2 are both individually sorted.
"""
left, right = ExtendedIter(i1), ExtendedIter(i2)
while 1:
if not left.has_next():
while 1: yield ('r', right.next())
if not right.has_next():
while 1: yield ('l', left.next())
comparison = cmp(left.peek(), right.peek())
if comparison < 0:
yield ('l', left.next())
elif comparison == 0:
right.next() ; yield ('=', left.next())
else:
yield ('r', right.next())
class ExtendedIter:
"""An extended iterator that wraps around an existing iterators.
It provides extra methods:
has_next(): checks if we can still yield items.
peek(): returns the next element of our iterator, but doesn't
pass by it."""
def __init__(self, i):
self._myiter = iter(i)
self._next_element = None
self._has_next = 0
self._prime()
def has_next(self):
"""Returns true if we can call next() without raising a
StopException."""
return self._has_next
def peek(self):
"""Nonexhaustively returns the next element in our iterator."""
assert self.has_next()
return self._next_element
def next(self):
"""Returns the next element in our iterator."""
if not self._has_next:
raise StopIteration
result = self._next_element
self._prime()
return result
def _prime(self):
"""Private function to initialize the states of
self._next_element and self._has_next. We poke our
self._myiter to see if it's still alive and kicking."""
try:
self._next_element = self._myiter.next()
self._has_next = 1
except StopIteration:
self.next_element = None
self._has_next = 0
def _test():
for item in mergeiter([2, 4, 6, 8], [1, 3, 4, 7, 9, 10]):
print item
if __name__ == '__main__':
## _test()
f1, f2 = open(sys.argv[1], sys.argv[2])
for item in mergeiter(f1, f2):
print item
###
Here's a sample of it in action:
###
>>> import mergeiter
>>> mergeiter._test()
('r', 1)
('l', 2)
('r', 3)
('=', 4)
('l', 6)
('r', 7)
('l', 8)
('r', 9)
('r', 10)
###
The program uses a constant amount of memory, regardless of the size of
the files, and should run in linear time since the code does a single pass
through both files.
The code might seem a little mysterious because of the generators stuff.
Generators are a new feature in Python 2.2, and I've been having way too
much fun with them. *grin* You can find out more about generators here:
http://www.python.org/peps/pep-0255.html
If this mergeiter() function is useful enough, and if speed is a real
concern, we can show how one can recode this as a C extension. *grin*
I hope this helps!