[Tutor] Reading & printing lines from two different files

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon, 10 Jun 2002 12:07:06 -0700 (PDT)


[Warning: this generator stuff depends on Python 2.2; we could mimic it
with Python 2.1, but it takes a little more work.]


> def generate_line(filename):
>     thisfile = open(filename, 'r')
>     for line in thisfile:
>         yield line

Actually, we can get this one for free --- files already implement the
iterator interface, so we can say:

###
def generate_line(filename):
    return iter(open(filename))
###

What this returns is an "iterator": something that marches through the
file a line at a time.  Here's an example that shows what a file iterator
looks like:

###
>>> file_iter = iter(open('/usr/share/dict/words'))
>>> file_iter
<xreadlines.xreadlines object at 0x8151760>
>>> file_iter.next()
'Aarhus\n'
>>> file_iter.next()
'Aaron\n'
>>> file_iter.next()
'Ababa\n'
>>> file_iter.next()
'aback\n'
>>> file_iter.next()
'abaft\n'
###


> while 1:
>     try:
>         print gen_one.next()
>         print gen_two.next()
>     except StopIteration:
>         break
>
> ==================================
>
> This does work here, tho it seems like an awful lot of code compared with
> your zip-and-readlines combination. Is there any simple way to explain why
> this is not such a memory hog?

readlines() sucks all of the lines into memory all at once, saving these
lines in a list.  Usually, this is a good thing, because it allows us to
look at any particular line --- we could look at line 42, or line 1009, or
line 4, or ... without having to do anything more than a simple list
element access.  What we get is the ability to "randomly-access" any line
in a file, and that's quite convenient.
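Here's a small sketch of that random access.  (The filename
'demo_lines.txt' is made up just for this example, and the file gets
written and cleaned up within the sketch so it's self-contained.)

```python
import os

# Write a small throwaway file so the example is self-contained.
out = open('demo_lines.txt', 'w')
out.write('alpha\nbeta\ngamma\ndelta\n')
out.close()

f = open('demo_lines.txt')
lines = f.readlines()      # slurps every line into one list
f.close()

third_line = lines[2]      # random access: jump straight to line 3
line_count = len(lines)    # and we know the length up front

os.remove('demo_lines.txt')
```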


But there is a cost to using readlines(): we load the whole file into
memory.  This becomes an issue if we're dealing with huge text files.
When we use iterators, we tell Python to sequentially march through our
sequence.  No more random access, but on the other hand, we only read a
line at a time.  So that's where the savings come in.
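To make the contrast concrete, here's a sketch that walks a file a line
at a time --- at no point is the whole file in memory, only the current
line.  (Again, the filename is invented for the example.)

```python
import os

# Build a thousand-line file to march through.
out = open('demo_big.txt', 'w')
for i in range(1000):
    out.write('line %d\n' % i)
out.close()

count = 0
f = open('demo_big.txt')
for line in f:             # the file hands us one line at a time
    count = count + 1      # only 'line' is ever held in memory
f.close()

os.remove('demo_big.txt')
```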



Here's a way of having the best of both worlds: making this look nice,
and having it be efficient too:

###
def zipiter(*sequences):
    """A generator version of the zip() function."""
    sequence_iters = [iter(seq) for seq in sequences]
    while 1:
        next_row = [seq_iter.next() for seq_iter in sequence_iters]
        yield tuple(next_row)


def printAlternatingLines(file1, file2):
    for (line1, line2) in zipiter(file1, file2):
        print line1
        print line2
###

(Warning: I have not tested this code yet.  I know this is going to bite
me, so I'll double check this tonight to make sure it works.)
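In the meantime, here's a quick sanity check of the idea, exercised on
two small sequences.  (I've restated the generator with an explicit
try/except around the inner step so this sketch stands alone; the
behavior should match zip() --- it stops with the shortest sequence.)

```python
def zipiter(*sequences):
    """A generator version of the zip() function: yields tuples lazily."""
    sequence_iters = [iter(seq) for seq in sequences]
    while 1:
        try:
            next_row = [seq_iter.__next__() for seq_iter in sequence_iters]
        except StopIteration:
            return           # shortest sequence exhausted; stop cleanly
        yield tuple(next_row)

pairs = list(zipiter('abc', [1, 2, 3, 4]))
# like zip(), it stops once the shorter sequence runs out
```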




> One thing I'm not sure of in the example above is where to put the
> thisfile.close() line to close those files again. My computer doesn't
> seem any worse for not closing them, but it does seem like bad
> manners...

We can leave it off in many cases, since Python will garbage collect
files that are no longer accessible.  However, if we're writing content
into a file, closing the file explicitly is a good idea, just to make
sure our mess is cleaned up.  *grin*
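One pattern that makes the close() unconditional is try/finally --- the
close runs even if something in the middle raises.  (A sketch; the
filename is made up for the example.)

```python
import os

out = open('demo_out.txt', 'w')
try:
    out.write('hello\n')
finally:
    out.close()          # runs even if the write blows up

f = open('demo_out.txt')
try:
    contents = f.read()
finally:
    f.close()

os.remove('demo_out.txt')
```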

(Actually, the Jython variant of Python requires explicitly closing
files when we write.  See:

    http://www.jython.org/cgi-bin/faqw.py?req=show&file=faq03.008.htp

for more details.)



Hope this helps!