[Tutor] Problem When Iterating Over Large Test Files

Steven D'Aprano steve at pearwood.info
Fri Jul 20 11:42:06 CEST 2012


Ryan Waples wrote:
>> I count only 19 lines.
> 
> yep, you are right.  My bad, I think I missed copy/pasting line 20.
> 
>> The first group has only three lines. See below.
> 
> Not so, the first group is actually the first four lines listed below.
>  Lines 1-4 serve as one group.  For what it is worth, line four should
> have 1 character for each char in line 1, and the first line is much
> shorter, contains a space, and for this file always ends in either
> "1:N:0:" (keep) "1"Y"0:" (remove).   The EXAMPLE data is correctly
> formatted as it should be, but I'm missing line 20.


Ah, I had somehow decided that the + was a group delimiter. Which would make 
more sense than having an (apparently) arbitrary plus sign in line 3.

The more information you can supply about the format, the better. Perhaps 
someone will even come up with a standard parser for these files, so that 
every biomed researcher doesn't have to re-invent the wheel every time they 
open one of these files.


[...]
> I think you are just reading one frame shifted; it's not a well
> designed format because the required start character "@" can appear
> in other places as well....

Yes, likely I am reading it shifted, and no, it is not a well-designed format.


> I'm pretty sure that my raw IN files are all good; it's hard to be sure
> with such a large file, but the very picky downstream analysis program
> takes every single raw file just fine (30 of them), and gags on my
> filtered files, at regions that don't conform to the correct
> formatting.

All I can suggest is that you add more error-checking to your code. For each 
line, or at least for each group of lines, check that the format is as you 
expect, both before and after you write the data.
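For example, a check on each group of four lines might look something like this. (A sketch only: the specific rules here are my guesses based on your description of the format -- ID line starting with "@", a "+" line, quality line as long as the sequence line -- so adjust them to match the real files.)

```python
def check_group(id_line, seq_line, plus_line, quality_line):
    """Return a list of format problems found in one four-line group.

    The rules below are guesses about the format; change them
    to match whatever the downstream program actually requires.
    """
    problems = []
    if not id_line.startswith("@"):
        problems.append("ID line does not start with '@'")
    if not plus_line.startswith("+"):
        problems.append("third line does not start with '+'")
    if len(quality_line) != len(seq_line):
        problems.append("quality line length != sequence line length")
    return problems
```

Call it on each group both before and after filtering, and print the group number whenever the returned list is non-empty; that should tell you whether the bad data is coming in from the raw file or being introduced by your code.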

You might also like to do a disk-check of the disk in question, just in case 
it is faulty and corrupting the data (very unlikely, but it wouldn't hurt to 
check).


>  > for reads, lines in four_lines( INFILE ):
>                 ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines

Argggh! I screwed that up. Sorry, I left out a call to enumerate(). It should be:


     for reads, lines in enumerate(four_lines(INFILE)):
         ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines


Sorry about that.


> Can you explain what is going on here, or point me in the right
> direction?  I see that the parts of 'lines' get assigned, but I'm
> missing how the file gets iterated over and how reads gets
> incremented.

There are four things you need to know:


(1) File objects in Python are "iterators" -- the idea is that certain objects 
obey a protocol that says, each time you call the next() function on them, 
they hand over one piece of data, and then advance to the next value. File 
objects are one such iterator: each time you call next(file_object), you get 
one line of text, until EOF.

For-loops naturally work with iterators: under the hood, they repeatedly call 
next() on the object, until the iterator signals it is done.
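You can watch the protocol work by calling next() by hand. (A small demonstration using a list iterator; a file object behaves exactly the same way, handing over one line per call.)

```python
# iter() turns a list into an iterator; file objects are already iterators
it = iter(["line 1\n", "line 2\n"])
print(next(it))  # hands over the first item
print(next(it))  # ...then the next one
# one more next(it) would raise StopIteration -- the "no more data" signal
# that tells a for-loop to stop
```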

So instead of the old way:

f = open("myfile", "r")
# grab all the lines at once, if you can
lines = f.readlines()  # hope you don't run out of memory...
for line in lines:
     do_something_with(line)


the new way uses less memory and is safer:

f = open("myfile", "r")
for line in f:  # file objects iterate over lines
     do_something_with(line)


(2) But what I do here is create my own iterator, called "four_lines", using 
what Python calls a generator. A generator is a special type of function which 
behaves as an iterator: instead of returning a single value and stopping, a 
generator can return multiple values, one at a time, each time you call next() 
on it. The presence of "yield" instead of "return" is what makes a function a 
generator.
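A tiny example, unrelated to files, may make the idea concrete:

```python
def count_to_three():
    yield 1  # the first call to next() stops here and hands back 1
    yield 2  # the second call resumes here
    yield 3  # after this, the function ends, which signals StopIteration

for n in count_to_three():
    print(n)  # prints 1, then 2, then 3
```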

So four_lines() is a generator which takes a file object as argument. It 
repeatedly does the following steps:

   a) grab one line from the file object; if that works, great,
      otherwise signal EOF and we're done;

   b) grab three more lines from the file object, but this time
      don't signal EOF, just pad the lines with empty strings;

   c) package those four lines into a tuple of four values and
      yield them (like a return, only the function doesn't exit yet);

   d) surrender control back to the calling code, in this case
      the for-loop;

   e) wait for the next loop, and go back to step a).

It does this repeatedly until the file object signals EOF, at which point 
four_lines signals EOF too.
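The steps above can be sketched like this. (Not necessarily the exact code I posted earlier, but the same idea.)

```python
def four_lines(file_object):
    """Yield lines from file_object in groups of four.

    The first line of each group must exist; if the file runs out
    part-way through a group, the missing lines are padded with "".
    """
    while True:
        try:
            first = next(file_object)  # step a: grab one line
        except StopIteration:
            return  # EOF before a new group starts: we're done too
        group = [first.strip()]
        for _ in range(3):  # step b: grab three more lines
            try:
                group.append(next(file_object).strip())
            except StopIteration:
                group.append("")  # pad a short final group
        yield tuple(group)  # steps c and d: hand the group to the caller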


(3) enumerate() is yet another iterator. (You may notice that Python is really 
into iterators and processing data as needed, instead of up-front in a 
list.) What enumerate does is take another iterator as argument, in this case 
the output of four_lines, and simply package that up with a count of how many 
times you've called it. So instead of keeping your own loop variable, 
enumerate does it for you.

An example:

py> for i, x in enumerate(['fe', 'fi', 'fo', 'fum']):
...     print(i, x)
...
0 fe
1 fi
2 fo
3 fum


Notice that the count starts at 0 instead of 1. If you don't like that, just 
add one to the result, or pass a starting value: enumerate(items, 1).

The end result is that I chain a sequence of iterator calls:

- the INFILE file object iterator provides lines of text, one per call;

- the four_lines iterator accumulates the lines of text from INFILE, strips 
them of whitespace, and provides them in groups of four per call;

- the enumerate iterator grabs each group from four_lines, and provides a loop 
counter and the group it just grabbed;

- finally the for-loop assigns the loop counter to reads and the group of 
lines to lines.


(4) Last but not least: the second of the loop variables is called "lines", 
because it holds the group of four lines provided by four_lines. So the first 
thing the for-loop block does is grab those four lines and assign them all to 
names:

ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines


Here's a simple example of how this works:

py> group = ('alpha', 'beta', 'gamma')
py> a, b, c = group
py> a
'alpha'
py> b
'beta'
py> c
'gamma'


This is sometimes called "tuple unpacking", or "sequence unpacking".



> Do you have a reason why this approach might give a 'better' output?

If there was a bug in your code, my code would be unlikely to contain the same 
bug and so should give a better -- or at least different -- result.




-- 
Steven


