[Tutor] Problem When Iterating Over Large Test Files

Thu Jul 19 09:06:23 CEST 2012

> I count only 19 lines.

Yep, you are right.  My bad, I think I missed line 20 when copying/pasting.

>The first group has only three lines. See below.

Not so, the first group is actually the first four lines listed below.
Lines 1-4 serve as one group.  For what it is worth, line four should
have 1 character for each char in line 2, and the first line is much
shorter, contains a space, and for this file always ends in either
"1:N:0:" (keep) or "1:Y:0:" (remove).  The EXAMPLE data is correctly
formatted as it should be, but I'm missing line 20.
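
A minimal sketch of that four-line layout written as a check. The function
names looks_like_record and keep_record are hypothetical (they are not from
the original script), and the sketch assumes the quality line is meant to
match the line of bases in length:

```python
def looks_like_record(lines):
    """Return True if four lines match the layout described above."""
    if len(lines) != 4:
        return False
    header, bases, plus, quality = [ln.rstrip("\n") for ln in lines]
    return (
        header.startswith("@")            # ID line starts with "@"
        and " " in header                 # ID line contains a space
        and plus.startswith("+")          # third line is the plus sign
        and len(quality) == len(bases)    # one quality char per base
    )


def keep_record(header):
    # Assumption: headers ending in "1:N:0:" are kept, anything else is removed.
    return header.rstrip().endswith("1:N:0:")
```

Running a check like this over both the raw and the filtered files would
show whether a malformed group first appears in the input or in the output.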

> There is a blank line, which I take as NOT part of the input but just a
> spacer. Then:
>
> 1) Line starting with @
> 2) Line of bases CGCGT ...
> 3) Plus sign
> 4) Line starting with @@@
> 5) Line starting with @
> 6) Line of bases TTCTA ...
> 7) Plus sign
>
> and so on. There are TWO lines before the first +, and three before each
> of the others.

I think you are just reading one frame shifted.  It's not a
well-designed format, because the required start character "@" can
appear in other places as well...
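
To illustrate the frame-shift problem, here is a small sketch using two
records cut down from the example data (the bases and quality strings are
shortened to five characters for readability, so this is not real input).
It compares picking headers by their leading "@" with picking them by
position:

```python
# Two shortened records; note that both quality lines also start with "@".
sample = [
    "@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:",
    "CGCGT",
    "+",
    "@@@DD",
    "@HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:",
    "TTCTA",
    "+",
    "@CCFF",
]

# Naive: treat every line starting with "@" as a record header.
naive_headers = [ln for ln in sample if ln.startswith("@")]

# Positional: a header is every fourth line, counting from the first.
positional_headers = sample[0::4]

print(len(naive_headers))       # 4 (quality lines are miscounted as headers)
print(len(positional_headers))  # 2
```

Counting lines in groups of four avoids the ambiguity, as long as the file
has no blank or missing lines.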

>
>
>> __EXAMPLE RAW DATA FILE REGION__
>>
>> @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
>> CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
>> +
>> @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@#########
>> @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
>> TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
>> +
>> @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
>> @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
>> CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
>> +
>> CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA
>> @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
>> ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
>> +
>> CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A
>> @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
>> CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
>> +

>
> Your code says that the first line in each group should start with an @
> sign. That is clearly not the case for the last two groups.
>
> I suggest that your data files have been corrupted.

I'm pretty sure that my raw IN files are all good.  It's hard to be sure
with such a large file, but the very picky downstream analysis program
takes every single raw file just fine (30 of them), and gaks on my
filtered files, at regions that don't conform to the correct
formatting.

>
>> __PYTHON CODE __
>
> I have re-written your code slightly, to be a little closer to "best
> practice", or at least modern practice. If there is anything you don't
> understand, please feel free to ask.
>
> I haven't tested this code, but it should run fine on Python 2.7.
>
> It will be interesting to see if you get different results with this.

--CODE REMOVED--

Thanks for the suggestions.  I've never really felt super comfortable
using objects at all, but it's what I want to learn next.  This will be
helpful and useful.

> for reads, lines in four_lines( INFILE ):
>     ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines

Can you explain what is going on here, or point me in the right
direction?  I see that the parts of 'lines' get assigned, but I'm
missing how the file gets iterated over and how reads gets
incremented.
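
For reference, one way such a loop could work is with a generator. The
sketch below is a hypothetical reconstruction, not the four_lines from the
removed code, which may differ. The idea is that the generator itself pulls
four lines at a time from the file and keeps the running count, and each
time around the "for" loop it hands back a (count, group) pair that gets
unpacked into reads and lines:

```python
from itertools import islice


def four_lines(lines):
    """Yield (count, group) pairs; group is a tuple of four lines.

    Hypothetical reconstruction of the helper used in the quoted loop.
    Works on an open file or on any other iterable of lines.
    """
    it = iter(lines)
    count = 0
    while True:
        group = tuple(islice(it, 4))   # take the next four lines
        if len(group) < 4:
            break                      # end of input; a short trailing group is dropped
        count += 1                     # this is where "reads" gets incremented
        yield count, group


# Made-up demo data standing in for INFILE.
demo = ["@id1 1:N:0:\n", "ACGT\n", "+\n", "@@@D\n",
        "@id2 1:N:0:\n", "TTCT\n", "+\n", "@CCF\n"]

for reads, lines in four_lines(demo):
    ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines
    print("%d %s" % (reads, Seq_Line.strip()))
```

With a real file, demo would be replaced by the open file object, since
iterating over a file yields one line at a time.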

Do you have a reason why this approach might give a 'better' output?

Thanks again.