[Tutor] Problem When Iterating Over Large Test Files
Ryan Waples
ryan.waples at gmail.com
Thu Jul 19 01:33:20 CEST 2012
I'm seeing some unexpected output when I use a script (included at
end) to iterate over large text files. I am unsure of the source of
the unexpected output and any help would be much appreciated.
Background
Python v 2.7.1
Windows 7 32bit
Reading and writing to an external USB hard drive
Data files are ~4GB text (.fastq) file, it has been uncompressed
(gzip). This file has no errors or formatting problems, it seems to
have uncompressed just fine. 64M lines, each 'entry' is split across
4 consecutive lines, 16M entries.
My python script iterates over data files 4 lines at a time, selects
and writes groups of four lines to the output file. I will end up
selecting roughly 85% of the entries.
In my output I am seeing lines that don't occur in the original file,
and that don't match any lines in the original file. The incidences
of badly formatted lines don't seem to match up with any patterns in
the data file, and occur across multiple different data files.
I've included 20 consecutive lines of input and output. Each of these
5 'records' should have been selected and printed to the output file.
But there is a problem with the 4th and 5th entries in the output, and
it no longer matches the input as expected. For example the line:
TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT
never occurs in the original data.
Sorry for the large block of text below.
Other pertinent info, I've tried a related perl script, and ran into
similar issues, but not in the same places.
Any help or insight would be appreciated.
Thanks
__EXAMPLE RAW DATA FILE REGION__
@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
+
@@@DDADDHHHHHB9+2A<??:?G9+C)???G at DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5 at CDAC(5(5:5,(8?88?BC@#########
@HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
+
@CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI at GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4 at ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
@HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
+
CCCFFFFFHHHHHIJIEHIH at AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE at AACC<ACDB;;B?C3>A>AD>BA
@HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
+
CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD at D99B@>3884>A
@HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
+
__EXAMPLE PROBLEMATIC OUTPUT FILE REGION__
@HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
+
@@@DDADDHHHHHB9+2A<??:?G9+C)???G at DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5 at CDAC(5(5:5,(8?88?BC@#########
@HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
+
@CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI at GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4 at ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
@HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
+
CCCFFFFFHHHHHIJIEHIH at AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE at AACC<ACDB;;B?C3>A>AD>BA
TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT
+
BCCFFDFFHHHHFIJIJJHIFGIIIIGGGIGGIJIJIGIGIGIGHHIGIIJGJJJIIJIIEHIHHHFFFB@>CCE at BEDCDDAC?CC?ACC??<C>>ADDD
@HWI-ST0747:167:B02DEACXX:8:1304:19473:44548 1:N:0:
CTACAGTGCAGGCACCCGGCCCGCCACAATGAGTCGCTAGAGCGCAATGAGACAAGTAAAGCTCCCCGACCAAACCCTCCCCTAACCCGGACGATGCTGGG
+
BCCFFFFFHHHHHIJEHJJIIGIJIJJJJGIJIDHDGIGJJJJIGGFFFFEEEEED at CCDDC>C>BBD?BDBAABBBBBDDD at BCD@?@BDBDDDBDCCC2
__PYTHON CODE __
import glob
my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')
for each in my_in_files:
#print(each)
out = each.replace('/gzip', '/rem_clusters2' )
#print (out)
INFILE = open (each, 'r')
OUTFILE = open (out , 'w')
# Tracking Variables
Reads = 0
Writes = 0
Check_For_End_Of_File = 0
#Updates
print ("Reading File: " + each)
print ("Writing File: " + out)
# Read FASTQ File by group of four lines
while Check_For_End_Of_File == 0:
# Read the next four lines from the FASTQ file
ID_Line_1 = INFILE.readline()
Seq_Line = INFILE.readline()
ID_Line_2 = INFILE.readline()
Quality_Line = INFILE.readline()
# Strip off leading and trailing whitespace characters
ID_Line_1 = ID_Line_1.strip()
Seq_Line = Seq_Line.strip()
ID_Line_2 = ID_Line_2.strip()
Quality_Line = Quality_Line.strip()
Reads = Reads + 1
#Check that I have not reached the end of file
if Quality_Line == "":
#End of file reached, print update
print ("Saw " + str(Reads) + " reads")
print ("Wrote " + str(Writes) + " reads")
Check_For_End_Of_File = 1
break
#Check that ID_Line_1 starts with @
if not ID_Line_1.startswith('@'):
print ("**ERROR**")
print (each)
print ("Read Number " + str(Reads))
print ID_Line_1 + ' does not start with @'
break #ends the while loop
# Select Reads that I want to keep
ID = ID_Line_1.partition(' ')
if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
# Write to file, maintaining group of 4
OUTFILE.write(ID_Line_1 + "\n")
OUTFILE.write(Seq_Line + "\n")
OUTFILE.write(ID_Line_2 + "\n")
OUTFILE.write(Quality_Line + "\n")
Writes = Writes +1
INFILE.close()
OUTFILE.close()
More information about the Tutor
mailing list