[Tutor] Problem When Iterating Over Large Test Files
Ryan Waples
ryan.waples at gmail.com
Thu Jul 19 04:33:27 CEST 2012
Thanks for the replies; I'll try to address the questions raised and
spur further conversation.
>"those numbers (4GB and 64M lines) look suspiciously close to the file and record pointer limits to a 32-bit file system. Are you sure you aren't bumping into wrap around issues of some sort?"
My understanding is that I am reading the files as a stream, one line
at a time, and never loading them into memory all at once. I would
like (and expect) my script to be able to handle files of at least
50GB. If this would cause a problem, let me know.
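As a minimal sketch of the streaming pattern I mean (the filename here
is just a placeholder, not one of my actual files):

import sys

line_count = 0
with open("huge.fastq", "r") as INFILE:   # placeholder path
    for line in INFILE:                   # the file object yields one line at a time
        line_count += 1                   # only the current line is held in memory
sys.stdout.write("lines: %d\n" % line_count)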
> "my hunch is you might be having issues related to linux to dos EOF char."
I don't think this is the issue; 99.99% of the lines come out OK (see
examples). I do end up with an output file of some 50 million lines.
I can confirm that my Python code, as written and executed on Win7,
will convert the original line endings from Unix (LF) to Windows
(CRLF). This shouldn't confuse the downstream analysis.
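If the CRLF conversion ever did become a problem, a minimal sketch of
avoiding it (Python 2 on Windows assumed, paths are placeholders) would
be to open both files in binary mode so no newline translation happens:

in_path = "reads.fastq"         # placeholder paths
out_path = "reads_copy.fastq"
INFILE = open(in_path, "rb")    # 'rb' reads the raw bytes, LF stays LF
OUTFILE = open(out_path, "wb")  # 'wb' writes exactly what it is given
for line in INFILE:
    OUTFILE.write(line)         # no '\r' gets inserted before '\n'
INFILE.close()
OUTFILE.close()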
> "What are you doing to test that they don't match the original?"
Those two pieces of example data were grep'd (under Cygwin) out of the
IN and OUT files; each is the output of a grep that pulls the 20 lines
surrounding the line:
@HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
which is a unique line in each.
I have also grep'd the IN file for a sequence line taken from the OUT file:
grep ^TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT$
with no results
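A rough Python equivalent of that grep, just to show the check I'm
making (the IN path is a placeholder):

target = "TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT"
found = False
with open("in.fastq", "r") as INFILE:       # placeholder path
    for line in INFILE:
        if line.rstrip("\r\n") == target:   # exact whole-line match, like ^...$
            found = True
            break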
The Python code posted has a (weak) check that mostly serves to
confirm that every fourth line of the IN file starts with an "@"; this
IS the case for the IN file, but is NOT the case for the OUT file.
I can run my analysis program on the raw IN file fine; it processes
all entries. When the OUT file is supplied, it errors at the reads
shown in the pasted text.
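Sketched out, a slightly stronger version of that check (the path is a
placeholder) would walk a file four lines at a time and report the
first record whose ID line does not start with "@":

with open("out.fastq", "r") as FASTQ:       # placeholder path
    record_number = 0
    while True:
        ID_Line = FASTQ.readline()
        if not ID_Line:                     # end of file
            break
        Seq_Line = FASTQ.readline()
        Plus_Line = FASTQ.readline()
        Quality_Line = FASTQ.readline()
        record_number += 1
        if not ID_Line.startswith("@"):
            raise ValueError("record %d has a bad ID line: %r"
                             % (record_number, ID_Line))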
> "Earlier, you stated that each record should be four lines. But your sample data starts with a record of three lines."
I've checked again and the records look to be four lines each, so I'm not sure I understand.
Format:
1) ID line (must start with @) - contains filter criteria
2) Data line 1 - 101 chars
3) just a "+"
4) Data line 2 - 101 chars (may start with @, so records have to be read strictly by position; see the sketch below)
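Because of point 4, "@" at the start of a line is not a reliable record
marker; a minimal sketch of reading the records by position with
itertools.islice (the filename is a placeholder):

from itertools import islice

with open("example.fastq", "r") as FASTQ:   # placeholder path
    while True:
        record = list(islice(FASTQ, 4))     # next four lines, if any remain
        if len(record) < 4:                 # EOF (or a truncated record)
            break
        id_line, seq, plus, qual = [s.rstrip("\r\n") for s in record]
        assert id_line.startswith("@") and plus.startswith("+")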
> "Do they occur at random, or is this repeatable?"
When I'm back at work I'll confirm again that this is the case; I
should have a better answer then. I can confirm that it seems to
happen to every (large) file I've tested; no file seems unaffected.
Thanks
__SHORTENED PYTHON CODE__
for each in my_in_files:
    out = each.replace('/gzip', '/rem_clusters2')
    INFILE = open(each, 'r')
    OUTFILE = open(out, 'w')

    # Tracking variables
    Reads = 0
    Writes = 0
    Check_For_End_Of_File = 0

    # Read FASTQ file by groups of four lines
    while Check_For_End_Of_File == 0:
        ID_Line_1 = INFILE.readline()
        Seq_Line = INFILE.readline()
        ID_Line_2 = INFILE.readline()
        Quality_Line = INFILE.readline()

        ID_Line_1 = ID_Line_1.strip()
        Seq_Line = Seq_Line.strip()
        ID_Line_2 = ID_Line_2.strip()
        Quality_Line = Quality_Line.strip()

        Reads = Reads + 1

        # Check that I have not reached the end of file
        if Quality_Line == "":
            Check_For_End_Of_File = 1
            break

        # Check that ID_Line_1 starts with @
        if not ID_Line_1.startswith('@'):
            break

        # Select reads that I want to keep
        ID = ID_Line_1.partition(' ')
        if ID[2] == "1:N:0:" or ID[2] == "2:N:0:":
            # Write to file, maintaining groups of four
            OUTFILE.write(ID_Line_1 + "\n")
            OUTFILE.write(Seq_Line + "\n")
            OUTFILE.write(ID_Line_2 + "\n")
            OUTFILE.write(Quality_Line + "\n")
            Writes = Writes + 1

    INFILE.close()
    OUTFILE.close()
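(my_in_files is not shown in the shortened code above; a hypothetical
way to build such a list would be something like:

import glob

my_in_files = glob.glob("/path/to/gzip/*.fastq")   # placeholder pattern, the real list is built elsewhere

so that the '/gzip' -> '/rem_clusters2' replacement above produces the
output paths.)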