[Tutor] Problem When Iterating Over Large Test Files
Wayne Werner
wayne at waynewerner.com
Thu Jul 19 13:18:00 CEST 2012
Just a few notes...
On Wed, 18 Jul 2012, Ryan Waples wrote:
<snip>
>
> import glob
>
> my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')
>
> for each in my_in_files:
> #print(each)
>     out = each.replace('/gzip', '/rem_clusters2' )
> #print (out)
>     INFILE = open (each, 'r')
>     OUTFILE = open (out , 'w')
>
It's slightly confusing to see your comments left-aligned instead of with the
code they refer to. At first glance it looked as though your block ended here,
when it does, in fact, continue.
> # Tracking Variables
>     Reads = 0
>     Writes = 0
>     Check_For_End_Of_File = 0
>
> #Updates
>     print ("Reading File: " + each)
>     print ("Writing File: " + out)
>
> # Read FASTQ File by group of four lines
>     while Check_For_End_Of_File == 0:
This is Python, not C - checking for EOF is probably silly (unless you're
really checking for end of data) - you can just do:
    for line in INFILE:
        ID_Line_1 = line
        Seq_Line = next(INFILE) # Replace with INFILE.next() for Python2
        ID_Line_2 = next(INFILE)
        Quality_Line = next(INFILE)
>
> # Read the next four lines from the FASTQ file
>         ID_Line_1 = INFILE.readline()
>         Seq_Line = INFILE.readline()
>         ID_Line_2 = INFILE.readline()
>         Quality_Line = INFILE.readline()
>
> # Strip off leading and trailing whitespace characters
>         ID_Line_1 = ID_Line_1.strip()
>         Seq_Line = Seq_Line.strip()
>         ID_Line_2 = ID_Line_2.strip()
>         Quality_Line = Quality_Line.strip()
>
Also, it's just extra clutter to call strip like this when you can just tack it
on to your original statement:
    for line in INFILE:
        ID_Line_1 = line.strip()
        Seq_Line = next(INFILE).strip() # Replace with INFILE.next() for Python2
        ID_Line_2 = next(INFILE).strip()
        Quality_Line = next(INFILE).strip()
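Here is a rough, self-contained sketch of that loop so you can try it without touching your real data. The function name read_records and the two-record sample are made up for illustration, and it assumes Python 3 and a file whose line count is a multiple of four:

```python
import io

def read_records(infile):
    """Collect four-line records from an open text file as stripped tuples."""
    records = []
    for line in infile:
        id_line_1 = line.strip()
        # next() pulls the following lines from the same iterator the
        # for loop is using, so each pass consumes four lines in total.
        seq_line = next(infile).strip()
        id_line_2 = next(infile).strip()
        quality_line = next(infile).strip()
        records.append((id_line_1, seq_line, id_line_2, quality_line))
    return records

# Hypothetical two-record sample standing in for a real .fastq file.
sample = io.StringIO(
    "@read1 1:N:0:\nACGT\n+\nIIII\n"
    "@read2 1:Y:0:\nTTGA\n+\nHHHH\n"
)
records = read_records(sample)
```

If the file ends in the middle of a record, next() raises StopIteration instead of quietly handing back an empty string, so a truncated file fails loudly rather than being treated as a normal end of data.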
>         Reads = Reads + 1
>
> #Check that I have not reached the end of file
>         if Quality_Line == "":
> #End of file reached, print update
>             print ("Saw " + str(Reads) + " reads")
>             print ("Wrote " + str(Writes) + " reads")
>             Check_For_End_Of_File = 1
>             break
Setting the flag and then breaking is redundant - the break will actually
remove you from the while loop immediately, so no further lines of code will be
evaluated, including the original `while` comparison, and
Check_For_End_Of_File is never looked at again. You can also just test the
Quality_Line for truthiness directly, since empty strings evaluate to false. I
would actually just say:
    if Quality_Line:
        #Do the rest of your stuff here
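One thing to keep in mind with the truthiness test, shown in this small sketch (the in-memory sample is made up, Python 3 assumed): readline() returns an empty string at end of file, but a blank line in the middle of a file also becomes an empty string once you strip() it, so after stripping the two cases look the same.

```python
import io

# Hypothetical input: one quality line followed by a blank line.
infile = io.StringIO("IIII\n\n")

first = infile.readline().strip()   # a normal line, truthy after strip()
blank = infile.readline().strip()   # "\n" strips down to "", which is falsy
at_eof = infile.readline()          # readline() returns "" once data runs out

if first:
    status = "have data"
else:
    status = "no data"
```

If you need to tell a blank line from the real end of the file, test the value readline() gives you before stripping it.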
>
> #Check that ID_Line_1 starts with @
>         if not ID_Line_1.startswith('@'):
>             print ("**ERROR**")
>             print (each)
>             print ("Read Number " + str(Reads))
>             print ID_Line_1 + ' does not start with @'
>             break #ends the while loop
>
> # Select Reads that I want to keep
>         ID = ID_Line_1.partition(' ')
>         if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
> # Write to file, maintaining group of 4
>             OUTFILE.write(ID_Line_1 + "\n")
>             OUTFILE.write(Seq_Line + "\n")
>             OUTFILE.write(ID_Line_2 + "\n")
>             OUTFILE.write(Quality_Line + "\n")
>             Writes = Writes +1
>
>
>     INFILE.close()
>     OUTFILE.close()
You could just use the `with` block for reading the files; then you don't need
to worry about closing - the block takes care of that, even on errors. (The
`with` statement itself needs 2.6 or greater, and opening two files in a single
`with` like this needs 2.7 or 3.1+; on 2.6 you can nest two `with` blocks
instead.)
    for each in my_in_files:
        out = each.replace('/gzip', '/rem_clusters2' )
        with open (each, 'r') as INFILE, open (out, 'w') as OUTFILE:
            for line in INFILE:
                # Do your work here...
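Putting the pieces together, something along these lines might work as a starting point. It is only a sketch: filter_reads, KEEP and the temporary sample file are names I made up, it assumes Python 3 and complete four-line records, and it keeps the same two ID suffixes your script tests for.

```python
import os
import tempfile

KEEP = ("1:N:0:", "2:N:0:")  # the two ID suffixes the original script keeps

def filter_reads(in_path, out_path):
    """Copy the wanted four-line records; return (reads, writes)."""
    reads = writes = 0
    # Both files are closed when the block exits, even if an error occurs.
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for line in infile:
            # First line of the record plus the three that follow it.
            record = [line.strip()] + [next(infile).strip() for _ in range(3)]
            reads += 1
            # Everything after the first space in the ID line.
            if record[0].partition(' ')[2] in KEEP:
                outfile.write('\n'.join(record) + '\n')
                writes += 1
    return reads, writes

# Hypothetical sample data written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'sample.fastq')
out_path = os.path.join(tmp_dir, 'kept.fastq')
with open(in_path, 'w') as f:
    f.write("@r1 1:N:0:\nACGT\n+\nIIII\n@r2 1:Y:0:\nTTGA\n+\nHHHH\n")

counts = filter_reads(in_path, out_path)
with open(out_path) as f:
    kept = f.read()
```

Returning the two counters instead of printing them inside the loop makes it easy to compare reads against writes for each file afterwards.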
A few stylistic points:
ALL_CAPS are usually reserved for constants - infile and outfile are perfectly
legitimate names.
Caps_In_Variable_Names are usually discouraged. Class names should be CamelCase
(e.g. SimpleHTTPServer), while variable names should be lowercase with
underscores if needed, so id_line_1 instead of ID_Line_1.
If you're using Python3, or have from __future__ import print_function at the
top of your file on 2.6 or greater, then rather than doing
OUTFILE.write(value + '\n') you can do:
    print(value, file=OUTFILE)
Then you get the \n for free. You could also just do:
    print(val1, val2, val3, sep='\n', end='\n', file=OUTFILE)
The end parameter is there for example only, since the default value for end is
'\n'.
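A quick way to convince yourself, assuming Python 3 and using an in-memory io.StringIO as a stand-in for your output file (the record values are invented):

```python
import io

outfile = io.StringIO()  # stand-in for a file opened with open(out, 'w')

# sep goes between the values, and the default end='\n' finishes the record.
print("@r1 1:N:0:", "ACGT", "+", "IIII", sep='\n', file=outfile)

written = outfile.getvalue()
```

That single print call does the job of four separate write calls, and the group of four lines stays together.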
HTH,
Wayne