[Tutor] Problem When Iterating Over Large Test Files

Thu Jul 19 13:18:00 CEST 2012

Just a few notes...

On Wed, 18 Jul 2012, Ryan Waples wrote:
<snip>
>
> import glob
>
> my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')
>
> for each in my_in_files:
> 	#print(each)
> 	out = each.replace('/gzip', '/rem_clusters2' )
> 	#print (out)
> 	INFILE = open (each, 'r')
> 	OUTFILE = open (out , 'w')
>

It's slightly confusing to see your comments left-aligned instead of with the
code they refer to. At first glance it looked as though your block ended here,
when it does, in fact, continue.

> # Tracking Variables
> 	Reads = 0
> 	Writes = 0
> 	Check_For_End_Of_File = 0
>
> #Updates
> 	print ("Reading File: " + each)
> 	print ("Writing File: " + out)
>
> # Read FASTQ File by group of four lines
> 	while Check_For_End_Of_File == 0:

This is Python, not C - you rarely need to check for EOF yourself (unless
you're really checking for end of data). A file object is its own iterator, so
you can just do:

for line in INFILE:
     ID_Line_1 = line
     Seq_Line = next(INFILE) # Use INFILE.next() on Python older than 2.6
     ID_Line_2 = next(INFILE)
     Quality_Line = next(INFILE)
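
As a rough sketch of that pattern (the helper name read_records and the sample
data are made up for illustration, and io.StringIO stands in for a real file),
reading in groups of four might look like this on Python 3. Passing a default
to next() avoids a StopIteration if the file ends part-way through a record:

```python
import io

def read_records(infile):
    """Yield (id_line_1, seq_line, id_line_2, quality_line) tuples.

    Hypothetical helper, not part of the original script.
    """
    for line in infile:
        id_line_1 = line
        # next() with a default returns "" instead of raising
        # StopIteration when the file ends part-way through a record.
        seq_line = next(infile, "")
        id_line_2 = next(infile, "")
        quality_line = next(infile, "")
        yield id_line_1, seq_line, id_line_2, quality_line

# Made-up two-record sample; io.StringIO stands in for an open file.
sample = io.StringIO(
    "@read1 1:N:0:\nACGT\n+\nIIII\n"
    "@read2 1:Y:0:\nTTGA\n+\nIIII\n"
)
records = list(read_records(sample))
print(len(records))
```

Each pass through the for loop consumes one ID line, and the three next()
calls consume the rest of that record, so the loop advances four lines at a
time.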

>
> 		# Read the next four lines from the FASTQ file
> 		ID_Line_1		= INFILE.readline()
> 		Seq_Line 		= INFILE.readline()
> 		ID_Line_2 		= INFILE.readline()
> 		Quality_Line 	= INFILE.readline()
>
> 		# Strip off leading and trailing whitespace characters
> 		ID_Line_1		= ID_Line_1.strip()
> 		Seq_Line		= Seq_Line.strip()
> 		ID_Line_2		= ID_Line_2.strip()
> 		Quality_Line 	= Quality_Line.strip()
>

Also, it's just extra clutter to call strip like this when you can just tack it
on to your original statement:

for line in INFILE:
     ID_Line_1 = line.strip()
     Seq_Line = next(INFILE).strip() # Use INFILE.next() on Python older than 2.6
     ID_Line_2 = next(INFILE).strip()
     Quality_Line = next(INFILE).strip()

> 		Reads = Reads + 1
>
> 		#Check that I have not reached the end of file
> 		if Quality_Line == "":
> 			#End of file reached, print update
> 			print ("Saw " + str(Reads) + " reads")
> 			print ("Wrote " + str(Writes) + " reads")
> 			Check_For_End_Of_File = 1
> 			break

Setting the flag and then calling break is redundant - break takes you out of
the while loop immediately, so no further lines of code will be evaluated,
including the original `while` comparison. You can also just test Quality_Line
for truthiness directly, since empty strings evaluate to false. I would
actually just say:

if Quality_Line:
     pass  # Do the rest of your stuff here
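
To illustrate the truthiness point, here is a small sketch (io.StringIO is a
stand-in for a real file, and the contents are made up). One caveat worth
keeping in mind: readline() returns '' only at end of file, but once you call
strip(), a blank line in the middle of the file also becomes '', so the two
cases look the same:

```python
import io

# Made-up contents: one data line, one blank line, then end of file.
infile = io.StringIO("IIII\n\n")

first = infile.readline().strip()   # 'IIII' is truthy
blank = infile.readline().strip()   # blank line: '\n' is stripped to ''
at_eof = infile.readline().strip()  # end of file: readline() returns ''

for name, value in [("first", first), ("blank", blank), ("at_eof", at_eof)]:
    if value:
        print(name, "has data")
    else:
        print(name, "is empty")
```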

>
> 		#Check that ID_Line_1 starts with @
> 		if not ID_Line_1.startswith('@'):
> 			print ("**ERROR**")
> 			print (each)
> 			print ("Read Number " + str(Reads))
> 			print ID_Line_1 + ' does not start with @'
> 			break #ends the while loop
>
> 		# Select Reads that I want to keep
> 		ID = ID_Line_1.partition(' ')
> 		if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
> 			# Write to file, maintaining group of 4
> 			OUTFILE.write(ID_Line_1 + "\n")
> 			OUTFILE.write(Seq_Line + "\n")
> 			OUTFILE.write(ID_Line_2 + "\n")
> 			OUTFILE.write(Quality_Line + "\n")
> 			Writes = Writes +1
>
>
> 	INFILE.close()
> 	OUTFILE.close()

You could (as long as you're on 2.7 or greater; 2.6 has `with`, but not the
two-files-on-one-line form used here) just use a `with` block for the files.
Then you don't need to worry about closing them - the block takes care of
that, even on errors:

for each in my_in_files:
     out = each.replace('/gzip', '/rem_clusters2' )
     with open (each, 'r') as INFILE, open (out, 'w') as OUTFILE:
         for line in INFILE:
             pass  # Do your work here...
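
Putting the pieces together, a minimal sketch of the whole filter might look
like the following. The function name filter_fastq, the temporary file names
and the sample reads are all made up for illustration; only the selection rule
(keep IDs ending in 1:N:0: or 2:N:0:) comes from your script:

```python
import os
import tempfile

def filter_fastq(in_path, out_path):
    """Copy the reads we want to keep and return (reads, writes).

    Hypothetical helper that mirrors the selection rule in the quoted script.
    """
    reads = 0
    writes = 0
    # Both files are closed when the block exits, even on an error.
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for line in infile:
            id_line_1 = line.strip()
            seq_line = next(infile, '').strip()
            id_line_2 = next(infile, '').strip()
            quality_line = next(infile, '').strip()
            reads += 1
            if id_line_1.partition(' ')[2] in ('1:N:0:', '2:N:0:'):
                # Write to file, maintaining the group of four.
                for value in (id_line_1, seq_line, id_line_2, quality_line):
                    outfile.write(value + '\n')
                writes += 1
    return reads, writes

# Made-up sample data written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'sample.fastq')
out_path = os.path.join(tmp_dir, 'sample_filtered.fastq')
with open(in_path, 'w') as handle:
    handle.write('@read1 1:N:0:\nACGT\n+\nIIII\n'
                 '@read2 1:Y:0:\nTTGA\n+\nIIII\n')

reads, writes = filter_fastq(in_path, out_path)
print("Saw " + str(reads) + " reads")
print("Wrote " + str(writes) + " reads")
```

Returning the counts instead of printing inside the loop keeps the function
easy to check against a small file before you turn it loose on the big ones.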

A few stylistic points:
ALL_CAPS are usually reserved for constants - infile and outfile are perfectly
legitimate names.

Caps_In_Variable_Names are usually discouraged. Class names should be CamelCase
(e.g. SimpleHTTPServer), while variable names should be lowercase with
underscores if needed, so id_line_1 instead of ID_Line_1.

If you're using Python 3 (or Python 2.6+ with from __future__ import
print_function), rather than doing OUTFILE.write(value + '\n') you can do:

     print(value, file=OUTFILE)

Then you get the \n for free. You could also just do:

     print(val1, val2, val3, sep='\n', end='\n', file=OUTFILE)

The end parameter is there for example only, since the default value for end
is '\n'.
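
As a quick sketch of that on Python 3 (io.StringIO stands in for an open
output file, and the record values are made up), one print call can write a
whole group of four lines:

```python
import io

outfile = io.StringIO()  # stand-in for an open output file

# Made-up record values.
id_line_1 = "@read1 1:N:0:"
seq_line = "ACGT"
id_line_2 = "+"
quality_line = "IIII"

# sep goes between the values, and the default end='\n' follows the last one.
print(id_line_1, seq_line, id_line_2, quality_line, sep='\n', file=outfile)

written = outfile.getvalue()
```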

HTH,
Wayne