[Tutor] Expanding a Python script to include a zcat and awk pre-process

galaxywatcher at gmail.com
Sat Jan 9 18:34:54 CET 2010

I finally got it working! I would do a victory lap around my apartment  
building if I wasn't recovering from a broken ankle.

Excuse my excitement, but this simple script marks a new level of  
Python proficiency for me. Thanks to Kent, Bob, Denis, and others who  
pointed me in the right direction.
It does quite a few things: decompresses a zipped file, or several
files if the archive holds more than one, processes a rather ugly csv
file (ugly because it uses a comma as a delimiter, yet there are commas
inside double-quoted fields), subtracts the first column from the
second, and prints a little summary to give me the data I need.

#!/usr/bin/env python
import re
import zipfile

# Two integer columns, then a third that is either quoted or a bare word.
# Compiled once, outside the loops, so it is reused for every line.
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\w+)""")
highflag = flagcount = sumtotal = 0
z = zipfile.ZipFile('textfile.zip')
for subfile in z.namelist():
    print "Working on filename: " + subfile + "\n"
    data = z.read(subfile)
    for line in data.splitlines():
        result = pat.match(line)
        if result is None:
            continue  # skip blank or malformed lines instead of crashing
        num1, num2 = result.groups()[:2]
        # Named diff rather than sum, which would shadow the built-in sum().
        diff = int(num2) - int(num1)
        if diff > 10000000:
            flag1 = " !!!!"
            flagcount += 1
        else:
            flag1 = ""
        if diff > highflag:
            highflag = diff
        print str(num2) + " - " + str(num1) + " = " + str(diff) + flag1
        sumtotal = sumtotal + diff

print "Total ranges = ", sumtotal
print "Total ranges over 10 million: ", flagcount
print "Largest range: ", highflag

A few observations from a Python newbie: The zipfile and gzip modules  
should really be merged together. gzcat on unix reads both compression  
formats. It took me way too long to figure out the namelist() method.  
But I did learn a lot more about how zip actually works as a result.  
Is there really no way to extract the contents of a single zipped file  
without using a 'for in namelist' construct?
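
On the single-file question: ZipFile.read() takes a member name
directly, so the loop is only needed when the names are not known in
advance. A minimal sketch in Python 3 syntax, assuming a hypothetical
archive built in memory with one member called data.csv (the name and
the contents are made up for illustration):

```python
import io
import zipfile

# Build a small archive in memory so the example is self-contained.
# The member name "data.csv" and its contents are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zout:
    zout.writestr("data.csv", '100, 250, "alpha, beta"\n')

with zipfile.ZipFile(buf) as z:
    # Read one member by name, no loop over namelist() required.
    by_name = z.read("data.csv").decode("utf-8")
    # If the name is unknown but the archive holds a single file,
    # take the first (only) entry from namelist().
    only_name = z.namelist()[0]
    by_index = z.read(only_name).decode("utf-8")

print(by_name == by_index)  # True
```

Indexing namelist() still calls the method, but it avoids the for loop
when the archive is known to hold exactly one file.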

Trying to get split() to extract just two columns from my data was a  
dead end. The re module is the way to go.
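
To see why a plain split() struggles with this kind of file, here is a
short sketch in Python 3 syntax. The sample line is made up to match
the format described above (two numbers, then a quoted field that
contains a comma), and the pattern is equivalent to the one in the
script:

```python
import re

# Hypothetical sample line in the format described in the post.
line = '1000, 25000, "Smith, John"'

# Splitting on every comma also cuts the quoted field in two.
naive = line.split(",")
print(len(naive))  # 4 pieces instead of 3

# The regex captures the three columns intact.
pat = re.compile(r'(\d+), (\d+), (".+"|\w+)')
num1, num2, label = pat.match(line).groups()
print(int(num2) - int(num1))  # 24000
```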

I feel like I am in relatively new territory with Python's regex
engine. Denis did save me some valuable time with his regex, but my
file had values in the 3rd column that started with letters rather than
being purely numeric, and changing (\".+\"|\d+) to (\".+\"|\w+) had me
gnashing my teeth and pulling my hair the whole way through the regex
tutorial. When I finally figured it out, I smacked my forehead and said
"of course!". The compile() function of Python's re module is new for
me. Makes sense. Just something I have to get used to. I do have the
feeling that Perl's regex is better. But that is another story.
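
The difference between the \d+ and \w+ versions of the third group can
be checked in a few lines. This is a sketch in Python 3 syntax, and the
sample line is hypothetical (a third column that starts with a letter):

```python
import re

# Third column restricted to digits (the earlier version of the pattern).
digits_only = re.compile(r'(\d+), (\d+), (".+"|\d+)')
# Third column widened to word characters (letters, digits, underscore).
word_chars = re.compile(r'(\d+), (\d+), (".+"|\w+)')

# Hypothetical line whose third column starts with a letter.
line = "500, 900, chr7"

print(digits_only.match(line))          # None: "c" is not a digit
print(word_chars.match(line).group(3))  # chr7
```

re.compile() builds the pattern object once, so the same object can be
reused for every line of the file.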
