[Tutor] Expanding a Python script to include a zcat and awk pre-process

galaxywatcher at gmail.com
Fri Jan 8 08:24:47 CET 2010


I wrote a simple Python script to process a text file, but I had to
run a shell one-liner first to get the text file ready for the script.
I would much rather have the Python script handle the whole task
without any pre-processing at all. I will show 1) a small sample of the
text file, 2) my script, 3) the one-liner that I want to fold into the
script, and 4) the task at hand.

1) $ zcat textfile.txt.zip | head -5
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"


2) $ cat getranges.py
#!/usr/bin/env python

import string

highflag = flagcount = sum = sumtotal = 0
infile = open("textfile.txt", "r")
# Find the range by subtracting column 1 from column 2
for line in infile:
     num1, num2 = string.split(line)
     sum = int(num2) - int(num1)
     if sum > 10000000:
         flag1 = " !!!!"
         flagcount += 1
         if sum > highflag:
             highflag = sum
     else:
         flag1 = ""
     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
     sumtotal = sumtotal + sum
print "Total ranges = ", sumtotal
print "Total # of ranges over 10 million: ", flagcount
print "Largest range: ", highflag

3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt

4) In my first iteration, I used string.split(line, ",") but I ran
into trouble when I encountered commas within column 3, such as "32787
Protex Technologies, Inc.". I don't know how to handle this case. I
also don't know how to uncompress the file in Python and pass it to
the rest of the script. Hence I used my zcat | awk one-liner to get
the job done. So how do I uncompress zipped and gzipped files in
Python, and how do I force split to only evaluate the first two
columns? Better yet, can I tell split to ignore commas inside the
double-quoted third column?
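
To make the two questions concrete, here is a minimal sketch of one way
the zcat | awk step could be folded into the script. It is not a
definitive implementation: it is written for Python 3 (the script above
is Python 2), the helper names open_text, ranges and summarize are
hypothetical, and it assumes the data is UTF-8 text and that a zip
archive holds a single member. It uses the standard gzip and zipfile
modules to uncompress, and the csv module so that commas inside the
double-quoted third column are not treated as separators.

```python
import csv
import gzip
import io
import zipfile

LIMIT = 10000000  # same threshold as in getranges.py


def open_text(path):
    """Return a text-mode file object for a zip, gzip, or plain file."""
    if zipfile.is_zipfile(path):
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds exactly one member, the data file.
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member),
                                encoding="utf-8", newline="")
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def ranges(lines):
    """Yield (num1, num2) taken from the first two columns of each row."""
    # skipinitialspace=True skips the blank after each comma, so the
    # reader recognises the opening double quote of column 3.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) >= 2:
            yield int(row[0]), int(row[1])


def summarize(lines):
    """Return (sum of all ranges, count over LIMIT, largest over LIMIT)."""
    total = flagcount = highflag = 0
    for num1, num2 in ranges(lines):
        size = num2 - num1
        total += size
        if size > LIMIT:
            flagcount += 1
            highflag = max(highflag, size)
    return total, flagcount, highflag
```

Under those assumptions, summarize(open_text("textfile.txt.zip")) would
return the three figures that getranges.py prints at the end. If only
the first two columns matter, line.split(",", 2) is a simpler
alternative: the second argument caps the number of splits at two, so
everything after the second comma stays together as one piece.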

Regards,
Blake

