[Tutor] Expanding a Python script to include a zcat and awk pre-process

Fri Jan 8 09:35:41 CET 2010

<galaxywatcher at gmail.com> wrote in message 
news:5DFF81F3-A087-41F2-984B-914A08E2B2FC at gmail.com...
>I wrote a simple Python script to process a text file, but I had to run a
>shell one-liner to get the text file primed for the script. I would much
>rather have the Python script handle the whole task without any
>pre-processing at all. I will show 1) a small sample of the text file, 2)
>my script, 3) the one-liner that I want to fold into the script, and 4)
>the task at hand.
>
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
>
>
> 2) $ cat getranges.py
> #!/usr/bin/env python
>
> import string
>
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>     num1, num2 = string.split(line)
>     sum = int(num2) - int(num1)
>     if sum > 10000000:
>         flag1 = " !!!!"
>         flagcount += 1
>         if sum > highflag:
>             highflag = sum
>     else:
>         flag1 = ""
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> 3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt
>
> 4) In my first iteration, I used string.split(num1, ",") but I ran into
> trouble when I encountered commas within column 3, such as "32787
> Protex Technologies, Inc.". I don't know how to handle this case.
> I also don't know how to uncompress the file in Python and pass it to the
> rest of the script. Hence I used my zcat | awk one-liner to get the job
> done. So how do I uncompress zipped and gzipped files in Python, and how do
> I force split to evaluate only the first two columns? Better yet, can I
> tell split to ignore commas inside the double-quoted 3rd column?

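Check out the csv, zipfile, and gzip modules in the Python documentation.
The csv module copes with commas inside quoted fields, and zipfile and gzip
read compressed files directly.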
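
As a rough sketch of how those three modules might fit together (written in
Python 3 syntax, unlike the Python 2 script quoted above): open_text,
iter_ranges and summarize are names made up for this illustration, and the
10,000,000 threshold is taken from the original script. Sniffing the file
contents instead of trusting the extension is an assumption on my part, made
because a file that zcat can read may be gzip data despite a .zip name. The
zip branch also assumes the archive holds a single member, and UTF-8 is a
guess at the encoding.

```python
import csv
import gzip
import io
import zipfile


def open_text(path):
    """Return a text-mode file object for a zip, gzip, or plain text file.

    Assumptions: a zip archive holds exactly one member (the data file),
    and the text is UTF-8 encoded.
    """
    if zipfile.is_zipfile(path):
        archive = zipfile.ZipFile(path)
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member),
                                encoding="utf-8", newline="")
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def iter_ranges(lines):
    """Yield (start, end) integer pairs from the first two CSV columns."""
    # skipinitialspace=True lets the reader see the quote that follows
    # ", " so commas inside the quoted third column do not split it.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or short lines
        yield int(row[0]), int(row[1])


def summarize(ranges, threshold=10000000):
    """Return (total of all ranges, count over threshold, largest over threshold)."""
    total = flagged = largest = 0
    for start, end in ranges:
        size = end - start
        total += size
        if size > threshold:
            flagged += 1
            largest = max(largest, size)
    return total, flagged, largest
```

With these in place, summarize(iter_ranges(open_text("textfile.txt.zip")))
should give the same three figures the original script prints at the end,
without the zcat | awk step. The third column is parsed but simply never used.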

-Mark