[Tutor] Expanding a Python script to include a zcat and awk pre-process

Fri Jan 8 13:59:43 CET 2010

galaxywatcher at gmail.com dixit:

> I wrote a simple Python script to process a text file, but I had to  
> run a shell one liner to get the text file primed for the script. I  
> would much rather have the Python script handle the whole task without  
> any pre-processing at all. I will show 1) a small sample of the text  
> file, 2) my script, 3) the one liner that I want to fold into the  
> script, and 4) the task at hand.
> 
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"
> 
> 
> 2) $ cat getranges.py
> #!/usr/bin/env python
> 
> import string
> 
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>      num1, num2 = string.split(line)
>      sum = int(num2) - int(num1)
>      if sum > 10000000:
>          flag1 = " !!!!"
>          flagcount += 1
>          if sum > highflag:
>              highflag = sum
>      else:
>          flag1 = ""
>      print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>      sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
> 
> 3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt
> 
> 4) In my first iteration, I used string.split(num1, ",") but I ran  
> into trouble when I encountered commas within column 3, such as "32787  
> Protex Technologies, Inc.". I don't know how to handle this  
> exception. I also don't know how to uncompress the file in Python and  
> pass it to the rest of the script. Hence I used my zcat | awk oneliner  
> to get the job done. So how do I uncompress zip and gzipped files in  
> Python, and how do I force split to only evaluate the first two  
> columns? Better yet, can I tell split to not evaluate commas in the  
> double quoted 3rd column?
> 
> Regards,
> Blake

There are several possibilities:

1) Choosing ',' as the separator for data that can itself contain commas is, ahem, not very clever ;-)
Can you change that, and so solve the issue at its source? (E.g., any text processor can convert a table to plain text using whatever separator you choose.) CSV is not a panacea...

2) Preprocess the data to replace the commas _outside quotes_ with a better-chosen separator, such as TAB
(e.g., read the data character by character while keeping an "in_quotes" flag).

3) Use a more powerful text processing tool, such as regexes:

data = '''\
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"'''
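import re
# Two integer columns, then either a double-quoted string or a bare number.
pat = re.compile(r'(\d+), (\d+), (".+"|\d+)')
for line in data.splitlines():
    result = pat.match(line)
    if result:  # match() returns None for a line that does not fit
        print(result.groups())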
==>
('134873600', '134873855', '"32787 Protex Technologies, Inc."')
('135338240', '135338495', '40597')
('135338496', '135338751', '40993')
('201720832', '201721087', '"12838 HFF Infrastructure & Operations"')
('202739456', '202739711', '"1623 Beseau Regional de la Region Languedoc Roussillon"')
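
For comparison, possibility 2 (keeping an "in_quotes" flag while reading) could be sketched roughly as follows. The function name retab is made up for this illustration, and the sketch assumes that quotes are always balanced and never escaped inside a field:

```python
def retab(line, sep=",", new_sep="\t"):
    """Replace sep with new_sep wherever sep occurs outside double quotes.

    Hypothetical helper: assumes balanced, unescaped double quotes.
    """
    out = []
    in_quotes = False
    for char in line:
        if char == '"':
            in_quotes = not in_quotes   # toggle the flag at every quote
            out.append(char)
        elif char == sep and not in_quotes:
            out.append(new_sep)         # a real separator: swap it
        else:
            out.append(char)            # anything else is kept as-is
    return "".join(out)

line = '134873600, 134873855, "32787 Protex Technologies, Inc."'
# After the swap, a plain split on TAB is safe; strip the leftover spaces.
fields = [field.strip() for field in retab(line).split("\t")]
print(fields)
```

The comma inside the quoted company name survives, so the split yields exactly three fields.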

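As for reading the compressed file directly, here is a minimal sketch that combines the standard gzip module with the pattern above. It assumes the file is really gzip-compressed despite its .zip name (zcat reads gzip data); a true zip archive would need the zipfile module instead. The function name read_ranges and the throwaway sample file are invented for the demonstration:

```python
import gzip
import os
import re
import tempfile

PAT = re.compile(r'(\d+), (\d+), (".+"|\d+)')

def read_ranges(path):
    """Yield (start, end) integer pairs from a gzip-compressed text file."""
    infile = gzip.open(path, "rb")
    try:
        for raw in infile:
            # Assumption: the data is UTF-8 (or plain ASCII) text.
            match = PAT.match(raw.decode("utf-8"))
            if match:                   # skip lines that do not fit
                yield int(match.group(1)), int(match.group(2))
    finally:
        infile.close()

# Build a small gzip file so that the sketch runs on its own.
sample = (b'134873600, 134873855, "32787 Protex Technologies, Inc."\n'
          b'135338240, 135338495, 40597\n')
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
outfile = gzip.open(path, "wb")
outfile.write(sample)
outfile.close()

# Same arithmetic as in getranges.py: column 2 minus column 1.
sizes = [end - start for start, end in read_ranges(path)]
print(sizes)
```

With this, neither zcat nor awk is needed: the loop in getranges.py could iterate over read_ranges("textfile.txt.zip") instead of the preprocessed file.
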
Denis

________________________________

la vita e estrany

http://spir.wikidot.com/