[Tutor] Syntax for Simplest Way to Execute One Python Program Over 1000's of Datasets
Corey Richardson
kb1pkl at aim.com
Thu Jun 9 22:30:48 CEST 2011
On 06/09/2011 03:49 PM, B G wrote:
> I'm trying to analyze thousands of different cancer datasets and run the
> same python program on them. I use Windows XP, Python 2.7 and the IDLE
> interpreter. I already have the input files in a directory and I want to
> learn the syntax for the quickest way to execute the program over all these
> datasets.
>
> As an example,for the sample python program below, I don't want to have to
> go into the python program each time and change filename and countfile. A
> computer could do this much quicker than I ever could. Thanks in advance!
>
I think os.listdir() would be better for you than os.walk(), as Walter
suggested, but if you have a directory tree, walk is better. Your file
code could be simplified a lot by using context managers, which for a
file looks like this:
    with open(filename, mode) as f:
        f.write("Stuff!")
f is closed automatically when the block ends, even if an exception is
raised inside it.
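To tie this back to your actual question, here is a minimal sketch of
running one function over every file in a directory with os.listdir().
The names process_file and process_directory, the '.txt' filter, and the
'_output.txt' suffix are my assumptions for illustration; the body of
process_file is only a stand-in for your real analysis.

```python
import os

def process_file(in_path, out_path):
    # Stand-in for the real analysis: this just counts lines.
    with open(in_path) as infile:
        n_lines = len(infile.readlines())
    with open(out_path, 'w') as outfile:
        outfile.write("%d\n" % n_lines)

def process_directory(in_dir, out_dir):
    # Run process_file on every .txt file directly inside in_dir.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    done = []
    for name in sorted(os.listdir(in_dir)):
        in_path = os.path.join(in_dir, name)
        # Skip subdirectories and anything that is not a .txt file.
        if not os.path.isfile(in_path) or not name.endswith('.txt'):
            continue
        base = os.path.splitext(name)[0]
        out_path = os.path.join(out_dir, base + '_output.txt')
        process_file(in_path, out_path)
        done.append(out_path)
    return done
```

Writing the results to a separate output directory keeps a second run
from picking up the first run's output files as input.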
Now for some code review!
>
> import string
>
> filename = 'draft1.txt'
> countfile = 'draft1_output.txt'
>
> def add_word(counts, word):
>     if counts.has_key(word):
>         counts[word] += 1
>     else:
>         counts[word] = 1
>
See the notes on this later.
> def get_word(item):
>     word = ''
>     item = item.strip(string.digits)
>     item = item.lstrip(string.punctuation)
>     item = item.rstrip(string.punctuation)
>     word = item.lower()
>     return word
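This whole function could be simplified to:
    return item.strip(string.digits + string.punctuation).lower()
Note that foo.strip(bar) == foo.lstrip(bar).rstrip(bar). The one-liner is
slightly more aggressive than the original: it also removes digits and
punctuation that are mixed together at the ends (".1a" becomes "a" rather
than "1a"), which is probably what you want here.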
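If you want to convince yourself, here is a small self-contained version
you can paste into IDLE. The sample string is made up for illustration.

```python
import string

def get_word(item):
    # Strip digits and punctuation from both ends, then lowercase.
    return item.strip(string.digits + string.punctuation).lower()

chars = string.punctuation
sample = '--"Hello!"--'
# strip() with an argument removes those characters from both ends,
# which is the same as lstrip() followed by rstrip().
print(sample.strip(chars) == sample.lstrip(chars).rstrip(chars))
```

Punctuation inside a word is left alone, so "don't" keeps its apostrophe.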
>
>
> def count_words(text):
>     text = ' '.join(text.split('--')) #replace '--' with a space
How about
    text = text.replace('--', ' ')
>     items = text.split() #leaves in leading and trailing punctuation,
>                          #'--' not recognised by split() as a word separator
You can also pass split() a separator, as in text.split('--'), but then it
splits only on '--' and no longer on whitespace, so replace() followed by a
plain split() is what you want here. You should read the docs on string
methods:
http://docs.python.org/library/stdtypes.html#string-methods
>     counts = {}
>     for item in items:
>         word = get_word(item)
>         if not word == '':
That should be 'if word:', which just checks whether word evaluates to
True. Since the only string that evaluates to False is '', it makes the
code shorter and more readable.
>             add_word(counts, word)
>     return counts
A better way would be using a defaultdict, like so:
    from collections import defaultdict
    [...]
    def count_words(text):
        counts = defaultdict(int) # Every key starts off at 0!
        items = text.replace('--', ' ').split()
        for item in items:
            word = get_word(item)
            if word:
                counts[word] += 1
        return counts
Apart from missing keys getting a default value, a defaultdict behaves
like any other dict. We pass 'int' as a parameter because defaultdict
calls the parameter as a function to produce the default value. It works
out because int() == 0.
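Here is a tiny standalone demonstration of that behaviour. The word list
is made up for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)      # missing keys are created with int(), i.e. 0
for word in ['spam', 'eggs', 'spam']:
    counts[word] += 1

plain = dict(counts)           # an ordinary dict with the same contents
```

One thing to watch for: merely reading counts['missing'] inserts that key
with the value 0. Use 'missing' in counts or counts.get('missing') if you
want to check for a key without adding it.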
>
> infile = open(filename, 'r')
> text = infile.read()
> infile.close()
This could be:
    text = open(filename).read()
When you're opening a file as 'r', the mode is optional! This leaves
closing the file to the interpreter, though; the with form above closes
it explicitly.
>
> counts = count_words(text)
>
> outfile = open(countfile, 'w')
> outfile.write("%-18s%s\n" %("Word", "Count"))
> outfile.write("=======================\n")
It may just be me, but I think
    outfile.write(('=' * 23) + '\n')
looks better.
>
> counts_list = counts.items()
> counts_list.sort()
> for word in counts_list:
>     outfile.write("%-18s%d\n" %(word[0], word[1]))
>
> outfile.close
Parentheses are important! outfile.close is a method object,
outfile.close() is a method call. Without the parentheses the file is
never explicitly closed. Context managers make this easy, because you
don't have to manually close things.
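As a rough sketch, the output part of your program could look like this
with a context manager. The function name write_counts is my own
invention; the format strings are the ones from your code.

```python
def write_counts(counts, countfile):
    # The with block closes the file even if a write raises an exception.
    with open(countfile, 'w') as outfile:
        outfile.write("%-18s%s\n" % ("Word", "Count"))
        outfile.write('=' * 23 + '\n')
        # sorted() returns a new sorted list of (word, count) pairs.
        for word, count in sorted(counts.items()):
            outfile.write("%-18s%d\n" % (word, count))
```

Using sorted() instead of counts.items() followed by .sort() also saves a
line, and unpacking each pair into word and count reads better than
word[0] and word[1].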
Hope it helped,
--
Corey Richardson