[Tutor] Syntax for Simplest Way to Execute One Python Program Over 1000's of Datasets

Thu Jun 9 22:30:48 CEST 2011

On 06/09/2011 03:49 PM, B G wrote:
> I'm trying to analyze thousands of different cancer datasets and run the
> same python program on them.  I use Windows XP, Python 2.7 and the IDLE
> interpreter.  I already have the input files in a directory and I want to
> learn the syntax for the quickest way to execute the program over all these
> datasets.
> 
> As an example,for the sample python program below, I don't want to have to
> go into the python program each time and change filename and countfile.  A
> computer could do this much quicker than I ever could.  Thanks in advance!
> 

I think os.listdir() would be better for you than os.walk(), as Walter
suggested, as long as all your input files sit in a single directory;
if you have a directory tree, os.walk() is the better choice. Your file
code could be simplified a lot by using context managers (the with
statement), which for a file looks like this:

with open(filename, mode) as f:
    f.write("Stuff!")

f is closed automatically when the block ends, even if an exception is
raised inside it.
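To tie this back to your actual question, here is a minimal sketch of
walking one directory with os.listdir(). The function name find_datasets,
the '.txt' filter and the '_output' suffix (borrowed from your
draft1_output.txt naming) are my assumptions for illustration, so adjust
them to your real files.

```python
import os

def find_datasets(input_dir, extension='.txt', suffix='_output'):
    """Return sorted (input_path, output_path) pairs for the files in input_dir.

    Files whose name already ends in suffix are skipped, so a second run
    does not count the results of the first one.
    """
    pairs = []
    for name in sorted(os.listdir(input_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != extension or base.endswith(suffix):
            continue
        infile = os.path.join(input_dir, name)
        outfile = os.path.join(input_dir, base + suffix + ext)
        pairs.append((infile, outfile))
    return pairs
```

You would then write "for filename, countfile in find_datasets(some_dir):"
and put your existing read/count/write code in the loop body, instead of
editing the two names by hand.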

Now for some code review!

> 
> import string
> 
> filename = 'draft1.txt'
> countfile = 'draft1_output.txt'
> 
> def add_word(counts, word):
>     if counts.has_key(word):
>         counts[word] += 1
>     else:
>         counts[word] = 1
> 

See my notes on defaultdict further down; it makes this function
unnecessary.

> def get_word(item):
>     word = ''
>     item = item.strip(string.digits)
>     item = item.lstrip(string.punctuation)
>     item = item.rstrip(string.punctuation)
>     word = item.lower()
>     return word

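The whole function body could be simplified to:

return item.strip(string.digits + string.punctuation).lower()

Note that foo.strip(bar) == foo.lstrip(bar).rstrip(bar). One small
difference from your version: this strips digits and punctuation in any
mix, so '1.2abc' becomes 'abc' where your code would give '2abc'.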
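If you want to convince yourself, here is a small sketch comparing the
one-call version with the original three-step version. The sample strings
and the name get_word_original are made up for the comparison.

```python
import string

chars = string.digits + string.punctuation

def get_word(item):
    # One-call version: strip digits and punctuation, in any mix, from both ends.
    return item.strip(chars).lower()

def get_word_original(item):
    # The original order: digits first, then punctuation on each side.
    item = item.strip(string.digits)
    item = item.lstrip(string.punctuation)
    item = item.rstrip(string.punctuation)
    return item.lower()

sample = '"Hello,"'
same_as_two_calls = sample.strip(chars) == sample.lstrip(chars).rstrip(chars)
agree = get_word(sample) == get_word_original(sample)        # both give 'hello'
differ = get_word('1.2abc') != get_word_original('1.2abc')   # 'abc' vs '2abc'
```

For ordinary words the two agree; they only part ways when digits and
punctuation are interleaved at the ends of an item.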

> 
> 
> def count_words(text):
>     text = ' '.join(text.split('--')) #replace '--' with a space

How about

text = text.replace('--', ' ')

>     items = text.split() #leaves in leading and trailing punctuation,
>                          #'--' not recognised by split() as a word separator

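Keep that split() as it is. You can also specify the split string, as in
text.split('--'), but then the text is split only on '--' and no longer
on whitespace, so it is not a drop-in replacement here. You should read
the docs on string methods: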
http://docs.python.org/library/stdtypes.html#string-methods
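A short sketch of how the different calls behave on a made-up string,
in case the difference between split() with and without an argument is
not obvious:

```python
text = "one--two three"

by_whitespace = text.split()              # '--' is not a separator here
by_dashes = text.split('--')              # splits ONLY on '--'
both = text.replace('--', ' ').split()    # turn '--' into a space, then split
```

Here by_whitespace is ['one--two', 'three'], by_dashes is
['one', 'two three'], and both is ['one', 'two', 'three'], which is the
result the word counter wants.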

>     counts = {}
>     for item in items:
>         word = get_word(item)
>         if not word == '':

That should be 'if word:', which just checks whether word evaluates to
True. Since the only string that evaluates to False is '', it makes the
code shorter and more readable.
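A tiny illustration of that truth test, using a made-up list of words:

```python
words = ['hello', '', 'world', '']

# Only the empty string is falsy, so 'if word' drops exactly those entries.
kept = [word for word in words if word]

empty_is_false = bool('')    # False
space_is_true = bool(' ')    # True: a single space is not an empty string
```

Note the last line: a string holding only whitespace still counts as
True, so this test relies on get_word() having stripped things first.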

>             add_word(counts, word)
>     return counts

A better way would be to use a defaultdict, like so:

from collections import defaultdict
[...]

def count_words(text):
    counts = defaultdict(int) # Every key starts off at 0!
    items = text.replace('--', ' ').split() # '--' becomes a space, then split on whitespace
    for item in items:
        word = get_word(item)
        if word:
            counts[word] += 1
    return counts

Apart from supplying a default value for missing keys, a defaultdict
behaves like any other dict. We pass 'int' as a parameter because
defaultdict calls that parameter, with no arguments, to produce the
default value. It works out because int() == 0.
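Here is a minimal sketch of that behaviour on a made-up word list. One
thing worth knowing, shown at the end: merely looking up a missing key
creates it.

```python
from collections import defaultdict

counts = defaultdict(int)
for word in ['spam', 'eggs', 'spam']:
    counts[word] += 1    # a missing key is first created as int(), i.e. 0

had_ham_before = 'ham' in counts    # False: never counted
ham_count = counts['ham']           # 0, and this lookup inserts 'ham'
has_ham_after = 'ham' in counts     # True
```

If you only want to test for a key without creating it, use
"word in counts" or counts.get(word) rather than counts[word].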

> 
> infile = open(filename, 'r')
> text = infile.read()
> infile.close()

This could be:

text = open(filename).read()

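The mode defaults to 'r', so you can leave it out when you're only
reading! This one-liner never closes the file explicitly, though, so the
with form shown at the top is the safer habit.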
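For comparison, a sketch of the same read using a context manager. The
helper name read_text is mine, not something from your script.

```python
def read_text(filename):
    # No mode given, so the file is opened for reading ('r').
    # The with block closes the file even if read() raises.
    with open(filename) as infile:
        return infile.read()
```

It is one line longer than open(filename).read(), but with thousands of
input files it is nice to know each one is closed before the next is
opened.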

> 
> counts = count_words(text)
> 
> outfile = open(countfile, 'w')
> outfile.write("%-18s%s\n" %("Word", "Count"))
> outfile.write("=======================\n")

It may just be me, but I think

outfile.write(('=' * 23) + '\n')

looks better.

> 
> counts_list = counts.items()
> counts_list.sort()
> for word in counts_list:
>     outfile.write("%-18s%d\n" %(word[0], word[1]))
> 
> outfile.close

Parentheses are important! outfile.close is a method object,
outfile.close() is a method call. Without the parentheses nothing is
called and the file is never explicitly closed. Context managers make
this easy, because you don't have to manually close things.
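You can see the difference for yourself with a sketch like this one. I
use io.StringIO as an in-memory stand-in for a real file so nothing is
written to disk; a file from open() behaves the same way here.

```python
import io

outfile = io.StringIO()    # in-memory stand-in for an open file

outfile.close              # only looks the method up; nothing is called
closed_without_parens = outfile.closed    # False

outfile.close()            # actually calls the method
closed_with_parens = outfile.closed       # True
```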

Hope it helped,
-- 
Corey Richardson