[Tutor] Getting total counts

Fri Oct 1 22:31:42 CEST 2010

Hi,

I have created a csv file that lists how often each word in the Internet Movie Database occurs with different star-ratings and in different genres. The input file looks something like this. Since movies can have multiple genres, there are three genre columns. (This is fake, simplified data.)

ID     Genre1   Genre2    Genre3     Star-rating  Word  Count
film1  Drama    Thriller  Western    1            the   20
film2  Comedy   Musical   NA         2            the   20
film3  Musical  History   Biography  1            the   20
film4  Drama    Thriller  Western    1            the   10
film5  Drama    Thriller  Western    9            the   20

I can get the program to tell me how many occurrences of "the" there are in Thrillers (50), how many "the"s in 1-stars (50), and how many 1-star drama "the"s there are (30). But I need to be able to go beyond a particular word and ask: how many words in total are in "Drama"? How many total words are in 1-star ratings? How many words are there in the whole corpus? On these all-word totals, I'm stumped.

What I've done so far:
I used the shelve module to store my input csv in a database format, keyed by word.
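
For concreteness, here is a minimal sketch of how rows like the sample ones might end up in a shelf. The record layout (one list of dicts per word, with 'genres', 'rating' and 'count' keys) is inferred from the lookup code further down; the helper name build_db, the 'id' key, the tab-separated input and the temporary file path are all made up for illustration.

```python
import csv
import io
import os
import shelve
import tempfile

# Fake rows in the same column order as the sample table (tab-separated here).
SAMPLE = (
    "film1\tDrama\tThriller\tWestern\t1\tthe\t20\n"
    "film2\tComedy\tMusical\tNA\t2\tthe\t20\n"
    "film3\tMusical\tHistory\tBiography\t1\tthe\t20\n"
    "film4\tDrama\tThriller\tWestern\t1\tthe\t10\n"
    "film5\tDrama\tThriller\tWestern\t9\tthe\t20\n"
)

def build_db(lines, path):
    """Store one list of records per word in a shelf at `path` (hypothetical helper)."""
    by_word = {}
    for film_id, g1, g2, g3, rating, word, count in csv.reader(lines, delimiter="\t"):
        record = {
            "id": film_id,
            "genres": [g for g in (g1, g2, g3) if g != "NA"],  # drop the NA filler
            "rating": int(rating),  # convert once so later comparisons are int == int
            "count": int(count),
        }
        by_word.setdefault(word, []).append(record)
    db = shelve.open(path)
    try:
        for word, recs in by_word.items():
            db[word] = recs  # shelf keys must be strings; values are pickled
    finally:
        db.close()

path = os.path.join(tempfile.mkdtemp(), "imdb_words")
build_db(io.StringIO(SAMPLE), path)

db = shelve.open(path)
records = db["the"]
db.close()
print(len(records), records[0]["genres"])
```

Converting the rating and count to int at load time is a design choice in this sketch: values read from a csv file are strings, and a string rating never compares equal to an integer one.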

Here's how I get count information so far:
def get_word_count(word, db, genre=None, rating=None):
    # db maps each word to a list of records with 'genres', 'rating', 'count'
    c = 0
    vals = db[word]
    for val in vals:
        if not genre and not rating:
            # no filter: count every occurrence of the word
            c += val['count']
        elif genre and not rating:
            # genre filter only
            if genre in val['genres']:
                c += val['count']
        elif rating and not genre:
            # rating filter only
            if rating == val['rating']:
                c += val['count']
        else:
            # both filters
            if rating == val['rating'] and genre in val['genres']:
                c += val['count']
    return c

(I think there's something a little wrong with the rating stuff, here, but this code generally works and produces the right counts.)
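
One possible source of that doubt about the rating handling, offered as a guess rather than a diagnosis: "not rating" tests truthiness, so a rating of 0 would be treated as "no filter", and if the ratings come out of the csv as strings, rating == val['rating'] quietly never matches when the function is called with an integer. The sketch below tests against None explicitly and gives the same answers for the calls shown. The in-memory sample_db is a stand-in for the shelf and assumes integer ratings.

```python
# Stand-in for the shelf: word -> list of records (integer ratings assumed).
sample_db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
        {"genres": ["Musical", "History", "Biography"], "rating": 1, "count": 20},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 10},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 20},
    ]
}

def get_word_count(word, db, genre=None, rating=None):
    """Sum the counts for `word`, optionally restricted by genre and/or rating."""
    total = 0
    for val in db[word]:
        if genre is not None and genre not in val["genres"]:
            continue  # record is not in the requested genre
        if rating is not None and rating != val["rating"]:
            continue  # record does not have the requested rating
        total += val["count"]
    return total

print(get_word_count("the", sample_db, genre="Thriller"))         # 50
print(get_word_count("the", sample_db, rating=1))                 # 50
print(get_word_count("the", sample_db, genre="Drama", rating=1))  # 30
print(get_word_count("the", sample_db))                           # 90
```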

With "get_word_count" I can do stuff like this to figure out how many times "the" appears in a particular genre. 
def get_genre_count_for_word(word, db, genre):
    # get_word_count already loops over db[word], so no extra loop is needed here
    genre_ct_for_word = get_word_count(word, db, genre, rating=None)
    return genre_ct_for_word

I've tried to extend this thinking to get TOTAL genre/rating counts for all words, but it doesn't work. I get a TypeError saying that string indices must be integers. I'm not sure how to overcome this.

# Doesn't work:
def get_full_rating_count(db, rating=None):
    full_rating_ct = 0
    vals = db
    for val in vals:
        if not rating:
            full_rating_ct += val['count']
        elif rating == val['rating']:
            if rating == val['rating']: # Um, I know this looks dumb, but in the other code it seems to be necessary for things to work. 
                full_rating_ct += val['count']
    return full_rating_ct
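
A likely cause of the TypeError, assuming db is the same word-keyed shelf that get_word_count uses: iterating over a shelf (or any dict) yields its keys, so each val in that loop is a word string, and val['count'] tries to index a string with a string. A minimal sketch of one way around it is to loop over the words and then over each word's records. The plain dict below stands in for the shelf, and the "wow" entry is made-up extra data.

```python
# Stand-in for the shelf: word -> list of records.
sample_db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
    ],
    "wow": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 3},
    ],
}

def get_full_rating_count(db, rating=None):
    """Total count over all words, optionally restricted to one rating."""
    full_rating_ct = 0
    for word in db:           # iterating a dict or shelf yields keys (the words)
        for val in db[word]:  # each word's value is a list of record dicts
            if rating is None or rating == val["rating"]:
                full_rating_ct += val["count"]
    return full_rating_ct

print(get_full_rating_count(sample_db))            # 43
print(get_full_rating_count(sample_db, rating=1))  # 23
print(get_full_rating_count(sample_db, rating=2))  # 20
```

A genre total would follow the same shape, testing "genre in val['genres']" in place of the rating comparison.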

Can anyone suggest how to do this? 

Thanks!

Tyler

Background for the curious:
What I really want to know is which words are over- or under-represented in different Genre x Rating categories. "The" should be flat, but something like "wow" should be over-represented in 1-star and 10-star ratings and under-represented in 5-star ratings. Something like "gross" may be over-represented in low-star ratings for romances but if grossness is a good thing in horror movies, then we'll see "gross" over-represented in HIGH-star ratings for horror. 

To figure out over-representation and under-representation I need to compare "observed" counts to "expected" counts. The expected counts are derived from probabilities, and to compute those I need to know how many words I have in the whole corpus, how many words in each rating category, and how many words in each genre category.
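
To make that comparison concrete, here is a rough sketch that gathers the totals in one pass and computes an expected count on the assumption that word and rating are independent: expected = word_total * rating_total / corpus_total. That formula is one common baseline, assumed here and not necessarily the model intended; the function names and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Made-up data: word -> list of records.
sample_db = {
    "the": [
        {"genres": ["Drama", "Thriller"], "rating": 1, "count": 20},
        {"genres": ["Comedy"], "rating": 5, "count": 20},
    ],
    "wow": [
        {"genres": ["Drama", "Thriller"], "rating": 1, "count": 8},
        {"genres": ["Comedy"], "rating": 5, "count": 2},
    ],
}

def collect_totals(db):
    """Return (corpus_total, totals by rating, totals by genre, totals by word)."""
    corpus_total = 0
    by_rating = defaultdict(int)
    by_genre = defaultdict(int)
    by_word = defaultdict(int)
    for word in db:
        for val in db[word]:
            n = val["count"]
            corpus_total += n
            by_word[word] += n
            by_rating[val["rating"]] += n
            for genre in val["genres"]:
                # A multi-genre film adds its words to every one of its genres,
                # so the genre totals can sum to more than the corpus total.
                by_genre[genre] += n
    return corpus_total, dict(by_rating), dict(by_genre), dict(by_word)

def observed_over_expected(word, rating, db):
    """Ratio above 1 suggests over-representation, below 1 under-representation."""
    corpus_total, by_rating, _, by_word = collect_totals(db)
    observed = sum(v["count"] for v in db[word] if v["rating"] == rating)
    # Independence assumption: P(word, rating) = P(word) * P(rating)
    expected = by_word[word] * by_rating[rating] / float(corpus_total)
    return observed / expected

corpus_total, by_rating, by_genre, by_word = collect_totals(sample_db)
print(corpus_total, by_rating, by_genre)
print(round(observed_over_expected("wow", 1, sample_db), 2))  # 1.43
print(round(observed_over_expected("the", 1, sample_db), 2))  # 0.89
```

With a real shelf it would be cheaper to call collect_totals once and reuse the result, since every call walks the whole database.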
