[Tutor] Getting total counts
aeneas24 at priest.com
Fri Oct 1 22:31:42 CEST 2010
Hi,
I have created a csv file that lists how often each word in the Internet Movie Database occurs with different star-ratings and in different genres. The input file looks something like this. Since movies can have multiple genres, there are three genre columns. (This is fake, simplified data.)
ID    | Genre1  | Genre2   | Genre3    | Star-rating | Word | Count
film1 | Drama   | Thriller | Western   | 1           | the  | 20
film2 | Comedy  | Musical  | NA        | 2           | the  | 20
film3 | Musical | History  | Biography | 1           | the  | 20
film4 | Drama   | Thriller | Western   | 1           | the  | 10
film5 | Drama   | Thriller | Western   | 9           | the  | 20
I can get the program to tell me how many occurrences of "the" there are in Thrillers (50), how many "the"'s in 1-stars (50), and how many 1-star drama "the"'s there are (30). But I need to be able to expand beyond a particular word and ask: how many words in total are in "Drama"? How many total words are in 1-star ratings? How many words are there in the whole corpus? On these all-word totals, I'm stumped.
What I've done so far:
I used the shelve module to store my input csv in a database-like format.
Here's how I get count information so far:
def get_word_count(word, db, genre=None, rating=None):
    c = 0
    vals = db[word]
    for val in vals:
        if not genre and not rating:
            c += val['count']
        elif genre and not rating:
            if genre in val['genres']:
                c += val['count']
        elif rating and not genre:
            if rating == val['rating']:
                c += val['count']
        else:
            if rating == val['rating'] and genre in val['genres']:
                c += val['count']
    return c
(I think there's something a little wrong with the rating stuff, here, but this code generally works and produces the right counts.)
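For anyone who wants to try this without the shelve file, here is a minimal sketch that checks the counts quoted above against the fake rows. It assumes each word maps to a list of dicts with 'genres', 'rating' and 'count' keys (inferred from get_word_count, not confirmed), uses a plain dict as a stand-in for the shelf, stores ratings as integers, and drops the "NA" genre. The filtering is a compact rewrite, not the original function.

```python
# Stand-in for the shelve database: a plain dict mapping each word to a
# list of per-film records. The key names ('genres', 'rating', 'count')
# are taken from get_word_count; the exact layout is an assumption.
db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
        {"genres": ["Musical", "History", "Biography"], "rating": 1, "count": 20},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 10},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 20},
    ],
}


def get_word_count(word, db, genre=None, rating=None):
    """Sum the counts for `word`, optionally filtered by genre and/or rating."""
    total = 0
    for val in db[word]:
        # Skip records that fail a filter; None means "no filter".
        if genre is not None and genre not in val["genres"]:
            continue
        if rating is not None and rating != val["rating"]:
            continue
        total += val["count"]
    return total


print(get_word_count("the", db, genre="Thriller"))         # 50
print(get_word_count("the", db, rating=1))                 # 50
print(get_word_count("the", db, genre="Drama", rating=1))  # 30
print(get_word_count("the", db))                           # 90
```

If the ratings in the real database were read from the csv as strings, a comparison such as rating == val['rating'] with an integer argument would never match, which might be one source of the trouble with the rating filter. That is only a guess.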
With "get_word_count" I can do stuff like this to figure out how many times "the" appears in a particular genre.
vals = db[word]
for val in vals:
    genre_ct_for_word = get_word_count(word, db, genre, rating=None)
return genre_ct_for_word
I've tried to extend this thinking to get TOTAL genre/rating counts for all words, but it doesn't work. I get a TypeError saying that string indices must be integers. I'm not sure how to overcome this.
# Doesn't work:
def get_full_rating_count(db, rating=None):
    full_rating_ct = 0
    vals = db
    for val in vals:
        if not rating:
            full_rating_ct += val['count']
        elif rating == val['rating']:
            if rating == val['rating']: # Um, I know this looks dumb, but in the other code it seems to be necessary for things to work.
                full_rating_ct += val['count']
    return full_rating_ct
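The error can be reproduced in isolation with a sketch like the one below. It uses a plain dict as a hypothetical stand-in for the shelf, on the assumption that looping over a shelf behaves like looping over a dict. If so, each val in the loop is a word (a string) rather than one of the record dicts, which would explain the message about string indices.

```python
# Hypothetical stand-in for the shelve database; the record layout is
# assumed from get_word_count.
db = {"the": [{"genres": ["Drama"], "rating": 1, "count": 20}]}

keys_seen = []
error_messages = []

for val in db:
    # Iterating over a dict-like object yields its keys, so val is the
    # string "the" here, not a record dict.
    keys_seen.append(val)
    try:
        val["count"]  # indexing a string with a string
    except TypeError as err:
        error_messages.append(str(err))

print(keys_seen)
print(error_messages)
```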
Can anyone suggest how to do this?
Thanks!
Tyler
Background for the curious:
What I really want to know is which words are over- or under-represented in different Genre x Rating categories. "The" should be flat, but something like "wow" should be over-represented in 1-star and 10-star ratings and under-represented in 5-star ratings. Something like "gross" may be over-represented in low-star ratings for romances but if grossness is a good thing in horror movies, then we'll see "gross" over-represented in HIGH-star ratings for horror.
To figure out over-representation and under-representation I need to compare "observed" counts to "expected" counts. The expected counts are probabilities and they require me to understand how many words I have in the whole corpus and how many words in each rating category and how many words in each genre category.
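To make the comparison concrete, here is a rough sketch of the totals involved. It is written for Python 3 and makes several assumptions: the database is a plain dict standing in for the shelf, the "wow" records are invented purely for illustration, and the expected count is taken to be word total x rating total / corpus total (a common independence-style estimate, which may or may not be the formula that ends up being used).

```python
from collections import defaultdict

# Hypothetical stand-in for the shelve database. The "the" records mirror the
# fake rows above; the "wow" records are invented purely for illustration.
db = {
    "the": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 20},
        {"genres": ["Comedy", "Musical"], "rating": 2, "count": 20},
        {"genres": ["Musical", "History", "Biography"], "rating": 1, "count": 20},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 10},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 20},
    ],
    "wow": [
        {"genres": ["Drama", "Thriller", "Western"], "rating": 1, "count": 4},
        {"genres": ["Drama", "Thriller", "Western"], "rating": 9, "count": 1},
    ],
}

corpus_total = 0
word_totals = defaultdict(int)
rating_totals = defaultdict(int)
genre_totals = defaultdict(int)

# Walk every record of every word once and accumulate all totals.
for word, records in db.items():
    for rec in records:
        corpus_total += rec["count"]
        word_totals[word] += rec["count"]
        rating_totals[rec["rating"]] += rec["count"]
        for genre in rec["genres"]:
            genre_totals[genre] += rec["count"]


def expected_count(word, rating):
    """Expected count if the word followed each rating's share of the corpus."""
    return word_totals[word] * rating_totals[rating] / corpus_total


observed_wow_1 = sum(r["count"] for r in db["wow"] if r["rating"] == 1)

print(corpus_total)           # 95
print(rating_totals[1])       # 54
print(genre_totals["Drama"])  # 55
print(observed_wow_1, expected_count("wow", 1))
```

Because a film can carry up to three genres, the genre totals add up to more than the corpus total, so genre-based expected counts may need a different denominator.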