[Tutor] simple python script for collocation discovery

Emad Nawfal (عماد نوفل) emadnawfal at gmail.com
Sat Aug 16 19:55:36 CEST 2008


Hello Tutors,
I'm trying to write a small script to find collocations using the chi-squared
test on a fairly large corpus.
The program below works, but it is too slow, and I need to process
something like 50 million words.
How can I make it run faster?
Your help is appreciated.
Emad Nawfal


#!/usr/bin/python
# Chi-squared collocation discovery
# Important definitions first. Let's suppose that we
# are trying to find whether "powerful computers" is a collocation
# N = the number of all bigrams in the corpus
# O11 = how many times the bigram "powerful computers" occurs in the corpus
# O22 = the number of bigrams whose first word is not our first word and
# whose second word is not our second word = N - O11 - O12 - O21
# O12 = the number of bigrams whose second word is our second word
# but whose first word is not "powerful"
# O21 = the number of bigrams whose first word is our first word, but whose
# second word is different from our second word
###########################################################

print """
*************************************************
*                                               *
*         Welcome to the Collocationer          *
*                                               *
*************************************************
"""
# First, read the text and turn it into bigrams
bigrams = []
infile = open("corpus.txt")
text = infile.read().lower().split()
infile.close()
for i, v in enumerate(text):  # each word together with its index
    if i < len(text) - 1:  # guarantees that text[i+1] is not out of range
        bigram = v, text[i+1]  # each word and the word that follows it
        bigrams.append(bigram)



# The corpus was lowercased above, so lowercase the input as well
tested_collocate = raw_input(
    "Enter the bigram you think is a collocation\n").lower()
word1 = tested_collocate.split()[0]
word2 = tested_collocate.split()[1]

N = len(bigrams)
O11 = bigrams.count(tuple(tested_collocate.split()))
# Count the three remaining cells of the 2x2 contingency table
O22 = len([bigram for bigram in bigrams
           if bigram[0] != word1 and bigram[1] != word2])
O12 = len([bigram for bigram in bigrams
           if bigram[0] != word1 and bigram[1] == word2])
O21 = len([bigram for bigram in bigrams
           if bigram[0] == word1 and bigram[1] != word2])


chi2 = (N * ((O11 * O22 - O12 * O21) ** 2)) / float(
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print "Chi-Squared = ", chi2
# 3.841 is the chi-squared critical value for p = 0.05 with 1 degree of freedom
if chi2 > 3.841:
    print "These two words form a collocation"
else:
    print "These two words do not form a collocation"

raw_input('Enter to Exit')
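
One direction that might help with the speed, offered only as a sketch: the
script above walks the whole bigram list four times for every query (one
count() call plus three list comprehensions). If the bigram counts and the
per-word counts are collected once, the four table cells can be derived from
a few dictionary lookups. The function names count_bigrams and chi_squared
are made up for this illustration, and collections.Counter is assumed to be
available (Python 2.7 or later). Returning 0.0 when the denominator is zero
is also an assumption, not part of the original script.

```python
from collections import Counter
from itertools import islice


def count_bigrams(words):
    """Count bigrams and the words in each bigram position."""
    # Pair each word with its successor without copying the word list.
    bigram_counts = Counter(zip(words, islice(words, 1, None)))
    first_counts = Counter()   # how often a word is the first element
    second_counts = Counter()  # how often a word is the second element
    for (w1, w2), n in bigram_counts.items():
        first_counts[w1] += n
        second_counts[w2] += n
    n_bigrams = sum(bigram_counts.values())
    return bigram_counts, first_counts, second_counts, n_bigrams


def chi_squared(word1, word2, bigram_counts, first_counts, second_counts,
                n_bigrams):
    """Chi-squared statistic for the bigram (word1, word2)."""
    o11 = bigram_counts[(word1, word2)]
    o21 = first_counts[word1] - o11    # word1 first, a different second word
    o12 = second_counts[word2] - o11   # a different first word, word2 second
    o22 = n_bigrams - o11 - o12 - o21  # neither word in its position
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # Assumption: treat an undefined statistic (e.g. an unseen word) as 0.
        return 0.0
    return n_bigrams * (o11 * o22 - o12 * o21) ** 2 / float(denominator)
```

With this layout, count_bigrams(text) would be called once after reading the
corpus, and chi_squared(word1, word2, ...) could then be called for as many
candidate bigrams as needed without scanning the corpus again. Memory use
grows with the number of distinct bigrams rather than with the corpus length.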









-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com
--------------------------------------------------------