[Tutor] simple python script for collocation discovery

Sun Aug 17 04:28:42 CEST 2008

Emad Nawfal (عماد نوفل) wrote:
> Thank you so much Steve,
> I followed your advice about calculating on the fly and it really rang 
> a bell. Now I have this script. It's faster and does not give me the 
> nasty memory error message the first one sometimes did:
> # Chi-squared collocation discovery
> # Important definitions first. Let's suppose that we
> # are trying to find whether "powerful computers" is a collocation
> # N = The number of all bigrams in the corpus
> # O11 = how many times the bigram "powerful computers" occurs in the
> # corpus
> # O22 = the number of bigrams not having either word in our
> # collocation = N - O11 - O12 - O21
> # O12 = The number of bigrams whose second word is our second word
> # but whose first word is not "powerful"
> # O21 = The number of bigrams whose first word is our first word, but
> # whose second word is different from our second word
> ###########################################################
>  
> print """
> *************************************************
> *   Welcome to the Collocationer                
> *                                        *
> *                                               *
> *************************************************
> """
> # Let's first get the text and turn into bigrams
> #tested_collocate = raw_input("Enter the bigram you think is a collocation\n")
> #word1 = tested_collocate.split()[0]
> #word2 = tested_collocate.split()[1]
> word1 = 'United'
> word2 = 'States'
>  
> infile = open("1.txt")
> # initialize the counters
>  
> N = 0
> O11= 0
> O22 = 0
> O12 = 0
> O21 = 0
> for line in infile:
>     # a variable to hold the length of each line
>     length = len(line.split())
>
>     if len(line.split()) <=1:
>         continue
>     for word in line.split():
>         N+=1
>     for i,v in enumerate(line.split()):
>         if i< length-1:
>             if word1 == v and word2 == line.split()[i+1]:
>                 O11 +=1
>     for i,v in enumerate(line.split()):
>         if i < length -1:
>             if word1 != v and word2 != line.split()[i+1]:
>                 O22+=1
>     for i,v in enumerate(line.split()):
>         if i< length-1:
>             if word1 != v and word2 == line.split()[i+1]:
>                 O12+=1
>     for i,v in enumerate(line.split()):
>         if i< length-1:
>             if word1 == v and word2 != line.split()[i+1]:
>                 O21+=1
>    
>                  
>  
>  
> chi2 = (N * ((O11 * O22 - O12 * O21) ** 2))/ float((O11 + O12) * (O11 
> + O21) * (O12 + O22) * (O21 + O22))
> print "Chi-Squared = ", chi2
> if chi2 > 3.841:
>     print "These two words form a collocation"
> else:
>     print "These two words do not form a collocation"
>  
I'd like to jump in here and offer a few refinements that make the code 
simpler and more "Pythonic". In the background I'm also researching how 
to use dictionaries to make things even better. Some guidelines:
-  use initial lower case for variable and function names, upper case 
for classes
-  don't repeat calculations - do them once and save the result in a 
variable
-  don't repeat loops - you can put the calculations for o11, o12, o21 
and o22 all under one for loop
-  obtain one word at a time as rightWord and then save it as leftWord
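
The dictionary idea mentioned above is not spelled out in this message. 
One possible reading, offered only as a sketch, is to count every bigram 
in a single pass and derive the four cells for any word pair afterwards. 
The names bigram_counts and table_for are hypothetical, and the sample 
sentence is made up for illustration.

```python
def bigram_counts(words):
    """Count all bigrams in one pass over a list of words."""
    pairs = {}   # (left, right) -> how often that bigram occurs
    lefts = {}   # word -> how often it is the first word of a bigram
    rights = {}  # word -> how often it is the second word of a bigram
    for left, right in zip(words, words[1:]):
        pairs[(left, right)] = pairs.get((left, right), 0) + 1
        lefts[left] = lefts.get(left, 0) + 1
        rights[right] = rights.get(right, 0) + 1
    return pairs, lefts, rights


def table_for(word1, word2, pairs, lefts, rights):
    """Derive (o11, o12, o21, o22) for one candidate collocation."""
    total = sum(pairs.values())            # number of bigrams
    o11 = pairs.get((word1, word2), 0)
    o21 = lefts.get(word1, 0) - o11        # word1 followed by another word
    o12 = rights.get(word2, 0) - o11       # word2 preceded by another word
    o22 = total - o11 - o12 - o21          # neither word in its position
    return o11, o12, o21, o22


# made-up sample text, for illustration only
words = "the United States and the United Nations met in the United States".split()
pairs, lefts, rights = bigram_counts(words)
cells = table_for('United', 'States', pairs, lefts, rights)
```

The counting pass runs once, so testing a second word pair only costs a 
few dictionary lookups instead of another trip through the file.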

# your initial code goes here, up to but not including
# for line in infile:

# get the first line so we can get the first word
line = infile.readline().split()
leftWord = line[0]
line = line[1:] # drop the first word
n = 1 # count the first word
o11 = o12 = o21 = o22 = 0
while line: # caution: a blank line also gives [] and ends the loop early
  n += len(line) # count words
  for rightWord in line:
    if word1 == leftWord and word2 == rightWord:
      o11 += 1
    elif word1 != leftWord and word2 != rightWord:
      o22 += 1
    elif word1 != leftWord and word2 == rightWord:
      o12 += 1
    else: # no need to test
      o21 += 1
    leftWord = rightWord
  line = infile.readline().split()

# rest of your code follows starting with
# chi2 = ...
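
The chi2 line from the original script can be wrapped in a small 
function, sketched here with the same formula. The function name 
chi_squared and the None return for an empty row or column are 
assumptions added for illustration, and the counts in the example are 
made up.

```python
def chi_squared(n, o11, o12, o21, o22):
    """Chi-squared for a 2x2 bigram table, same formula as the script."""
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # a whole row or column of the table is empty, so the
        # statistic is undefined (assumption: signal that with None)
        return None
    return n * (o11 * o22 - o12 * o21) ** 2 / float(denominator)


# made-up counts, for illustration only
chi2 = chi_squared(100, 10, 5, 5, 80)
is_collocation = chi2 is not None and chi2 > 3.841
```

The zero check matters when one of the two words never appears in the 
text, because the original expression would then divide by zero.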

# If you want to get even "sexier" you could create an array of counters
# counters = [[0,0],[0,0]]
# where the elements left to right represent o22, o12, o21 and o11
# taking advantage of the fact that False == 0 and True == 1:
  for rightWord in line:
    counters[word1 == leftWord][word2 == rightWord] += 1
    leftWord = rightWord
  line = infile.readline().split()
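
For reference, here is a self-contained sketch of the counters idea that 
works on a list of words instead of a file. The function name 
count_table and the sample sentence are hypothetical; the indexing trick 
is the one described above.

```python
def count_table(words, word1, word2):
    """Return [[o22, o12], [o21, o11]] for the bigram (word1, word2)."""
    counters = [[0, 0], [0, 0]]
    if not words:
        return counters
    leftWord = words[0]
    for rightWord in words[1:]:
        # the comparisons give False or True, which index as 0 or 1
        counters[word1 == leftWord][word2 == rightWord] += 1
        leftWord = rightWord
    return counters


# made-up sample text, for illustration only
words = "the United States and the United Nations met in the United States".split()
counters = count_table(words, 'United', 'States')
o22, o12 = counters[0]
o21, o11 = counters[1]
```

Unpacking the two rows at the end gives back the four names the chi2 
formula expects.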

-- 
Bob Gailer
Chapel Hill NC 