# [Tutor] simple python scrip for collocation discovery

Sun Aug 17 01:28:42 CEST 2008

```
Thank you so much Steve,
bell. Now I have this script. It's faster and does not give me the nasty
memory error message the first one sometimes did:
# Chi-squared collocation discovery
# Important definitions first. Let's suppose that we
# are trying to find whether "powerful computers" is a collocation
# N = The number of all bigrams in the corpus
# O11 = how many times the bigram "powerful computers" occurs in the corpus
# O22 = the number of bigrams containing neither word of our collocation
# in its position = N - O11 - O12 - O21
# O12 = The number of bigrams whose second word is our second word
# but whose first word is not "powerful"
# O21 = The number of bigrams whose first word is our first word, but whose
# second word is different from our second word
###########################################################

print """
*************************************************
*   Welcome to the Collocationer
*                                        *
*                                               *
*************************************************
"""
# Let's first get the text and turn into bigrams
#tested_collocate = raw_input("Enter the bigram you think is a collocation\n")
#word1 = tested_collocate.split()[0]
#word2 = tested_collocate.split()[1]
word1 = 'United'
word2 = 'States'

infile = file("1.txt")
# initialize the counters

N = 0
O11 = 0
O22 = 0
O12 = 0
O21 = 0
for line in infile:
    length = len(line.split()) # number of words on this line

    if length <= 1:
        continue
    N += length - 1 # a line with k words contributes k - 1 bigrams
    for i,v in enumerate(line.split()):
        if i < length-1:
            if word1 == v and word2 == line.split()[i+1]:
                O11 += 1
    for i,v in enumerate(line.split()):
        if i < length-1:
            if word1 != v and word2 != line.split()[i+1]:
                O22 += 1
    for i,v in enumerate(line.split()):
        if i < length-1:
            if word1 != v and word2 == line.split()[i+1]:
                O12 += 1
    for i,v in enumerate(line.split()):
        if i < length-1:
            if word1 == v and word2 != line.split()[i+1]:
                O21 += 1

chi2 = (N * ((O11 * O22 - O12 * O21) ** 2)) / float(
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print "Chi-Squared = ", chi2
if chi2 > 3.841:
    print "These two words form a collocation"
else:
    print "These two words do not form a collocation"
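
# ---------------------------------------------------------------
# A minimal single-pass sketch of the same counting, offered as an
# illustration rather than as the script posted above. Assumptions:
# bigrams never span a line break, N is taken to be the number of
# bigrams (O11 + O12 + O21 + O22), and a zero denominator is reported
# as a score of 0.0. The names count_bigram_table and chi_squared are
# hypothetical helpers, not part of any library.

def count_bigram_table(lines, word1, word2):
    """Return (O11, O12, O21, O22) for the bigram (word1, word2)."""
    o11 = o12 = o21 = o22 = 0
    for line in lines:
        words = line.split()  # split each line only once
        # zip pairs every word with its successor; a line with fewer
        # than two words yields no pairs at all
        for first, second in zip(words, words[1:]):
            if first == word1:
                if second == word2:
                    o11 += 1  # word1 followed by word2
                else:
                    o21 += 1  # word1 followed by something else
            elif second == word2:
                o12 += 1      # something else followed by word2
            else:
                o22 += 1      # neither word in its position
    return o11, o12, o21, o22


def chi_squared(o11, o12, o21, o22):
    """Chi-squared statistic for a 2x2 contingency table of counts."""
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # assumption: an empty row or column means no evidence
    return n * float(o11 * o22 - o12 * o21) ** 2 / denominator

# Possible usage, assuming the corpus is in 1.txt as above:
#   table = count_bigram_table(open("1.txt"), 'United', 'States')
#   chi2 = chi_squared(*table)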

On Sat, Aug 16, 2008 at 2:09 PM, Steve Willoughby <steve at alchemy.com> wrote:

> On Sat, Aug 16, 2008 at 01:55:36PM -0400, Emad Nawfal (???? ????) wrote:
> > Hello Tutors,
> > I'm trying to write a small scrip to find collocations using chi squared,
>
> Minor nit, but the word you're looking for here is "script".  "Scrip"
> is also an English word but means something completely different.
> Looking professional in a field includes using the jargon correctly.
>
> > depending on a fairly big corpus.
> > The program below does a good job, but it's too slow, and I need to process
> > something like 50 million words.
> > How can I make it run fast?
>
> How fast is fast enough?
> What's its time now?
> Can you identify where it might be slowing down?
>
> Depending on the order of magnitude of the speedup you're looking
> to achieve, the answer could be very different.
>
> > # Let's first get the text and turn into bigrams
> > bigrams = []
> > infile = file("corpus.txt")
>
> This strikes me as needlessly memory-consuming.  You might want to
> iterate over lines of text instead of sucking the entire file into
> a giant string.  I'm guessing you want to recognize words next to
> one another even if separated by a newline?  Be aware of the cost,
> though, of passing potentially huge data values around.
>
> > infile.close()
> > for i,v in enumerate(text): # get words and their ranking number
> >      if i < len(text)-1: # This guarantees that the list index is not out of range
> >           bigram = v, text[i+1] # each word and the word that follows it
> >           bigrams.append(bigram)
>
> Why don't you trust enumerate's i values, out of curiosity?
>
> It seems to me if you think of what you're collecting here
> you could do some calculating on the fly as you look through
> the list of words and wouldn't need to be building these
> lists and then going back through them.
>
> I'm trying to give some vague help without doing the work for you
> because we don't do homework exercises for people :)
>
> --
> Steve Willoughby    |  Using billion-dollar satellites
> steve at alchemy.com   |  to hunt for Tupperware.
>

--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"