[Tutor] simple python scrip for collocation discovery

Sun Aug 17 01:28:42 CEST 2008

Thank you so much, Steve.
I followed your advice about calculating on the fly, and it really rang a
bell. Now I have this script. It's faster and does not give me the nasty
memory error message that the first one sometimes did:
# Chi-squared collocation discovery
# Important definitions first. Let's suppose that we
# are trying to find whether "powerful computers" is a collocation
# N = The number of all bigrams in the corpus
# O11 = how many times the bigram "powerful computers" occurs in the corpus
# O22 = the number of bigrams with neither word of our collocation in its
# position = N - O11 - O12 - O21
# O12 = The number of bigrams whose second word is our second word
# but whose first word is not "powerful"
# O21 = The number of bigrams whose first word is our first word, but whose
# second word is different from our second word
###########################################################

print """
*************************************************
*   Welcome to the Collocationer                *
*                                               *
*************************************************
"""
# Let's first get the text and turn it into bigrams
#tested_collocate = raw_input("Enter the bigram you think is a collocation\n")
#word1 = tested_collocate.split()[0]
#word2 = tested_collocate.split()[1]
word1 = 'United'
word2 = 'States'

infile = file("1.txt")
# initialize the counters

N = 0  # total number of bigrams
O11 = 0
O22 = 0
O12 = 0
O21 = 0
for line in infile:
    words = line.split()  # split each line only once
    if len(words) <= 1:
        continue
    # look at each pair of adjacent words on the line
    for i in range(len(words) - 1):
        first, second = words[i], words[i+1]
        N += 1  # count bigrams, not words, so N = O11 + O12 + O21 + O22
        if first == word1 and second == word2:
            O11 += 1
        elif first != word1 and second == word2:
            O12 += 1
        elif first == word1 and second != word2:
            O21 += 1
        else:
            O22 += 1
infile.close()


chi2 = (N * ((O11 * O22 - O12 * O21) ** 2)) / float(
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print "Chi-Squared = ", chi2
# 3.841 is the chi-squared critical value at p = 0.05 with one degree of freedom
if chi2 > 3.841:
    print "These two words form a collocation"
else:
    print "These two words do not form a collocation"
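
One caveat about the formula above: it divides by the product of the four marginal totals, so it raises ZeroDivisionError when one of them is zero, for example when word1 never occurs in the corpus. Below is a minimal sketch of a guarded version. It is written for Python 3 (the script above is Python 2), and the function name chi_squared and the choice to return 0.0 for an empty marginal are assumptions made for illustration, not part of the script.

```python
def chi_squared(o11, o12, o21, o22):
    """Pearson chi-squared statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # Assumption: treat an empty row or column as "no evidence".
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


# Hypothetical counts, chosen only to exercise the function.
print(chi_squared(10, 20, 30, 40))
```

Computing N inside the function as the sum of the four cells keeps it consistent with the table by construction.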

On Sat, Aug 16, 2008 at 2:09 PM, Steve Willoughby <steve at alchemy.com> wrote:

> On Sat, Aug 16, 2008 at 01:55:36PM -0400, Emad Nawfal wrote:
> > Hello Tutors,
> > I'm trying to write a small scrip to find collocations using chi squared,
>
> Minor nit, but the word you're looking for here is "script".  "Scrip"
> is also an English word but means something completely different.
> Looking professional in a field includes using the jargon correctly.
>
> > depending on a fairly big corpus.
> > The program below does a good job, but it's too slow, and I need to process
> > something like 50 million words.
> > How can I make it run fast?
>
> How fast is fast enough?
> What's its time now?
> Can you identify where it might be slowing down?
>
> Depending on the order of magnitude of the speedup you're looking
> to achieve, the answer could be very different.
>
> > # Let's first get the text and turn into bigrams
> > bigrams = []
> > infile = file("corpus.txt")
> > text = infile.read().lower().split()
>
> This strikes me as needlessly memory-consuming.  You might want to
> iterate over lines of text instead of sucking the entire file into
> a giant string.  I'm guessing you want to recognize words next to
> one another even if separated by a newline?  Be aware of the cost,
> though, of passing potentially huge data values around.
>
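
One way to read the advice above is to iterate over the file line by line while carrying the last word of each line over to the next, so that bigrams spanning a newline are still seen and the whole text never has to sit in memory. This is only a sketch in Python 3; the generator name iter_bigrams and the sample lines are hypothetical, and unlike the quoted code it does not lowercase the words.

```python
def iter_bigrams(lines):
    """Yield (word, next_word) pairs, including pairs that span a line break."""
    previous = None
    for line in lines:
        for word in line.split():
            if previous is not None:
                yield previous, word
            previous = word


# Any iterable of strings works here, including an open file object.
sample = ["the United", "States of", "", "America"]
print(list(iter_bigrams(sample)))
```

Because it is a generator, only one line and one carried-over word are held at a time.
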
> > infile.close()
> > for i,v in enumerate(text): # get words and their ranking number
> >      if i < len(text)-1: # This guarantees that the list index is not out of range
> >           bigram = v, text[i+1] # each word and the word that follows it
> >           bigrams.append(bigram)
>
> Why don't you trust enumerate's i values, out of curiosity?
>
> It seems to me if you think of what you're collecting here
> you could do some calculating on the fly as you look through
> the list of words and wouldn't need to be building these
> lists and then going back through them.
>
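
As an illustration of calculating on the fly, the four cells of the contingency table can be obtained from running tallies in a single pass, with no list of bigrams being stored. The sketch below is Python 3, and the function name contingency and the sample sentence are assumptions made for the example. It counts how often each word appears in its position and derives the off-diagonal cells by subtraction.

```python
def contingency(bigrams, word1, word2):
    """Return (o11, o12, o21, o22) for the bigram (word1, word2)."""
    n = o11 = first_matches = second_matches = 0
    for first, second in bigrams:
        n += 1
        if first == word1:
            first_matches += 1
        if second == word2:
            second_matches += 1
        if first == word1 and second == word2:
            o11 += 1
    o12 = second_matches - o11  # word2 preceded by some other word
    o21 = first_matches - o11   # word1 followed by some other word
    o22 = n - o11 - o12 - o21   # neither position matches
    return o11, o12, o21, o22


words = "the United States and the United Nations".split()
print(contingency(zip(words, words[1:]), "United", "States"))  # (1, 0, 1, 4)
```
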
> I'm trying to give some vague help without doing the work for you
> because we don't do homework exercises for people :)
>
> --
> Steve Willoughby    |  Using billion-dollar satellites
> steve at alchemy.com   |  to hunt for Tupperware.
>

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com
--------------------------------------------------------