[Tutor] simple python scrip for collocation discovery
steve at alchemy.com
Sat Aug 16 20:09:23 CEST 2008
On Sat, Aug 16, 2008 at 01:55:36PM -0400, Emad Nawfal wrote:
> Hello Tutors,
> I'm trying to write a small scrip to find collocations using chi squared,
Minor nit, but the word you're looking for here is "script". "Scrip"
is also an English word but means something completely different.
Looking professional in a field includes using the jargon correctly.
> depending on a fairly big corpus.
> The program below does a good job, but it's too slow, and I need to process
> something like 50 million words.
> How can I make it run fast?
How fast is fast enough?
What's its time now?
Can you identify where it might be slowing down?
Depending on the order of magnitude of the speedup you're looking
to achieve, the answer could be very different.
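For example, a first pass with the standard-library profiler will usually show where the time goes. This is only a generic sketch; the function below is a stand-in for the slow part of your own script:

```python
import cProfile
import pstats

def build_bigrams():
    # Stand-in for the bigram-building loop in your script.
    words = "the quick brown fox jumps over the lazy dog".split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

profiler = cProfile.Profile()
profiler.enable()
bigrams = build_bigrams()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")  # biggest time sinks first
# stats.print_stats(10)  # uncomment to print the top 10 entries
```

Running that against the real script will tell you whether the bottleneck is I/O, the list building, or the statistics pass, which changes what's worth optimizing.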
> # Let's first get the text and turn into bigrams
> bigrams = []
> infile = file("corpus.txt")
> text = infile.read().lower().split()
This strikes me as needlessly memory-consuming. You might want to
iterate over lines of text instead of sucking the entire file into
a giant string. I'm guessing you want to recognize words next to
one another even if separated by a newline? Be aware of the cost,
though, of passing potentially huge data values around.
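One way to do that (purely a sketch, with made-up helper names) is to stream words off the file one line at a time and pair each word with the one before it, which also handles bigrams that straddle a newline:

```python
def iter_words(lines):
    """Yield lowercased words one at a time from an iterable of lines
    (an open file works), so the corpus never sits in memory at once."""
    for line in lines:
        for word in line.lower().split():
            yield word

def iter_bigrams(words):
    """Pair each word with its successor, even across line breaks."""
    prev = None
    for word in words:
        if prev is not None:
            yield (prev, word)
        prev = word
```

With those, a counting loop can consume iter_bigrams(iter_words(open("corpus.txt"))) without ever holding the full text as one giant string.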
> for i,v in enumerate(text): # get words and their ranking number
> if i < len(text)-1: # This guarantees that the list index is not out of range
> bigram = v, text[i+1] # pair each word with the word that follows it
Why don't you trust enumerate's i values, out of curiosity?
It seems to me that, if you think about what you're collecting here,
you could do some of the calculation on the fly as you pass through
the list of words, and wouldn't need to build these lists and then
go back through them.
I'm trying to give some vague help without doing the work for you
because we don't do homework exercises for people :)
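To make the shape of that one-pass idea concrete without writing your program for you: accumulate unigram and bigram counts in a single sweep, then score any pair from its counts. This is a sketch under assumptions of my own (the chi-squared here is the usual 2x2 contingency form, and it glosses over the off-by-one at the corpus edges):

```python
from collections import Counter

def count_in_one_pass(words):
    """Accumulate unigram and bigram counts in a single sweep,
    instead of building bigram lists and re-scanning them."""
    unigrams = Counter()
    bigrams = Counter()
    prev = None
    total = 0
    for word in words:
        unigrams[word] += 1
        total += 1
        if prev is not None:
            bigrams[(prev, word)] += 1
        prev = word
    return unigrams, bigrams, total

def chi_squared(bigrams, unigrams, total, pair):
    """Pearson chi-squared for one bigram from its 2x2 contingency table.
    Edge effects (a word's last occurrence having no successor) are ignored."""
    w1, w2 = pair
    o11 = bigrams[pair]               # w1 followed by w2
    o12 = unigrams[w1] - o11          # w1 followed by something else
    o21 = unigrams[w2] - o11          # something other than w1, then w2
    n = total - 1                     # number of bigram tokens
    o22 = n - o11 - o12 - o21         # neither word involved
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den if den else 0.0
```

Because the counters grow with vocabulary size rather than corpus size, a single pass like this scales to tens of millions of words where list-of-bigrams approaches bog down.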
Steve Willoughby | Using billion-dollar satellites
steve at alchemy.com | to hunt for Tupperware.