[Tutor] simple python scrip for collocation discovery

Sat Aug 16 20:20:04 CEST 2008

Dear Steve,
Thank you so much for your help.
Actually this is not homework. It's gonna be used in building a collocation
dictionary as part of my dissertation. Please remember that I'm not a
programmer, so much of the terminology may not be accessible to me.

Thank you also for drawing my attention to the typo.

On Sat, Aug 16, 2008 at 2:09 PM, Steve Willoughby <steve at alchemy.com> wrote:

> On Sat, Aug 16, 2008 at 01:55:36PM -0400, Emad Nawfal wrote:
> > Hello Tutors,
> > I'm trying to write a small scrip to find collocations using chi squared,
>
> Minor nit, but the word you're looking for here is "script".  "Scrip"
> is also an English word but means something completely different.
> Looking professional in a field includes using the jargon correctly.
>
> > depending on a fairly big corpus.
> > The program below does a good job, but it's too slow, and I need to process
> > something like 50 million words.
> > How can I make it run fast?
>
> How fast is fast enough?
> What's its time now?
> Can you identify where it might be slowing down?
>
> Depending on the order of magnitude of the speedup you're looking
> to achieve, the answer could be very different.
>
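A rough sketch of how the timing questions above could be answered, assuming a current Python 3 interpreter. The names build_bigrams, timed and profile_report are hypothetical helpers, not part of the original script, and build_bigrams is only a stand-in for whichever step is suspected of being slow.

```python
import cProfile
import io
import pstats
import time


def build_bigrams(words):
    """Hypothetical stand-in for the slow step: pair each word with the next."""
    return list(zip(words, words[1:]))


def timed(func, *args):
    """Call func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the most expensive entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Timing a sample of, say, one million words and scaling up gives a first estimate for the full 50 million, and the profile report shows which calls account for most of that time.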
> > # Let's first get the text and turn into bigrams
> > bigrams = []
> > infile = open("corpus.txt")
> > text = infile.read().lower().split()
>
> This strikes me as needlessly memory-consuming.  You might want to
> iterate over lines of text instead of sucking the entire file into
> a giant string.  I'm guessing you want to recognize words next to
> one another even if separated by a newline?  Be aware of the cost,
> though, of passing potentially huge data values around.
>
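A minimal sketch of the line-by-line approach suggested above, assuming Python 3. The function name iter_bigrams is made up for illustration. It remembers the last word of each line so that pairs separated by a newline are still recognized, and it never holds more than one line in memory.

```python
def iter_bigrams(lines):
    """Yield (word, next_word) pairs from an iterable of text lines.

    The last word seen is carried over to the next line, so a pair that
    straddles a newline is not lost.
    """
    previous = None
    for line in lines:
        for word in line.lower().split():
            if previous is not None:
                yield previous, word
            previous = word


# Small in-memory demonstration; an open file object can be passed the same way.
sample_lines = ["The cat", "sat down"]
sample_bigrams = list(iter_bigrams(sample_lines))
```

Because an open file iterates over its lines, passing the file object itself as the lines argument avoids reading the whole corpus into one giant string.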
> > infile.close()
> > for i,v in enumerate(text): # get each word and its position in the list
> >      if i < len(text)-1: # This guarantees that the list index is not out of range
> >           bigram = v, text[i+1] # each word and the word that follows it
> >           bigrams.append(bigram)
>
> Why don't you trust enumerate's i values, out of curiosity?
>
> It seems to me if you think of what you're collecting here
> you could do some calculating on the fly as you look through
> the list of words and wouldn't need to be building these
> lists and then going back through them.
>
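One possible reading of "calculating on the fly", sketched under the assumption that the chi-squared step only needs word and bigram frequencies. The name count_words_and_bigrams is hypothetical, and collections.Counter is a modern standard-library class that was not yet available when this thread was written.

```python
from collections import Counter


def count_words_and_bigrams(lines):
    """Count words and adjacent word pairs in a single pass over the lines.

    No list of bigrams is built; only the two frequency tables grow.
    """
    word_counts = Counter()
    bigram_counts = Counter()
    previous = None
    for line in lines:
        for word in line.lower().split():
            word_counts[word] += 1
            if previous is not None:
                bigram_counts[(previous, word)] += 1
            previous = word
    return word_counts, bigram_counts


# Small in-memory demonstration.
word_counts, bigram_counts = count_words_and_bigrams(["a b a", "b c"])
```

The tables grow with the number of distinct words and pairs rather than with the length of the corpus, which is usually far smaller for a 50-million-word text.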
> I'm trying to give some vague help without doing the work for you
> because we don't do homework exercises for people :)
>
> --
> Steve Willoughby    |  Using billion-dollar satellites
> steve at alchemy.com   |  to hunt for Tupperware.
>

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
http://emnawfal.googlepages.com