[Tutor] fine in interpreter, hangs in batch
Kent Johnson
kent37 at tds.net
Mon Mar 19 18:23:17 CET 2007
Switanek, Nick wrote:
> Thanks very much for your help.
>
> I did indeed neglect to put the "print" in the code that I sent to the
> list.
>
> It appears that the step that is taking a long time, and that therefore
> makes me think that the script is somehow broken, is creating a
> dictionary of frequencies from the list of ngrams. To do this, I've
> written, for example:
>
> bigramDict = {}
> bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist)-2+1)]
> for bigram in bigrams:
>     if bigram in bigramDict.keys(): bigramDict[bigram] += 1
>     else: bigramDict[bigram] = 1
Ouch! bigramDict.keys() creates a *new* *list* of all the keys in
bigramDict. You then search the list - a linear search! - for bigram.
I'm not surprised that this gets slow.
If you change that line to

    if bigram in bigramDict: bigramDict[bigram] += 1

you should see a dramatic improvement.
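To make the difference concrete, here is a small self-contained sketch (the
wordlist is made up for illustration). The `in` test on the dict itself is a
constant-time hash lookup, whereas `in bigramDict.keys()` builds a list and
scans it front to back on every iteration. A `collections.defaultdict` version
is shown as well, which does the same counting with less code:

```python
from collections import defaultdict

# Illustrative input, not from the original post.
wordlist = "the quick brown fox jumps over the quick brown fox".split()

# range(len(wordlist) - 1) yields every valid bigram start index.
bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist) - 1)]

# Membership test directly on the dict: O(1) hash lookup per bigram.
bigramDict = {}
for bigram in bigrams:
    if bigram in bigramDict:
        bigramDict[bigram] += 1
    else:
        bigramDict[bigram] = 1

# Equivalent counting with defaultdict: missing keys start at int() == 0.
counts = defaultdict(int)
for bigram in bigrams:
    counts[bigram] += 1

print(bigramDict['the quick'])  # -> 2
```

With the hash lookup the whole pass is linear in the number of bigrams, so
500,000 of them should finish in seconds rather than tens of minutes.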
Kent
>
>
> With around 500,000 bigrams, this is taking over 25 minutes to run (and
> I haven't sat around to let it finish) on an XP machine at 3.0GHz and
> 1.5GB RAM. I bet I'm trying to reinvent the wheel here, and that there
> are faster algorithms available in some package. I think possibly an
> indexing package like PyLucene would help create frequency dictionaries,
> but I can't figure it out from the online material available. Any
> suggestions?
>
> Thanks,
> Nick