[Tutor] fine in interpreter, hangs in batch

Kent Johnson kent37 at tds.net
Mon Mar 19 18:23:17 CET 2007


Switanek, Nick wrote:
> Thanks very much for your help.
> 
> I did indeed neglect to put the "print" in the code that I sent to the
> list.
> 
> It appears that the step that is taking a long time, and that therefore
> makes me think that the script is somehow broken, is creating a
> dictionary of frequencies from the list of ngrams. To do this, I've
> written, for example:
> 
> bigramDict = {}
> bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist)-2+1)]
> for bigram in bigrams:
> 	if bigram in bigramDict.keys(): bigramDict[bigram] += 1
> 	else: bigramDict[bigram] = 1

Ouch! bigramDict.keys() creates a *new* *list* of all the keys in 
bigramDict, and the `in` test then does a linear search of that list 
for bigram. You pay that cost once per bigram, so the whole loop is 
quadratic in the number of distinct bigrams. I'm not surprised that 
this gets slow.

If you change that line to
   if bigram in bigramDict: bigramDict[bigram] += 1
you should see a dramatic improvement.
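An equivalent, slightly more idiomatic spelling uses dict.get, which avoids the explicit membership test entirely. A minimal sketch (the word list here is made up for illustration):

```python
# Count bigrams with dict.get: one O(1) hash lookup per bigram,
# instead of building and linearly scanning a list of keys each time.
wordlist = "the cat sat on the cat mat".split()

# len(wordlist) - 1 adjacent pairs, same result as range(len(wordlist)-2+1)
bigrams = [' '.join(wordlist[i:i+2]) for i in range(len(wordlist) - 1)]

bigramDict = {}
for bigram in bigrams:
    # get() returns 0 when the key is absent, so no if/else is needed
    bigramDict[bigram] = bigramDict.get(bigram, 0) + 1

print(bigramDict['the cat'])  # 'the cat' appears twice in the sample
```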

Kent

> 
> 
> With around 500,000 bigrams, this is taking over 25 minutes to run (and
> I haven't sat around to let it finish) on an XP machine at 3.0GHz and
> 1.5GB RAM. I bet I'm trying to reinvent the wheel here, and that there
> are faster algorithms available in some package. I think possibly an
> indexing package like PyLucene would help create frequency dictionaries,
> but I can't figure it out from the online material available. Any
> suggestions?
> 
> Thanks,
> Nick
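On the "reinventing the wheel" question: no external indexing package is needed, because the standard library covers frequency counting directly. collections.defaultdict (Python 2.5+) removes the if/else branch, and collections.Counter (added later, in Python 2.7, so it postdates this thread) does the whole job in one call. A sketch with invented sample text:

```python
from collections import Counter

# Sample text, made up for illustration.
text = "to be or not to be".split()

# A generator of adjacent word pairs; no intermediate list is stored.
bigrams = (' '.join(text[i:i+2]) for i in range(len(text) - 1))

# Counter is a dict subclass that tallies hashable items in O(n) total.
bigramDict = Counter(bigrams)

print(bigramDict['to be'])  # 'to be' occurs twice in the sample
```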
