large dictionary creation takes a LOT of time.

Roy Smith roy at panix.com
Fri Apr 29 08:29:04 EDT 2005


In article <1114757468.367583.210840 at z14g2000cwz.googlegroups.com>,
 "possibilitybox" <possibilitybox at gmail.com> wrote:

> this code here:
> 
> 
> def wordcount(lines):
>     for i in range(len(lines)/8):
>         words = lines[i].split(" ")
>         if not locals().has_key("frequency"):
>             frequency = {}
>         for word in words:
>             if frequency.has_key(word):
>                 frequency[word] += 1
>             else:
>                 frequency[word] = 1
>     return frequency
> wordcount(lines)
> 
> is taking over six minutes to run on a two megabyte text file.  i
> realize that's a big text file, a really big one (it's actually the
> full text of don quixote.).

Something doesn't make sense with your timing.

I just downloaded the text of Don Quixote 
(http://www.gutenberg.org/dirs/9/9/996/996.txt).  It's about 2.3 Mbytes, 
428 kwords, 40 klines.  This includes the text of the novel itself, plus a 
little boilerplate text added by the Gutenberg Project folks.

I ran your program against it on my PowerBook (1 GHz PowerPC, Python 2.3.4, 
768 Mbytes RAM).  It took about 0.4 seconds.  When I got rid of the "/8" in 
the range() call (so it processed the whole text), it took about 1.8 
seconds (0.24 of which were just reading the file).  Some other posters 
reported similar findings.
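
(The timing itself was nothing fancy; just a wall-clock measurement along 
these lines, with your wordcount() defined and the Gutenberg file saved 
locally:)

import time

lines = open("996.txt").readlines()     # the downloaded Don Quixote text
start = time.time()
wordcount(lines)
print "wordcount() took %.2f seconds" % (time.time() - start)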

What kind of system are you running it on?  The only thing I can think of 
is that you've got way too little memory and your machine is just 
thrashing.  My Python process is about 31 Mb when it first starts up, grows 
to 35 when the file is read into a list, then gets to 38 after the call to 
wordcount().
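
(If you want to watch process size from inside Python rather than with top 
or ps, the standard resource module gives a rough peak figure; note that 
the units of ru_maxrss are platform-dependent:)

import resource

# peak resident set size of this process so far
print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss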

> i'm trying to figure out how.  is there a
> better way for me to do a frequency count of all the words in the text?
>  it seems to me like this should scale linearly, but perhaps it doesn't?
> i don't know much about algorithmic complexity.  if someone could give
> a breakdown of this function's complexity as well i'd be much obliged.

Well, I don't see anything in your code which isn't linear.  There are some 
places where you could do things a little more efficiently, but these would 
all be replacing one linear process with a better linear process.  Small 
speedups, but not the drastic kind of speedups you would expect by 
replacing a quadratic process with a linear one.

Here's a few things I would change:

>         if not locals().has_key("frequency"):
>             frequency = {}

This is kind of silly.  Just factor this out of the main loop, and do 
"frequency = {}" before the first for loop.  This won't speed things up 
other than trivially, but it'll make your code easier to read and 
understand.
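
In other words, something like:

def wordcount(lines):
    frequency = {}                      # created once, before the loop
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency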

Next, replace

> for i in range(len(lines)):
>         words = lines[i].split(" ")

with something like

> for line in lines:
>    words = line.split()

It's marginally faster, easier to read, and is actually more correct; 
calling split() with no arguments makes it split on arbitrary white space, 
which is probably what you really want:

>>> "foo bar    baz".split(" ")
['foo', 'bar', '', '', '', 'baz']

>>> "foo bar    baz".split()
['foo', 'bar', 'baz']

And lastly, instead of

>             if frequency.has_key(word):
>                 frequency[word] += 1
>             else:
>                 frequency[word] = 1

I would do:

try:
   frequency[word] += 1
except KeyError:
   frequency[word] = 1

which is usually a little bit faster.  Somebody else mentioned using the 
get() method of dictionaries here; that might be even better, but I learned 
the try/except trick before get() existed, so I tend to stick to that :-)
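
(The get() version, if you want to try it, is a one-liner inside the word 
loop:)

for word in words:
    frequency[word] = frequency.get(word, 0) + 1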

But, as I said, all of these are just minor tweaks.  Overall, your code 
looks like it should run in O(n), so fixing your code is not where you 
should be looking.  Something external (i.e. memory thrashing or some other 
exceptionally bad bit of system performance) has to be causing the 
horrendously bad performance you're seeing, and that's where you should be 
concentrating your efforts.
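
Just for reference, here's roughly what the function looks like with all 
of those tweaks folded in (and with the "/8" dropped so it counts the 
whole file):

def wordcount(lines):
    frequency = {}
    for line in lines:
        for word in line.split():       # split on arbitrary whitespace
            try:
                frequency[word] += 1
            except KeyError:
                frequency[word] = 1
    return frequency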

BTW, if you suspected you had some kind of non-linear algorithm, and didn't 
trust code inspection to verify its existence, you could just run your 
program a bunch of times on different sized data sets, and plot runtime vs. 
input size to see what kind of curve you get.
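
Quick and dirty, something like this would do (you can skip the plotting 
and just eyeball the numbers):

import time

lines = open("996.txt").readlines()     # or any largish text file
for n in range(5000, len(lines), 5000):
    start = time.time()
    wordcount(lines[:n])
    print n, time.time() - start        # should grow roughly linearly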


