[Tutor] Please look at my wordFrequency.py

Dick Moores rdm at rcblue.com
Tue Oct 11 21:10:59 CEST 2005


Kent Johnson wrote at 10:37 10/11/2005:
>Kent Johnson wrote:
> > Dick Moores wrote:
> >
> >> Kent Johnson wrote at 03:24 10/11/2005:
> >>
> >>> Dick Moores wrote:
> >>>
> >>>> (Execution took about 30 sec. with my computer.)
> >>>
> >>>
> >>> That's way too long
> >>
> >>
> >>
> >> How long would you expect? I've already made some changes but haven't
> >> seen the time change much.
> >
> >
> > A couple of seconds at most, unless you are running it on some dog
> > computer. It's just not that much text and you should be able to process
> > it in a couple of passes at most.
>
>OK I couldn't resist. I took your program and ran it on my computer - 
>took about 38 seconds and got the same results as you. Then I made the 
>changes I outlined, and a few other similar ones, and got it down to 34 secs.

Yes, that's about the difference I was seeing. Thanks for taking the 
trouble. I went from 30 to 27. With no regex use (don't understand it yet).

>  Finally I made the change suggested by John Fouhy - to accumulate the 
> counts in a dict - and the time went down to 0.23 seconds.

WOW! I didn't implement John's change because I didn't understand it. 
Haven't dealt with dictionaries yet.
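For anyone else following along who hasn't met dictionaries yet either, here is a minimal sketch of what I take the dict approach to be (the variable names L, counts, and F are my own; I'm assuming the words are already split into a list, as in wordFrequency.py):

```python
# Count word frequencies in a single pass using a dict.
# L stands in for the real word list from the text file.
L = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

counts = {}
for word in L:
    # dict lookup and update are O(1) on average, so the whole
    # loop is O(n) instead of the O(n^2) list.count() version
    counts[word] = counts.get(word, 0) + 1

# Sort by descending count, mirroring the (k, word) tuples in F
F = sorted(((k, w) for w, k in counts.items()), reverse=True)
print(F[0])  # the most frequent word and its count
```

The key point is that each word is looked at exactly once, which is why the runtime dropped from tens of seconds to a fraction of a second.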

> >> Also, I'm puzzled that whether or not psyco is employed makes no
> >> difference in the time. Can you explain why?
> >
> >
> > My guess is it's because you have so many O(n^2) elements in the code.
> > You have to get your algorithm to be O(n).

OK, I finally bit the bullet, googled O(n^2), and read about Big O 
notation at <http://en.wikipedia.org/wiki/Big_O_notation> and 
<http://www.cprogramming.com/tutorial/computersciencetheory/algorithmicefficiency1.html>
I think I've now got at least the basic idea.

>In particular this code:
>for word in L:
>     k = L.count(word)
>     if (k,word) not in F:
>         F.append((k,word))
>
>L.count() has to scan through the entire list (L) looking for a match 
>with each word. So for each word, you are making 26140 string compares. 
>The total number of compares is 26140*26140 or 683,299,600. That's a lot!
>Then, for each word, you scan F for a match. Now you are doing tuple 
>compares. The number of compares will increase as the length of F, but 
>overall it will be about 26140*3700/2 or 48,359,000 compares.

Kent, you're beginning to get through to me. Thanks for the details and the 
numbers.
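To convince myself of those numbers, here's a rough timing sketch comparing the two approaches on synthetic data (the vocabulary and list size are made up; the real script had about 26,140 words):

```python
import random
import time

# Synthetic word list standing in for the real one
vocab = ["word%d" % i for i in range(100)]
words = [random.choice(vocab) for _ in range(5000)]

# Quadratic version: list.count() rescans the whole list per word
t0 = time.time()
F = []
for w in words:
    k = words.count(w)          # O(n) scan each time -> O(n^2) total
    if (k, w) not in F:         # another scan, this time over F
        F.append((k, w))
t_quad = time.time() - t0

# Linear version: one pass with a dict
t0 = time.time()
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
t_lin = time.time() - t0

print("quadratic: %.3fs   linear: %.3fs" % (t_quad, t_lin))
```

Even at 5,000 words the difference is dramatic; at 26,140 it's the difference between 30-odd seconds and a quarter of a second.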

>Compare this to the dictionary version which just iterates L once, doing 
>a dictionary lookup and write for each word.
>
>The reason psyco doesn't make much difference is because all the time is 
>spent in list.count() which is already C code.

Ah. But how can I know what is in C code and what isn't? For example, in 
a previous post you say that L.extend(e) is in C, and imply that 
L.append(e) isn't, and that therefore L.extend(e) should be used.
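For what it's worth, the two methods also do different things, so they aren't simply interchangeable. A small sketch of the difference as I understand it:

```python
# append() adds its argument as a single element;
# extend() adds each item of an iterable individually.
a = [1, 2]
a.append([3, 4])   # a is now [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])   # b is now [1, 2, 3, 4]

# In CPython both are built-in list methods implemented in C,
# which is why psyco can't do much to speed them up.
print(a)
print(b)
```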

Well, back to Hetland's Beginning Python.

Dick  
