[Tutor] Please look at my wordFrequency.py

Dick Moores rdm at rcblue.com
Tue Oct 11 21:10:59 CEST 2005


Kent Johnson wrote at 10:37 10/11/2005:
>Kent Johnson wrote:
> > Dick Moores wrote:
> >
> >> Kent Johnson wrote at 03:24 10/11/2005:
> >>
> >>> Dick Moores wrote:
> >>>
> >>>> (Execution took about 30 sec. with my computer.)
> >>>
> >>>
> >>> That's way too long
> >>
> >>
> >>
> >> How long would you expect? I've already made some changes but haven't
> >> seen the time change much.
> >
> >
> > A couple of seconds at most, unless you are running it on some dog
> > computer. It's just not that much text and you should be able to process
> > it in a couple of passes at most.
>
>OK I couldn't resist. I took your program and ran it on my computer - 
>took about 38 seconds and got the same results as you. Then I made the 
>changes I outlined, and a few other similar ones, and got it down to 34 secs.

Yes, that's about the difference I was seeing. Thanks for taking the 
trouble. I went from 30 to 27. With no regex use (don't understand it yet).

>  Finally I made the change suggested by John Fouhy - to accumulate the 
> counts in a dict - and the time went down to 0.23 seconds.

WOW! I didn't implement John's change because I didn't understand it. 
Haven't dealt with dictionaries yet.
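For anyone else following along who hasn't met dictionaries yet either, here is a minimal sketch of what I take the dict approach to be (the variable names L, counts, and F are my own; I'm assuming the words are already split into a list, as in wordFrequency.py):

```python
# Count word frequencies in a single pass using a dict.
# L stands in for the real word list from the text file.
L = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

counts = {}
for word in L:
    # dict lookup and update are O(1) on average, so the whole
    # loop is O(n) instead of the O(n^2) list.count() version
    counts[word] = counts.get(word, 0) + 1

# Sort by descending count, mirroring the (k, word) tuples in F
F = sorted(((k, w) for w, k in counts.items()), reverse=True)
print(F[0])  # the most frequent word and its count
```

The key point is that each word is looked at exactly once, which is why the runtime dropped from tens of seconds to a fraction of a second.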

> >> Also, I'm puzzled that whether or not psyco is employed makes no
> >> difference in the time. Can you explain why?
> >
> >
> > My guess is it's because you have so many O(n^2) elements in the code.
> > You have to get your algorithm to be O(n).

OK, I finally bit the bullet, googled O(n^2), and read about Big O 
notation at <http://en.wikipedia.org/wiki/Big_O_notation> and 
<http://www.cprogramming.com/tutorial/computersciencetheory/algorithmicefficiency1.html>
I think I've now got at least the basic idea.

>In particular this code:
>for word in L:
>     k = L.count(word)
>     if (k,word) not in F:
>         F.append((k,word))
>
>L.count() has to scan through the entire list (L) looking for a match 
>with each word. So for each word, you are making 26140 string compares. 
>The total number of compares is 26140*26140 or 683,299,600. That's a lot!
>Then, for each word, you scan F for a match. Now you are doing tuple 
>compares. The number of compares will increase as the length of F, but 
>overall it will be about 26140*3700/2 or 48,359,000 compares.

Kent, you're beginning to get through to me. Thanks for the details and the 
numbers.
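To convince myself of those numbers, here's a rough timing sketch comparing the two approaches on synthetic data (the vocabulary and list size are made up; the real script had about 26,140 words):

```python
import random
import time

# Synthetic word list standing in for the real one
vocab = ["word%d" % i for i in range(100)]
words = [random.choice(vocab) for _ in range(5000)]

# Quadratic version: list.count() rescans the whole list per word
t0 = time.time()
F = []
for w in words:
    k = words.count(w)          # O(n) scan each time -> O(n^2) total
    if (k, w) not in F:         # another scan, this time over F
        F.append((k, w))
t_quad = time.time() - t0

# Linear version: one pass with a dict
t0 = time.time()
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
t_lin = time.time() - t0

print("quadratic: %.3fs   linear: %.3fs" % (t_quad, t_lin))
```

Even at 5,000 words the difference is dramatic; at 26,140 it's the difference between 30-odd seconds and a quarter of a second.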

>Compare this to the dictionary version which just iterates L once, doing 
>a dictionary lookup and write for each word.
>
>The reason psyco doesn't make much difference is because all the time is 
>spent in list.count() which is already C code.

Ah. But how can I know what is in C code and what isn't? For example, in 
a previous post you say that L.extend(e) is in C, and imply that 
L.append(e) isn't, and that therefore L.extend(e) should be used.
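For what it's worth, the two methods also do different things, so they aren't simply interchangeable. A small sketch of the difference as I understand it:

```python
# append() adds its argument as a single element;
# extend() adds each item of an iterable individually.
a = [1, 2]
a.append([3, 4])   # a is now [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])   # b is now [1, 2, 3, 4]

# In CPython both are built-in list methods implemented in C,
# which is why psyco can't do much to speed them up.
print(a)
print(b)
```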

Well, back to Hetland's Beginning Python.

Dick  
