[Tutor] Please look at my wordFrequency.py
kent37 at tds.net
Tue Oct 11 19:37:44 CEST 2005
Kent Johnson wrote:
> Dick Moores wrote:
>> Kent Johnson wrote at 03:24 10/11/2005:
>>> Dick Moores wrote:
>>>> (Execution took about 30 sec. with my computer.)
>>> That's way too long
>> How long would you expect? I've already made some changes but haven't
>> seen the time change much.
> A couple of seconds at most, unless you are running it on some dog
> computer. It's just not that much text and you should be able to process
> it in a couple of passes at most.
OK, I couldn't resist. I took your program and ran it on my computer - it took about 38 seconds and got the same results as you. Then I made the changes I outlined, and a few other similar ones, and got it down to 34 seconds. Finally I made the change suggested by John Fouhy - accumulating the counts in a dict - and the time went down to 0.23 seconds.
>> Also, I'm puzzled that whether or not psyco is employed makes no
>> difference in the time. Can you explain why?
> My guess is it's because you have so many O(n^2) elements in the code.
> You have to get your algorithm to be O(n).
In particular this code:
for word in L:
    k = L.count(word)
    if (k, word) not in F:
L.count() has to scan through the entire list (L) looking for a match with each word. So for each word, you are making 26140 string compares. The total number of compares is 26140*26140 or 683,299,600. That's a lot!
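To see that cost concretely, here is a small timing sketch. The word list and names here are toy stand-ins, not Dick's original data; the point is only that calling list.count() once per element does a full scan each time.

```python
import timeit

# Toy list: 4 distinct words repeated 1500 times = 6000 words total.
words = ["alpha", "beta", "gamma", "delta"] * 1500

def count_all():
    # list.count(w) scans the entire list for every word, so the total
    # work is len(words) * len(words) string compares: O(n^2).
    return [words.count(w) for w in words]

t = timeit.timeit(count_all, number=1)
print("%.3f seconds for %d words" % (t, len(words)))
```

Even at 6000 words that is 36 million compares; at 26140 words it is the 683 million compares counted above.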
Then, for each word, you scan F for a match. Now you are doing tuple compares. The number of compares grows with the length of F, but on average it will be about 26140*3700/2, or 48,359,000 compares.
Compare this to the dictionary version which just iterates L once, doing a dictionary lookup and write for each word.
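A minimal sketch of that dict version (assuming L is a plain list of word strings, as in the original program):

```python
def word_counts(L):
    # One pass over L: each word costs a single dict lookup and write,
    # so the whole thing is O(n) instead of O(n^2).
    counts = {}
    for word in L:
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = ["the", "cat", "sat", "on", "the", "mat", "the"]
print(word_counts(sample))
```

From the counts dict you can then build the (count, word) pairs and sort them once at the end, rather than checking membership inside the loop.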
The reason psyco doesn't make much difference is that nearly all the time is spent inside list.count(), which is already C code.