[Tutor] Please look at my wordFrequency.py

Tue Oct 11 18:06:49 CEST 2005

Kent Johnson wrote at 03:24 10/11/2005:
>Dick Moores wrote:
> > (Execution took about 30 sec. with my computer.)
>
>That's way too long

How long would you expect? I've already made some changes but haven't 
seen the time change much.

> >
> > Specifically, I'm hoping for comments on or help with:
> > 2) I've tried to put in remarks that will help most anyone to understand
> > what the code is doing. Have I succeeded?
>
>Yes, I think so

Good.

> > 3) No modularization. Couldn't see a reason to do so. Is there one or two?
> > Specifically, what sections should become modules, if any?
>
>As Danny says, breaking it up into functions makes it easier to 
>understand and test

OK.

> > 4) Variable names. I gave up on making them self-explanatory. Instead, I
> > put in some remarks near the top of the script (lines 6-10) that I hope
> > do the job. Do they? In the code, does the "L to newL to L to newL to L"
> > kind of thing remain puzzling?
>
>Some of your variables seem unnecessary. For example
>     newWord = word.strip(chars)
>     word = newWord
>could be just
>     word = word.strip(chars)

Yes, I'll have to get that kind of thing straightened out. In my mind, 
first of all.

> > 5) Ideally, abbreviations that end in a period, such as U.N., e.g., i.e.,
> > viz., op. cit., Mr. (Am. E.), etc., should not be stripped of their final
> > periods (whereas other words that end a sentence SHOULD be stripped). I
> > tried making and using a Python list of these, but it was too tough to
> > write the code to use it. Any ideas?
>
>You should be able to do this with regular expressions or searching in 
>the word. You want to test for a word that ends with a period but 
>doesn't include any periods. Something like
>if word.endswith('.') and '.' not in word[:-1]:
>   word = word[:-1]

Nice! That takes care of U.N., e.g., i.e., but not viz., op. cit., or Mr.
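
One possible way to cover the remaining cases (viz., op. cit., Mr.) is to combine Kent's test with a lookup in a set of known abbreviations. This is only a minimal sketch: the names ABBREVIATIONS and strip_final_period are hypothetical, and the set holds just the examples mentioned in this thread.

```python
# Hypothetical set of abbreviations that should keep their final period.
ABBREVIATIONS = {"viz.", "op.", "cit.", "Mr.", "etc."}

def strip_final_period(word):
    """Drop a sentence-ending period, but keep it on abbreviations."""
    if word in ABBREVIATIONS:
        return word
    # Kent's test: a trailing period with no other period in the word.
    if word.endswith('.') and '.' not in word[:-1]:
        return word[:-1]
    return word

words = ["U.N.", "e.g.", "viz.", "Mr.", "cit.", "end."]
print([strip_final_period(w) for w in words])
# -> ['U.N.', 'e.g.', 'viz.', 'Mr.', 'cit.', 'end']
```

The lookup is case-sensitive, and an abbreviation that happens to end a sentence (such as "etc.") keeps its period either way.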

>Other notes:
>Use re.split() to do all the splits at once. Something like
>   L = re.split(r'\s+|--|/', textAsString)

Don't understand this yet. I'll work on it.
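
A small sketch of what that pattern does, using a made-up sample string. In the regular expression, `\s+` matches a run of whitespace and `|` separates alternatives, so the text is split wherever whitespace, `--`, or `/` occurs.

```python
import re

# Made-up sample text containing all three kinds of separator.
text_as_string = "the cat--and/or  dog\nsat"

# Split on runs of whitespace, on "--", or on "/", all in one call.
words = re.split(r'\s+|--|/', text_as_string)
print(words)
# -> ['the', 'cat', 'and', 'or', 'dog', 'sat']

# Adjacent separators leave empty strings behind, which is why the
# empty elements still have to be removed afterwards.
with_empties = re.split(r'\s+|--|/', "a -- b")
print(with_empties)
# -> ['a', '', '', 'b']
```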

>#remove empty elements in L
>while "" in L:
>     L.remove("")
>The above iterates L twice for each empty word!

I don't get the twice. Could you spell it out, please?
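
One reading of "twice": the test `"" in L` searches the list from the start until it finds an empty string, and then `L.remove("")` searches from the start again to find that same element before deleting it. The sketch below spells the loop out and counts the searches; the function name remove_empties_slow and the sample list are made up for illustration.

```python
def remove_empties_slow(L):
    """Same effect as:  while "" in L: L.remove("")
    Returns how many times the list was searched."""
    scans = 0
    while True:
        scans += 1          # search 1: `"" in L` looks from the start
        if "" not in L:
            break
        scans += 1          # search 2: remove() looks from the start again
        L.remove("")
    return scans

L = ["a", "", "b", "", "c"]
scans = remove_empties_slow(L)
print(L)      # -> ['a', 'b', 'c']
print(scans)  # -> 5: two searches per empty string, plus one final check
```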

>The remove() calls are expensive too because the remaining elements of L 
>must be shifted down. Do the whole thing in one pass over L with
>     L = [ w for w in L if w ]
>You only need to remove empty elements once, when the rest of the 
>processing is done.

Got it. But using this doesn't seem to make much difference in the time.
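
To see how much this step contributes on its own, the two approaches can be timed in isolation with the standard timeit module. This is a rough sketch on made-up data; the function names and the sample list are assumptions, and the numbers will depend on the machine and on how many empty strings the real text produces.

```python
import timeit

def remove_loop(words):
    """The original approach: repeated membership test plus remove()."""
    words = list(words)          # work on a copy so each run starts fresh
    while "" in words:
        words.remove("")
    return words

def comprehension(words):
    """Kent's suggestion: build the filtered list in one pass."""
    return [w for w in words if w]

sample = ["word", ""] * 1000     # made-up test data: 1000 empty strings

t_loop = timeit.timeit(lambda: remove_loop(sample), number=5)
t_comp = timeit.timeit(lambda: comprehension(sample), number=5)
print("loop: %.4f s, comprehension: %.4f s" % (t_loop, t_comp))
```

If both figures are small next to the 30 seconds for the whole script, the bulk of the time is being spent somewhere else.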

Also, I'm puzzled that whether or not psyco is employed makes no 
difference in the time. Can you explain why?

>for e in saveRemovedForLaterL:
>     L.append(e)
>could be
>L.extend(saveRemovedForLaterL)

Are you recommending L.extend(saveRemovedForLaterL), or is it just another way to do it?
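
For comparison, a minimal sketch with made-up list contents. Both forms produce the same list; extend() takes the whole sequence and appends each of its elements in a single call, so it should be given the list itself and not one element.

```python
save_removed_for_later = ["e.g.", "i.e."]   # made-up contents

# Appending one element at a time in a loop.
looped = ["a", "b"]
for e in save_removed_for_later:
    looped.append(e)

# The same result in one call: pass the list, not a single element.
extended = ["a", "b"]
extended.extend(save_removed_for_later)

print(looped == extended)   # -> True

# Passing a single string instead would add its characters one by one.
chars = []
chars.extend("e.g.")
print(chars)                # -> ['e', '.', 'g', '.']
```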

Thanks very much for your help, Kent.

Dick