[Tutor] Please look at my wordFrequency.py
grouch at gmail.com
Tue Oct 11 08:58:47 CEST 2005
If it makes you feel any better, this isn't an easy problem to get 100%
right, traditionally. Heck, it might not even be possible. A series of
compromises might be the best you can hope for.
Some things to think about, however. Can you choose the characters you want,
instead of the (many, many) characters you don't want? This might simplify things.
What do most words look like in English? When is a hyphen part of a word,
and what about dashes? A dash in the middle of a sentence means something
different from a hyphen at the end of a line.
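
To make the "choose what you want" idea concrete, here is a rough sketch (not
Dick's script, and written for a current Python 3 rather than 2.4). It assumes
a word is a run of ASCII letters that may contain single internal hyphens or
apostrophes; the names WORD_RE, rejoin_line_breaks and find_words are made up
for illustration.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# single internal hyphens or apostrophes ("well-known", "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def rejoin_line_breaks(text):
    """Join a word that was split by a hyphen at the end of a line."""
    return re.sub(r"([A-Za-z])-\n\s*([A-Za-z])", r"\1\2", text)


def find_words(text):
    """Return the lowercased words found in text, in order."""
    return [w.lower() for w in WORD_RE.findall(rejoin_line_breaks(text))]


sample = "A well-known man -- don't ask who -- was remark-\nably quiet."
print(find_words(sample))
# ['a', 'well-known', 'man', "don't", 'ask', 'who', 'was', 'remarkably', 'quiet']
```

Dashes drop out on their own because they never sit between two letters. The
compromise: a genuinely hyphenated word that happens to break at the end of a
line ("well-" / "known") gets glued into "wellknown".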
As for other special or impossible cases: what is the difference between
dogs' and 'the dogs' when you are counting words? What about acronyms
written like S.P.E.C.T.R.E, or words that include a number like 1st? You can
add point 5 to the list. Which cases are more common, and which collide with
other rules? What is the smallest number of rules you can define to take the
biggest chunk out of the problem?
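
One possible way to keep the rule count small is a single pattern whose
alternatives are tried in order, most specific first. This is only a sketch
under my own assumptions (current Python 3; ABBREVIATIONS is a hypothetical,
deliberately short list; TOKEN_RE and tokenize are invented names), and it
still cannot tell "U.N." ending a sentence from "U.N." in the middle of one.

```python
import re

# Hypothetical list of single-word abbreviations that keep their period.
ABBREVIATIONS = {"viz.", "op.", "cit.", "mr.", "etc."}

# Alternatives are tried left to right, so the most specific rule comes first.
TOKEN_RE = re.compile(r"""
    (?P<acronym>(?:[a-z]\.){2,}[a-z]?)       # u.n.  e.g.  s.p.e.c.t.r.e
  | (?P<ordinal>\d+(?:st|nd|rd|th)\b)        # 1st, 22nd
  | (?P<word>[a-z]+(?:[-'][a-z]+)*\.?)       # ordinary word, optional period
""", re.VERBOSE)


def tokenize(text):
    """Split text into lowercased tokens, keeping periods only where wanted."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        token = match.group()
        # Strip a sentence-ending period unless the word is a known abbreviation.
        if (match.lastgroup == "word" and token.endswith(".")
                and token not in ABBREVIATIONS):
            token = token[:-1]
        tokens.append(token)
    return tokens


print(tokenize("Mr. Smith met the U.N. envoy on the 1st, e.g. at noon."))
# ['mr.', 'smith', 'met', 'the', 'u.n.', 'envoy', 'on', 'the', '1st', 'e.g.', 'at', 'noon']
```

Three rules and one small list cover point 5, the acronyms and the ordinals;
everything else falls through to the ordinary-word rule.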
That is enough of my random rambling.
A lot of it might magically fall into place once you try what Danny
suggested. He is a smart guy. Doing my best to have clearly defined,
self-contained functions that do a specific task usually helps to reduce a
problem to more manageable steps, and visualize what is happening more clearly.
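
As an illustration of what small, self-contained functions could look like
here, this is a minimal sketch. It is not the structure of Dick's script; the
function names are invented, and it assumes a current Python 3 (Counter is not
available in 2.4).

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text so 'The' and 'the' count as one word."""
    return text.lower()


def split_words(text):
    """Pick out letter runs, allowing internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text)


def count_words(words):
    """Map each word to the number of times it occurs."""
    return Counter(words)


def format_report(counts):
    """One 'word count' line per word, most frequent first, ties alphabetical."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join("%s %d" % (word, n) for word, n in ordered)


def word_frequency(text):
    """Run the whole pipeline on a string."""
    return format_report(count_words(split_words(normalize(text))))


print(word_frequency("The cat saw the dog. The dog ran."))
# the 3
# dog 2
# cat 1
# ran 1
# saw 1
```

Each step can be called on a short string at the interactive prompt, which
tends to be easier than scattering print statements through one long script.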
On 10/10/05, Dick Moores <rdm at rcblue.com> wrote:
> Script is at:
> Example text file for input:
> (142 kb)
> (from <http://www.gutenberg.org/etext/766>)
> Example output in file:
> (40 kb)
> (Execution took about 30 sec. with my computer.)
> I worked on this a LONG time for something I expected to just be an easy
> and possibly useful exercise. Three times I started completely over with
> a new approach. Had a lot of trouble removing exactly the characters I
> didn't want to appear in the output. Wished I knew how to debug other
> than just by using a lot of print statements.
> Specifically, I'm hoping for comments on or help with:
> 1) How to debug. I'm using v2.4, IDLE on Win XP.
> 2) I've tried to put in remarks that will help most anyone to understand
> what the code is doing. Have I succeeded?
> 3) No modularization. Couldn't see a reason to do so. Is there one or two?
> Specifically, what sections should become modules, if any?
> 4) Variable names. I gave up on making them self-explanatory. Instead, I
> put in some remarks near the top of the script (lines 6-10) that I hope
> do the job. Do they? In the code, does the "L to newL to L to newL to L"
> kind of thing remain puzzling?
> (lines 6-10)
> # meaning of short variable names:
> # S is a string
> # c is a character of a string
> # L, F are lists
> # e is an element of a list
> 5) Ideally, abbreviations that end in a period, such as U.N., e.g., i.e.,
> viz. op. cit., Mr. (Am. E.), etc., should not be stripped of their final
> periods (whereas other words that end a sentence SHOULD be stripped). I
> tried making and using a Python list of these, but it was too tough to
> write the code to use it. Any ideas? (I can live very easily without a
> solution to point 5, because if the output shows there are 10 "e.g"s,
> I'll just assume, and I think safely, that there actually are 10 "e.g."s.
> But I am curious, Pythonically.)
> Thanks very much in advance, tutors.
> Dick Moores
> rdm at rcblue.com