[Tutor] Please look at my wordFrequency.py

Dick Moores rdm at rcblue.com
Tue Oct 11 10:29:09 CEST 2005


Thank you, Andrew, for your wise and thoughtful comments.

Andrew P wrote at 23:58 10/10/2005:
>If it makes you feel any better, this isn't an easy problem to get 100% 
>right, traditionally.  Heck, it might not even be possible.  A series of 
>compromises might be the best you can hope for.

Yes, I gradually began to realize that. And after hearing from you, even 
more so.

>Some things to think about, however.  Can you choose the characters you 
>want, instead of the (many, many) characters you don't want?  This might 
>simplify matters.

I tried it both ways, and settled on specifying the characters I don't want.

chars = ".,!?;:&*'\"=\\+-><][/#@$%)("
and then
newWord = word.strip(chars)

This way I could first get rid of all those non-alphanumeric characters 
on the outsides of words, and leave alone the ones inside words, such as 
hyphens (man-eating), periods (S.P.E.C.T.R.E), and apostrophes (dog's). 
This seemed simpler and easier to visualize.
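
As a minimal sketch of what that does (the sample words and the helper name strip_outer are made up for illustration, and are not taken from wordFrequency.py):

```python
chars = ".,!?;:&*'\"=\\+-><][/#@$%)("

def strip_outer(word):
    """Remove unwanted characters from both ends of a word only."""
    return word.strip(chars)

# Characters inside a word (hyphen, period, apostrophe) are left alone,
# because str.strip() stops at the first character not in chars.
samples = ["(man-eating)", "S.P.E.C.T.R.E,", "\"dog's\"", "--well!"]
cleaned = [strip_outer(w) for w in samples]
```

Here cleaned ends up holding man-eating, S.P.E.C.T.R.E, dog's and well.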

>What do most words look like in English?

Yes, I gave that a lot of thought.

>  When is a hyphen part of a word,
>  what about dashes?  A dash in the middle of a sentence means something 
> different than one at the end of line.

"A dash in the middle gets replaced by a space--creating two 
words."  Thus "space--creating" becomes "space creating" (two words).

A dash at the end of a line is also removed.
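
A rough sketch of that dash rule, assuming a dash is always typed as two hyphens, as in "space--creating" above. The function name split_on_dashes is hypothetical, and this is not the code in wordFrequency.py:

```python
def split_on_dashes(line):
    """Replace each double-hyphen dash with a space, then split into words."""
    return line.replace("--", " ").split()

# A dash in the middle yields two words; a dash at the end of a line vanishes.
middle = split_on_dashes("replaced by a space--creating two words")
at_end = split_on_dashes("the man-eating tiger--")
```

Single hyphens are not touched by this step, so man-eating survives as one word.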

Hyphens are a different problem, and one I didn't solve completely. A 
hyphen in the middle of a word is left in place ("man-eating tiger" stays 
as it is); all other hyphens are removed. I decided not to try to handle 
the case where a hyphen at the end of a line splits a long word, with the 
first syllable or two on that line and the remaining syllables at the 
beginning of the next line. 
Impressionistically, it seemed to me that most text in digital form 
doesn't split words this way. Take, for example, articles on newspaper 
websites. In their tree-wasting form, with narrow columns, the reverse is 
true: many word-splitting hyphens at the ends of lines.

>As far as other special/impossible cases...what is the difference 
>between dogs' and 'the dogs' when you are counting words?

Dogs' stays as dogs'; 'the dogs' becomes the two words the and dogs'. 
Although the text in my example, David Copperfield, is British English 
(BE), I am actually aiming at American English (AE), where most quotes 
are double. Thus dogs' and "the dogs". For these, dogs' stays as the word 
dogs', and "the dogs" becomes the two words the and dogs. But now I'm 
not sure this is correct. I'd rather the 
possessive of dogs be treated as just another instance of dogs. But then 
possibly the plural dogs should be treated as an instance of dog. I 
realize now, thanks to you, that I didn't think this through sufficiently 
at all. Should knives be an instance of knife? And so on.

>   What about acronyms written like S.P.E.C.T.R.E

S.P.E.C.T.R.E --> s.p.e.c.t.r.e
I can live with the lower case.

>or words that include a number like 1st?

1st and 2nd remain as 1st and 2nd.

>  You can add point 5 to the list.  Which are more common cases, which 
> collide with other rules.  What's the minimum amount of rules you can 
> define to take the maximum chunk out of the problem?

I believe that's point 6. Yes, I've thought about that. But probably not 
well enough.

>That is enough of my random rambling.
>
>A lot of it might magically fall into place once you try what Danny 
>suggested.  He is a smart guy.  Doing my best to have clearly defined, 
>self-contained functions that do a specific task usually helps to reduce 
>a problem to more manageable steps, and visualize what is happening more 
>clearly.

Yes, I'm certainly going to follow Danny's advice.
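
To illustrate the kind of small, single-purpose functions Andrew describes, here is a hedged sketch. All the function names are hypothetical, it is written for a current Python 3 interpreter rather than v2.4, and it is not the actual wordFrequency.py:

```python
CHARS = ".,!?;:&*'\"=\\+-><][/#@$%)("

def split_words(text):
    """Turn double-hyphen dashes into spaces and split on whitespace."""
    return text.replace("--", " ").split()

def clean_word(word):
    """Strip unwanted characters from the ends and lower-case the rest."""
    return word.strip(CHARS).lower()

def count_words(text):
    """Return a dict mapping each cleaned word to its number of occurrences."""
    counts = {}
    for raw in split_words(text):
        word = clean_word(raw)
        if word:  # skip tokens that were nothing but punctuation
            counts[word] = counts.get(word, 0) + 1
    return counts

def most_frequent(counts):
    """Sort by descending count, then alphabetically for ties."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

counts = count_words('The man-eating tiger--the "tiger"--ate. S.P.E.C.T.R.E!')
ranking = most_frequent(counts)
```

Each function can be tried on its own at the interactive prompt, which also makes it easier to see which step is mangling a particular word.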

Dick

>On 10/10/05, Dick Moores <rdm at rcblue.com> wrote:
>Script is at:
><http://www.rcblue.com/Python/wordFrequency/wordFrequency.txt>
>
>Example text file for input:
><http://www.rcblue.com/Python/wordFrequency/first3000linesOfDavidCopperfield.txt>
>(142 kb)
>(from <http://www.gutenberg.org/etext/766>)
>
>Example output in file:
><http://www.rcblue.com/Python/wordFrequency/outputToFile.txt>
>(40 kb)
>
>(Execution took about 30 sec. with my computer.)
>
>I worked on this a LONG time for something I expected to just be an easy
>and possibly useful exercise. Three times I started completely over with
>a new approach. Had a lot of trouble removing exactly the characters I
>didn't want to appear in the output. Wished I knew how to debug other
>than just by using a lot of print statements.
>
>Specifically, I'm hoping for comments on or help with:
>1) How to debug. I'm using v2.4, IDLE on Win XP.
>2) I've tried to put in remarks that will help most anyone to understand
>what the code is doing. Have I succeeded?
>3) No modularization. Couldn't see a reason to do so. Is there one or two?
>Specifically, what sections should become modules, if any?
>4) Variable names. I gave up on making them self-explanatory. Instead, I
>put in some remarks near the top of the script (lines 6-10) that I hope
>do the job. Do they? In the code, does the "L to newL to L to newL to L"
>kind of thing remain puzzling?
>
>(lines 6-10)
># meaning of short variable names:
>#   S is a string
>#   c is a character of a string
>#   L, F are lists
>#   e is an element of a list
>
>5) Ideally, abbreviations that end in a period, such as U.N., e.g., i.e.,
>viz. op. cit., Mr. (Am. E.), etc., should not be stripped of their final
>periods (whereas other words that end a sentence SHOULD be stripped). I
>tried making and using a Python list of these, but it was too tough to
>write the code to use it. Any ideas? (I can live very easily without a
>solution to point 5, because if the output shows there are 10 "e.g"s,
>I'll just assume, and I think safely, that there actually are 10 "e.g."s.
>But I am curious, Pythonically.)
>
>Thanks very much in advance, tutors.
>
>Dick Moores
>rdm at rcblue.com
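
On point 5 of the quoted message, one possible approach, offered only as a sketch: strip the word as usual, then put the final period back if the result plus a period is in a set of known abbreviations and the original token really had that period. The set contents and the name clean_word are assumptions for illustration, not code from wordFrequency.py:

```python
CHARS = ".,!?;:&*'\"=\\+-><][/#@$%)("

# Hypothetical, deliberately short set; extend it as needed.
ABBREVIATIONS = {"u.n.", "e.g.", "i.e.", "viz.", "op.", "cit.", "mr.", "etc."}

def clean_word(word):
    """Strip outer punctuation, but keep the final period of a known abbreviation."""
    lowered = word.lower()
    core = lowered.strip(CHARS)
    with_period = core + "."
    # Restore the period only if the token really had one right after the core.
    if with_period in ABBREVIATIONS and lowered.lstrip(CHARS).startswith(with_period):
        return with_period
    return core

tokens = ["e.g.,", "(i.e.", "Mr.", "end.", "U.N.", "mr"]
cleaned = [clean_word(t) for t in tokens]
```

An ordinary word at the end of a sentence, such as end., still loses its period, while e.g., and Mr. keep theirs.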



