[Edu-sig] How do we tell truths that might hurt

Terry Hancock hancock at anansispaceworks.com
Fri Apr 23 00:18:00 EDT 2004

On Thursday 22 April 2004 07:25 am, Anna Ravenscroft wrote:
> On Wednesday 21 April 2004 19:36, ajsiegel at optonline.net wrote:
> > Yes, a Literature student might be enticed to know that programming could
> > be made useful in finding semantic patterns in the works of Joyce. The
> > problem is that it's hard.  
> Yep - it's hard. And continuing to present programming that way is going to 
> *keep* people away.  
> So you start with something easier. 
> Do a simple word count program - find out the number of occurrences of "word" 
> in a particular file - a fairly simple program that shows the prospective 
> learner that they can use it for things *they're* interested in, without 
> having to be a wizard! Once they can do that, show them how they can turn it 
> into a concordance program. 
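
That beginner exercise really is only a few lines. A minimal sketch
(the filename and target word are placeholders; punctuation stuck to
a word is ignored here, which is a fine lesson-two topic):

```python
def count_word(filename, target):
    # Count how often one word occurs in a file, case-insensitively.
    # Words with punctuation attached ("word,") won't match -- handling
    # that is left for the next lesson.
    text = open(filename).read().lower()
    return text.split().count(target.lower())
```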

Just for the heck of it one day I did a little interactive
programming with Python to find the 500 most common words (in
order) that occurred in a combination of three highly unrelated
Project Gutenberg texts.  I was trying to create a "throw-out"
list for the forum software I'm writing, which converts subject
lines like "A Thousand and One Arabian Nights" into mnemonic
legal ids such as "1001_arabian_nights".

The idea is to make URLs that aren't too hard to remember instead
of "topic_4348def3203ea339" or something equally nonsensical.
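
The conversion itself is basically a "slug" function.  A rough
sketch (the throw-out list here is made up for illustration -- the
real one comes from the most-common-words cull -- and the trick of
turning "a thousand and one" into "1001" is left out entirely):

```python
import re

# Illustrative throw-out list only; the real list comes from
# culling the most common words out of several large texts.
THROW_OUT = set(['a', 'an', 'and', 'the', 'of'])

def make_slug(subject):
    # Lowercase, replace runs of non-alphanumerics with spaces,
    # drop throw-out words, and join the rest with underscores.
    words = re.sub(r'[^a-zA-Z0-9]+', ' ', subject).lower().split()
    return '_'.join(w for w in words if w not in THROW_OUT)
```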

An affectation, perhaps.  But I wanted to do it that way.

The task of finding the common words wasn't really too difficult.

First you read the whole file into Python (Python is *so* awesome)
then you replace all the punctuation characters and stuff with
whitespace, then you tokenize it, then you convert it all to
lower case.  Then you build a map:

# this only does one file -- but you could collect three files
# as the source material.

import re, string

illegal_re = re.compile(r'[^a-zA-Z\s]+')
# Match any set of one or more characters not a letter or whitespace

text = illegal_re.sub(' ', open('myfile', 'r').read()).lower()
words = [w for w in text.split() if len(w) > 2]
# The list comp ditches 1- and 2-character words, which I don't care
# about because I can safely toss ALL of those before culling the
# most common words.

word_freq = {}
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1

word_freq = word_freq.items()
word_freq.sort(lambda a,b: cmp(b[1],a[1]))

for word, freq in word_freq[:500]:
    print "%30s  %10d" % (word, freq)

Now that wasn't terribly difficult, was it? ;-)
This is the sort of thing that can easily be
done as a non-mathematical programming problem.
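
The comment at the top of the snippet mentions collecting three
files; that generalization is just a loop over filenames.  A sketch,
wrapped up as a function (the name and signature are my own
packaging, not what I actually typed):

```python
import re

def common_words(filenames, n=500):
    # Tally word frequencies across several source texts and
    # return the n most common words, most frequent first.
    illegal_re = re.compile(r'[^a-zA-Z\s]+')
    word_freq = {}
    for fn in filenames:
        text = illegal_re.sub(' ', open(fn).read()).lower()
        for w in text.split():
            if len(w) > 2:  # toss 1- and 2-character words, as before
                word_freq[w] = word_freq.get(w, 0) + 1
    pairs = sorted(word_freq.items(), key=lambda p: p[1], reverse=True)
    return [w for w, f in pairs[:n]]
```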

I didn't actually do this as a program, originally.
I was just tinkering in the interpreter, using
Python as an interactive data language.


Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com

More information about the Edu-sig mailing list