[Edu-sig] How do we tell truths that might hurt

Terry Hancock hancock at anansispaceworks.com
Sat Apr 24 11:15:38 EDT 2004


On Friday 23 April 2004 05:28 am, Arthur wrote:
> Terry - 
> > 
> > import re, string
> > 
> > illegal_re = re.compile(r'[^a-zA-Z\s]+')
> 
> Huh?

Well, it does what the comment you *snipped out* said it
does!  Want to figure it out yourself? This is a trivial
example of a regex -- see http://www.python.org/doc/2.3.3/lib/re-syntax.html.
It should take you less than 10 minutes to get this one.

That's why I haven't bothered to break it up symbolically,
the way you can do in Python:

any_letter = r'a-zA-Z'
whitespace = r'\s'
not_in    = r'[^%s]'
atleast_one = r'%s+'

illegal_re = re.compile(atleast_one % (not_in % (any_letter + whitespace)))

In other words, any run of at least one character not in the set
of any letter or whitespace is matched as "illegal".  Later on,
we use this to replace them with spaces.
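
To see what the pattern actually matches, here is a minimal sketch
run against a made-up sample string (the sample text is hypothetical,
not from the original program):

```python
import re

# The same symbolic pieces, assembled into one pattern string.
any_letter = r'a-zA-Z'
whitespace = r'\s'
not_in = r'[^%s]'
atleast_one = r'%s+'

pattern = atleast_one % (not_in % (any_letter + whitespace))
illegal_re = re.compile(pattern)

# Hypothetical sample text, for illustration only.
sample = "Hello, world! It's 2004."

# Every run of non-letter, non-whitespace characters becomes one space.
cleaned = illegal_re.sub(' ', sample)
print(repr(pattern))
print(cleaned.split())
```

Note that the apostrophe splits "It's" into two pieces, which is one
reason the later step drops very short words.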

> > words = [w for w in illegal_re.sub(' ', open('myfile',
> > 'r').read()).lower().split() if len(w)>2]
> 
> Whoa?

This kind of functional approach is very compact and is easy
to debug in the interactive shell.  It's sort of like writing
an equation with the variables expanded.  If I were doing something
really tricky, I'd probably expand it, but it's just an example. ;-)

You could also write it like this:

file_where_my_text_is_located    = open('myfile', 'r')
string_from_the_file             = file_where_my_text_is_located.read()
lowercase_string_from_file       = string_from_the_file.lower()
string_containing_only_the_words = illegal_re.sub(' ', lowercase_string_from_file)
all_the_words_in_sequence        = string_containing_only_the_words.split()
all_the_words_3_or_more_chars    = [w for w in all_the_words_in_sequence if len(w)>2]

words = all_the_words_3_or_more_chars

Are we happy, now, Art?  :-P

:-D

> > word_freq = {}
> > for word in words:
> >     word_freq[word] = word_freq.get(word, 0) + 1

Well-known algorithm (maybe "trick" is the right word)
to extract frequency data from a list, taking advantage
of the properties of dictionaries.  When a word is
encountered you increment that word's count by 1.  If
it's not there yet, get() returns the default of 0, so it
gets put there with a frequency of 1.
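
A small sketch of the counting idiom on a made-up word list (the list
is hypothetical).  The comparison with collections.Counter is an
addition for readers on a newer Python, where it should do the same
job; it was not part of the original program.

```python
from collections import Counter

# Hypothetical input, standing in for the words read from the file.
words = ["the", "cat", "and", "the", "hat", "and", "the"]

word_freq = {}
for word in words:
    # get() returns 0 for a word not seen yet, so its first count is 1.
    word_freq[word] = word_freq.get(word, 0) + 1

print(word_freq)

# The standard-library Counter arrives at the same counts.
same_counts = dict(Counter(words)) == word_freq
print(same_counts)
```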

> > word_freq = word_freq.items()
> > word_freq.sort(lambda a,b: cmp(b[1],a[1]))
> 
> Hmmm?

Oh, come on, that's so obvious.  This is how you sort a
dictionary by value.  You make it a list of tuples then sort that.
The only remotely difficult thing is that you have to
sort on the *second* element of the tuple (the count, largest
first), which is not the default.  Otherwise the listing would've
been alphabetical.  (Which might be useful for some other purpose).
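
The comparison-function form of sort() is a Python 2 feature.  As a
hedged sketch, this is roughly how the same ordering could be written
in Python 3 (an assumption about the reader's interpreter), using a
key function instead; the sample counts are made up:

```python
# Hypothetical counts, for illustration only.
word_freq = {"the": 3, "cat": 1, "and": 2, "hat": 1}

# Turn the dictionary into (word, count) tuples, then sort on the
# second element of each tuple, largest count first.
by_count = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

# Without a key, tuples sort on the first element: alphabetical by word.
by_word = sorted(word_freq.items())

print(by_count)
print(by_word)
```

The sort is stable, so words with equal counts keep the order in
which they were first seen.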

> > for word, freq in word_freq[:500]:
> >     print "%30s  %10d" % (word, freq)
> 
> ??

What didn't you understand?  I'm printing the first 500 elements
of the above list so they're fairly easy to read.  Pipe it through
'less' or something if you want to page through it.  Sheesh, I
could've made it print in columns if I wanted to be fancy.
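
For anyone puzzled by the format string itself, a minimal sketch of
what "%30s  %10d" does, written for Python 3 (so print is a function).
The helper name format_row and the sample data are made up for
illustration:

```python
def format_row(word, freq):
    # %30s right-aligns the word in a 30-character field;
    # %10d right-aligns the integer count in a 10-character field.
    return "%30s  %10d" % (word, freq)

# Hypothetical, already-sorted (word, count) pairs.
word_freq = [("the", 3), ("and", 2), ("cat", 1), ("hat", 1)]

# The slice keeps at most the first 2 entries, the same way [:500]
# keeps at most the first 500.
for word, freq in word_freq[:2]:
    print(format_row(word, freq))
```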

You do realize I typed this program straight into my
mail client off the top of my head -- I never saved
the original.  I did paste and test in Python to make
sure it would work, though.  I wasn't exactly trying
to meet style standards. ;-)

> > Now that wasn't terribly difficult, was it? ;-)
> 
> Piece of cake!

I'd say it was about as easy as, if not easier than, writing a
function kepler() to solve the (transcendental) inverse of
Kepler's equation (solves M = E - e * sin(E) for 'E', i.e.
E = kepler(M)). Which was the first real program I ever
wrote in FORTRAN (at least the first one that ever compiled).

For example, I don't think I could do that off the top of
my head, like I did with this puzzle. I'd have to go look
up "Newton's Method" or whatever as a standard algorithm
(sorry to say, I have forgotten it).  Both programs would
be about the same length though -- no more than 20 LoC.
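
For the curious, a rough sketch of what such a function might look
like in Python, using Newton's method on f(E) = E - e*sin(E) - M with
derivative f'(E) = 1 - e*cos(E).  This is not the FORTRAN program
mentioned above.  Passing the eccentricity e as a second argument, the
starting guess E = M, and the tolerance and iteration limit are all
assumptions made for the example; a very eccentric orbit may need a
better starting guess.

```python
import math

def kepler(M, e, tol=1e-12, max_iter=50):
    """Solve M = E - e*sin(E) for E (radians), assuming 0 <= e < 1."""
    E = M  # starting guess (an assumption; fine for modest e)
    for _ in range(max_iter):
        # Newton step: E_new = E - f(E) / f'(E)
        delta = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            return E
    raise RuntimeError("Newton iteration did not converge")

# Check the answer by plugging it back into Kepler's equation.
E = kepler(1.0, 0.3)
residual = E - 0.3 * math.sin(E) - 1.0
print(E, residual)
```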

I don't think any knowledge domain is particularly more or
less difficult -- it's just a question of what you're
interested in learning.  Apparently you don't find the
pattern-recognition capabilities of regexes very interesting,
because they don't allow you to do things that interest
you. So you view them as opaque, like a student who skips
the equations when reading their textbook.

Some people feel that way about geometry.

I find both pretty interesting, myself. ;-)

Cheers,
Terry

--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks  http://www.anansispaceworks.com



