general thoughts

Wed Sep 25 11:47:40 EDT 2002

I've had the same problem from time to time in my programming
explorations of the last few years. One of my interests is language,
which you mentioned as your field of study. I played for a long time
with some fun language stuff - no real scientific value, but
interesting, fun, not difficult to understand, and let me explore lots
of ways of doing things.

I decided to work on the idea that given enough time, monkeys could
randomly reproduce Shakespeare. I thought it would be fun to simulate
the monkeys statistically. First I wrote a program to read in a text
file and calculate the percentage frequency of each character, then
use a random number generator to generate text from those frequencies.
The larger the text, the 'more correct' the distribution of letters
(28 characters: 26 letters, space, and apostrophe), so it would read
in e-books by Poe, Dostoevsky, etc. (Shakespeare too, after removing
the plays' character-cue lines like 'HAMLET: blah blah' so the cues
wouldn't warp the frequencies).
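
The first-order step described above could be sketched in Python
roughly like this. It is only an illustration, not the original QBasic
program: the names ALPHABET, char_frequencies and generate_first_order
are made up for the example, and it assumes the text contains at least
one character from the 28-character set.

```python
import random
from collections import Counter

# Assumed alphabet: 26 letters, space, and apostrophe (28 characters).
ALPHABET = "abcdefghijklmnopqrstuvwxyz '"

def char_frequencies(text):
    """Return {char: fraction of the text} over the 28-character alphabet."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values())  # assumes total > 0
    return {c: n / total for c, n in counts.items()}

def generate_first_order(freqs, length, rng=random):
    """Draw `length` characters independently, weighted by frequency."""
    chars = list(freqs)
    weights = [freqs[c] for c in chars]
    return "".join(rng.choices(chars, weights=weights, k=length))
```

Calling generate_first_order(char_frequencies(some_text), 200) would
then print 200 'monkey-typed' characters with the same letter
distribution as some_text.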

Then I went to a second-order analysis, calculating frequencies for
each 2-letter group, 'aa, ab, ac, ..., th, ... zz' and generating from
that. That produced much more valid output: more English words,
better spacing and such. With 3-letter groups (I called it third
order) it produces mostly words, plus occasional valid short phrases.
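
One way to do the higher-order versions is sketched below. The post
does not say exactly how text was generated from the group
frequencies, so this assumes the usual approach: pick each new
character according to how often it followed the preceding order-1
characters in the source text. build_model and generate are
hypothetical names, and the text is assumed to be longer than the
order.

```python
import random
from collections import Counter, defaultdict

def build_model(text, order):
    """Map each (order-1)-character context to a Counter of next characters."""
    model = defaultdict(Counter)
    k = order - 1
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model

def generate(model, order, length, rng=random):
    """Generate `length` characters, each chosen from what followed its context."""
    k = order - 1
    out = rng.choice(list(model))  # start from a random context seen in the text
    while len(out) < length:
        context = out[len(out) - k:]
        followers = model.get(context)
        if not followers:
            # Dead end: this context only occurred at the very end of the text,
            # so fall back to the followers of a random known context.
            followers = model[rng.choice(list(model))]
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]
```

With order=2 this is the 2-letter-group version, order=3 the third
order, and so on. The table grows quickly with the order, which is
where speed starts to matter.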

Of course the interesting thing is trying different authors because
the output is recognizable as a particular author's. Similarly other
languages (German, French, etc.) are recognizable in first order, and
in higher orders produce just as interesting output as the English -
language is irrelevant, because words are built on letter frequencies.

Of course, at higher order you get sentences and all sorts of
interesting things out. The statistics can be used for
author-identification, encryption, text compression, and all sorts of
things. There is a measure of entropy in language (not real sure what
it means) that can be taken and compared as well. Really neat project
and fun programming, because it produces real results and encourages
repeated optimization tests. After I had the basic operations going I
had fun building a menu application around it with all sorts of tools,
results display, automation, etc. Great stuff. This was all in QBasic,
but I'm eventually going to move it to Python. Also I have the
analysis engine written in C that I will probably call, as it is much
faster than QBasic or Python for the actual text processing. But
Python is fast enough for up through 4th order.
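
On the entropy measure mentioned above: one standard reading is
Shannon entropy, the average number of bits needed per character given
the character frequencies. A minimal first-order sketch follows
(entropy_bits_per_char is a made-up name, and this may not be the
exact measure the post has in mind):

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    """Shannon entropy H = -sum(p * log2(p)) over the character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text that uses all 28 characters equally often would score
log2(28), about 4.8 bits per character. Real text scores lower because
some letters are much more common than others, and that gap is what
makes compression possible.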

I bet you'd like something like this. It will probably give you ideas
for other projects as well.

Carl