[Edu-sig] text compression as a learning tool
Seth David Schoen
schoen@loyalty.org
Mon, 28 May 2001 18:18:19 -0700
Timothy Wilson writes:
> Hi everyone,
>
> I'm trying to come up with some more comprehensive projects to supplement my
> Python class for next fall. I'm definitely going to do a little "clubhouse
> crypto" (thanks Kirby and Neal Stephenson for picquing my interest).
>
> Before undertaking my own study of text/data compression, I'd like to get
> the opinions of the list members here about whether you think this could be
> an appropriate subject for beginning to intermediate Python programmers.
>
> Any particular references to compression algorithms or other background
> materials?
I fondly remember inventing run-length encoding in junior high, so it
occurs to me that that's a good algorithm to work with.
RLE doesn't usually work much on text (since letters are rarely
repeated in natural languages in groups larger than two), but it's great
for graphics. Have you considered presenting the problem of
compressing pictures? If a picture is in an array, how can it be
written to a file in such a way that the file can be read back in and
the array reconstructed?
There is an "obvious" way with two nested loops, and if students can
understand that, maybe they can think of alternatives (of which the
simplest is RLE). Then the problem is which of these alternatives
typically produce shorter files than the "obvious" technique.
I think RLE is given as an exercise in _SICP_, but the book explains
how it works and then simply asks you to implement it. Is it possible
that, if you mention the problem, some student will just think of RLE
or some similar technique?
For text compression, you can look at digraph frequencies and trigraph
frequencies, or word frequencies. Many early commercial compression
applications from the age of the telegraph were not automated, but
were performed by humans using codebooks. There's a lot of literature
about this recently -- a few of the popular recent books on
cryptography discuss this. The codes were not _necessarily_ meant to
provide confidentiality, but they'd abbreviate common words. Then you
have to think about something like a quoting mechanism if you want to
retain the ability to express the abbreviations themselves. (Perhaps
that gets into a discussion of lossy versus lossless compression: a
system which can't reverse certain abbreviations -- for example, which
compresses several different synonyms into the same code -- will be
lossy because it can't restore the literal original text. But it
could still be useful.)
I think there was some material about those commercial codes in the very
interesting book _The Victorian Internet_ by Tom Standage. Compression
really got a huge boost from the telegraph because of how expensive it
could be to transit large quantities of information. (On the other
hand, the use of abbreviations is much older, and medieval manuscripts
are chock full of standardized abbreviations, including many which
have fallen into disuse now.)
So you could certainly approach text compression from the word level
by starting with word frequency counters. (Those are super-easy with
Python dictionaries!) And people can absolutely get useful
compression by these methods.
Unfortunately, a lot of recent highly-efficient techniques are
relatively difficult to understand. I think Huffman coding is often
thought of as the boundary between the elementary and the intricate.
You can probably _explain_ Huffman coding in a straightforward way,
but I doubt you can expect students to discover it for themselves.
Unfortunately, beyond Huffman coding, I've usually seen techniques
presented in a "cookbook" way -- you know, here is a formal
description of this algorithm, but "you are not expected to understand
this". This is true for presentations of LZ and LZW I've seen. (Is
it a bad idea to mention the issues around the LZW patent?)
There is a book about data compression called _The Data Compression
Book_ which has code for some modern algorithms in C, and I seem to
remember that it has a big bibliography.
http://dogma.net/markn/tdcb/tdcb.htm
You can also study the bzip algorithm:
http://www.bk.isy.liu.se/~svan/bwt/
The biggest open question for me is not whether this is a good area to
present to students -- I think it's a _great_ area to present to
students -- but to what extent you can expect them to develop their
own techniques. I can imagine lots of great exercises, but it's
equally easy to imagine a student coming back and saying "I have
absolutely no idea how to start here".
Important points:
- What different techniques work for different kinds of files?
- Is there a technique that works for every "kind of file"?
- What do we mean by a "kind of file"? How can a computer tell them
apart? How would a person tell them apart?
- Why are some techniques more effective in particular situations?
- Is there a limit to how effective a compression technique can be?
How good would the best imaginable technique be?
- Is there a compression technique which can compress any file?
- Is there such a thing as data that can't be compressed?
It might be worth keeping in mind that algorithmic information theory,
which deals with the implications of the final question, was
originally discovered by a (then) high school student (Gregory J.
Chaitin).
The connection between compression and Chaitin's work (which actually
leads to things like generalizations of Godel's theorems!) is very
exciting. See Chaitin's home page:
http://www.umcs.maine.edu/~chaitin/
--
Seth David Schoen <schoen@loyalty.org> | And do not say, I will study when I
Temp. http://www.loyalty.org/~schoen/ | have leisure; for perhaps you will
down: http://www.loyalty.org/ (CAF) | not have leisure. -- Pirke Avot 2:5