Hi everyone, I'm trying to come up with some more comprehensive projects to supplement my Python class for next fall. I'm definitely going to do a little "clubhouse crypto" (thanks Kirby and Neal Stephenson for picquing my interest). Before undertaking my own study of text/data compression, I'd like to get the opinions of the list members here about whether you think this could be an appropriate subject for beginning to intermediate Python programmers. Any particular references to compression algorithms or other background materials? Other thoughts? -Tim -- Tim Wilson | Visit Sibley online: | Check out: Henry Sibley HS | http://www.isd197.org | http://www.zope.org W. St. Paul, MN | | http://slashdot.org wilson@visi.com | <dtml-var pithy_quote> | http://linux.com
Timothy Wilson writes:
Hi everyone,
I'm trying to come up with some more comprehensive projects to supplement my Python class for next fall. I'm definitely going to do a little "clubhouse crypto" (thanks Kirby and Neal Stephenson for picquing my interest).
Before undertaking my own study of text/data compression, I'd like to get the opinions of the list members here about whether you think this could be an appropriate subject for beginning to intermediate Python programmers.
Any particular references to compression algorithms or other background materials?
I fondly remember inventing run-length encoding in junior high, so it occurs to me that that's a good algorithm to work with. RLE doesn't usually work much on text (since letters are rarely repeated in natural languages in groups larger than two), but it's great for graphics. Have you considered presenting the problem of compressing pictures? If a picture is in an array, how can it be written to a file in such a way that the file can be read back in and the array reconstructed? There is an "obvious" way with two nested loops, and if students can understand that, maybe they can think of alternatives (of which the simplest is RLE). Then the problem is which of these alternatives typically produce shorter files than the "obvious" technique. I think RLE is given as an exercise in _SICP_, but the book explains how it works and then simply asks you to implement it. Is it possible that, if you mention the problem, some student will just think of RLE or some similar technique? For text compression, you can look at digraph frequencies and trigraph frequencies, or word frequencies. Many early commercial compression applications from the age of the telegraph were not automated, but were performed by humans using codebooks. There's a lot of literature about this recently -- a few of the popular recent books on cryptography discuss this. The codes were not _necessarily_ meant to provide confidentiality, but they'd abbreviate common words. Then you have to think about something like a quoting mechanism if you want to retain the ability to express the abbreviations themselves. (Perhaps that gets into a discussion of lossy versus lossless compression: a system which can't reverse certain abbreviations -- for example, which compresses several different synonyms into the same code -- will be lossy because it can't restore the literal original text. But it could still be useful.) I think there was some material about those commercial codes in the very interesting book _The Victorian Internet_ by Tom Standage. Compression really got a huge boost from the telegraph because of how expensive it could be to transit large quantities of information. (On the other hand, the use of abbreviations is much older, and medieval manuscripts are chock full of standardized abbreviations, including many which have fallen into disuse now.) So you could certainly approach text compression from the word level by starting with word frequency counters. (Those are super-easy with Python dictionaries!) And people can absolutely get useful compression by these methods. Unfortunately, a lot of recent highly-efficient techniques are relatively difficult to understand. I think Huffman coding is often thought of as the boundary between the elementary and the intricate. You can probably _explain_ Huffman coding in a straightforward way, but I doubt you can expect students to discover it for themselves. Unfortunately, beyond Huffman coding, I've usually seen techniques presented in a "cookbook" way -- you know, here is a formal description of this algorithm, but "you are not expected to understand this". This is true for presentations of LZ and LZW I've seen. (Is it a bad idea to mention the issues around the LZW patent?) There is a book about data compression called _The Data Compression Book_ which has code for some modern algorithms in C, and I seem to remember that it has a big bibliography. http://dogma.net/markn/tdcb/tdcb.htm You can also study the bzip algorithm: http://www.bk.isy.liu.se/~svan/bwt/ The biggest open question for me is not whether this is a good area to present to students -- I think it's a _great_ area to present to students -- but to what extent you can expect them to develop their own techniques. I can imagine lots of great exercises, but it's equally easy to imagine a student coming back and saying "I have absolutely no idea how to start here". Important points: - What different techniques work for different kinds of files? - Is there a technique that works for every "kind of file"? - What do we mean by a "kind of file"? How can a computer tell them apart? How would a person tell them apart? - Why are some techniques more effective in particular situations? - Is there a limit to how effective a compression technique can be? How good would the best imaginable technique be? - Is there a compression technique which can compress any file? - Is there such a thing as data that can't be compressed? It might be worth keeping in mind that algorithmic information theory, which deals with the implications of the final question, was originally discovered by a (then) high school student (Gregory J. Chaitin). The connection between compression and Chaitin's work (which actually leads to things like generalizations of Godel's theorems!) is very exciting. See Chaitin's home page: http://www.umcs.maine.edu/~chaitin/ -- Seth David Schoen <schoen@loyalty.org> | And do not say, I will study when I Temp. http://www.loyalty.org/~schoen/ | have leisure; for perhaps you will down: http://www.loyalty.org/ (CAF) | not have leisure. -- Pirke Avot 2:5
On Mon, 28 May 2001, Seth David Schoen wrote:
I fondly remember inventing run-length encoding in junior high, so it occurs to me that that's a good algorithm to work with.
That reminds me of all the things I "invented" as a kid that I soon discovered had been previously invented. I was so close to fame and riches. :-) -Tim -- Tim Wilson | Visit Sibley online: | Check out: Henry Sibley HS | http://www.isd197.org | http://www.zope.org W. St. Paul, MN | | http://slashdot.org wilson@visi.com | <dtml-var pithy_quote> | http://linux.com
Timothy Wilson writes:
On Mon, 28 May 2001, Seth David Schoen wrote:
I fondly remember inventing run-length encoding in junior high, so it occurs to me that that's a good algorithm to work with.
That reminds me of all the things I "invented" as a kid that I soon discovered had been previously invented. I was so close to fame and riches. :-)
Bill Gosper, the genius and accomplished hacker, gave a great interview where he expressed this amusing jealousy for Euler, who had the good fortune to have been born first. Donald J. Albers et al., _More Mathematical People_, p. 106. Great book. There's also the comment that St. Jerome recounted about Aelius Donatus: "Pereant," inquit, "qui ante nos nostra dixerunt." ("May they perish," [Aelius Donatus] said, "who have said our own things before us!") -- Seth David Schoen <schoen@loyalty.org> | And do not say, I will study when I Temp. http://www.loyalty.org/~schoen/ | have leisure; for perhaps you will down: http://www.loyalty.org/ (CAF) | not have leisure. -- Pirke Avot 2:5
On Mon, 28 May 2001, Timothy Wilson wrote:
I'm trying to come up with some more comprehensive projects to supplement my Python class for next fall. I'm definitely going to do a little "clubhouse crypto" (thanks Kirby and Neal Stephenson for picquing my interest).
Before undertaking my own study of text/data compression, I'd like to get the opinions of the list members here about whether you think this could be an appropriate subject for beginning to intermediate Python programmers.
This sounds like a really fun subject! Yes, it sounds very appropriate. I've written a small module to implement the Sandorf cypher, which is pretty fun to play with. The Sandorf cypher description is here: http://acm.uva.es/p/v7/795.html and this code: http://www.lowerstandard.com/python/sandorf.py implements the encryption and decryption routines.
Any particular references to compression algorithms or other background materials?
You may find this page useful: http://www.cs.sfu.ca/cs/CC/365/li/squeeze/ This page gives tutorials on lossless compression, including stuff on huffman encoding and lzw. Another fun project might be to implement run-length encoding on black and white icons.
Not what you are asking for and it's not crypto.. There is a cool Python API for WordNet - fascinatating in its own right and could open up a different kind of crypto based on larger granularity of text chunks, meanings and associations. It might be challenging and enlightening to deal programatically with the English language while learning to speak and think in the Python language. http://www.cs.brandeis.edu/~steele/sources/wordnet-python.html <quote> The wordnet.py module is a Python interface to the Wordnet 1.6 database. WordNet is a database of word meanings and lexical relationships, such as synonym, antonym, hypernym ("poodle"->"dog"), and hyponym ("poodle"->"dog"). wordnet.py presents a concise interface to WordNet, that allows the user to type expressions such as N['dog'], hyponyms(N['dog'][0]), and closure(ADJ['red'], SYNONYM) to query the database. </quote> ./Jason ----- Original Message ----- From: "Timothy Wilson" <wilson@visi.com> To: "Python-Edu SIG" <edu-sig@python.org> Sent: Monday, May 28, 2001 5:40 PM Subject: [Edu-sig] text compression as a learning tool
Hi everyone,
I'm trying to come up with some more comprehensive projects to supplement my Python class for next fall. I'm definitely going to do a little "clubhouse crypto" (thanks Kirby and Neal Stephenson for picquing my interest).
Before undertaking my own study of text/data compression, I'd like to get the opinions of the list members here about whether you think this could be an appropriate subject for beginning to intermediate Python programmers.
Any particular references to compression algorithms or other background materials?
Other thoughts?
-Tim
-- Tim Wilson | Visit Sibley online: | Check out: Henry Sibley HS | http://www.isd197.org | http://www.zope.org W. St. Paul, MN | | http://slashdot.org wilson@visi.com | <dtml-var pithy_quote> | http://linux.com
_______________________________________________ Edu-sig mailing list Edu-sig@python.org http://mail.python.org/mailman/listinfo/edu-sig
participants (4)
-
Daniel Yoo
-
Jason Cunliffe
-
Seth David Schoen
-
Timothy Wilson