[Tutor] Possible Useless Python Challenge. [huffman encoding]

Rob Andrews rob@jam.rr.com
Tue, 08 May 2001 06:19:32 -0500


I'm so glad Danny Yoo became involved in programming instead of
competing on the karaoke circuit or something. Thanks to Danny and
Deirdre both for bringing the subject more into the light.

Rob

Daniel Yoo wrote:
> 
> On Mon, 7 May 2001, Rob Andrews wrote:
> 
> > Sometimes I barely understand a word of what's currently being discussed
> > and still think it sounds nifty. This is such a time.
> >
> > Deirdre Saoirse Moen wrote:
> > >
> > > >On Sun, 6 May 2001, Tesla Coil wrote:
> > > >
> > > >>  KiSS data sets are bundled using LZH compression,
> > > >>  if there's a module for that, I haven't located it.
> > > >
> > > >I haven't found an LZH decompressor for Python.  In fact, the GnomeKISS
> > > >people are depending on a separate LHA extraction program, so perhaps a
> > > >Python KISS program should delegate the handling of LZH compression to the
> > > >LHA utility.
> > >
> > > I've written both Lempel-Ziv and Huffman compressors and
> > > decompressors, but not LZH per se and not in Python. Given that
> > > Python has all the requisite bitwise operators, it would be possible,
> > > but it'd probably be better to write a wrapper for an existing set of
> > > C functions.
> 
> *laugh* I have a little bit of code that almost implements Huffman
> encoding.  (VanL, this uses some of the tree code I mentioned earlier.)
> I was writing this during the afternoon.  Coincidence, huh?
> 
> I'll try to give the flavor of what Huffman encoding is.  We need to give
> a little background though.  [note, this is LONG.]
> 
> When we represent letters --- one-character strings --- Python (and
> almost every other computer language) internally uses a numeric code
> called ASCII (American Standard Code for Information Interchange) to
> encode each character.
> 
> Our computers are bit crunchers, so this means that every character we
> store needs to be saved as a number.  For example, we could look at the
> string:
> 
>     "hello world"
> 
> as a list of ASCII numbers:
> 
> ###
> >>> map(ord, 'hello world')
> [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
> ###
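> 
> Going the other direction, chr() turns an ASCII number back into its
> character, so ord() and chr() undo each other:
> 
> ###
> >>> chr(104)
> 'h'
> >>> map(chr, [104, 101, 108, 108, 111])
> ['h', 'e', 'l', 'l', 'o']
> ###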
> 
> but this is still one step removed from how computers REALLY represent
> characters; they represent them, not just in numeric form, but in crude,
> binary, base-two!  Ones and zeros, and that's really weird when we first
> see it.  Here's a small program that will try to reveal what's underneath:
> 
> ###
> >>> import string, pprint
> >>> def binaryStringHelper(n):
> ...     # Peel off the lowest bit (n % 2), recurse on the rest (n >> 1).
> ...     if n == 0: return ''
> ...     else: return binaryStringHelper(n >> 1) + str(n % 2)
> ...
> >>> def toBinaryString(n):
> ...     # Pad on the left with zeros so every number shows all 8 bits.
> ...     if n == 0: return string.zfill('0', 8)
> ...     return string.zfill(binaryStringHelper(n), 8)
> ...
> >>> pprint.pprint(map(toBinaryString, range(10)))
> ['00000000',          ## ...stands for 0
>  '00000001',          ## ...stands for 1
>  '00000010',          ##               2
>  '00000011',          ##               3
>  '00000100',          ##             ...
>  '00000101',
>  '00000110',
>  '00000111',
>  '00001000',          ##               8
>  '00001001']          ##               9
> ###
> 
> This is doing some hacky bit manipulation which isn't really important:
> the important thing is that the program above lets us visualize all
> those ones and zeros that the computer actually uses to represent a
> character.  We can apply this to our string:
> 
> ###
> >>> pprint.pprint(map(toBinaryString, map(ord, 'hello world')))
> ['01101000',
>  '01100101',
>  '01101100',
>  '01101100',
>  '01101111',
>  '00100000',
>  '01110111',
>  '01101111',
>  '01110010',
>  '01101100',
>  '01100100']
> ###
> 
> and this is how computers today truly store 'hello world' in memory: as
> streams of ones and zeros.  Every character is eight bits, so any
> string of length 'n' will take up '8*n' bits.  Our small sentence, 'hello
> world', takes up 88 bits!
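> 
> A quick sanity check, since 'hello world' is eleven characters at eight
> bits each:
> 
> ###
> >>> len('hello world') * 8
> 88
> ###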
> 
> Ok, this doesn't sound as world-threatening as we're making it out to be,
> but what is neat about Huffman encoding is that we can usually do much
> better than 88 bits.
> 
> What Huffman realized is that we're not taking advantage of a certain
> curious property of human language: we use certain words over and over,
> certain letters over and over, and our language is fraught with
> repetitious repetition. What if we don't always represent a character by 8
> bits, but by something that depends on how frequently the letter occurs in
> our language?  Instead of having every letter be represented by eight
> bits, we can make a "variable length" encoding.  That is, perhaps we can
> use the following code:
> 
>     'h' ===> 0000
>     'w' ===> 0001
>     'd' ===> 0010
>     'e' ===> 0011
>     'o' ===> 01
>     'r' ===> 100
>     ' ' ===> 101
>     'l' ===> 11
> 
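> The lengths in that table come straight from how often each letter
> shows up in the message.  Here's one quick way to tally 'hello world'
> (just a plain dictionary count):
> 
> ###
> >>> freq = {}
> >>> for c in 'hello world':
> ...     freq[c] = freq.get(c, 0) + 1
> ...
> >>> items = freq.items()
> >>> items.sort()
> >>> items
> [(' ', 1), ('d', 1), ('e', 1), ('h', 1), ('l', 3), ('o', 2), ('r', 1), ('w', 1)]
> ###
> 
> Notice that 'l', which shows up three times, gets the two-bit code,
> while the letters that show up just once get three or four bits.
> 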
> That is, instead of using all eight bits on every letter, we can use a
> shorter binary sequence for really frequent characters, and longer codes
> for letters that we don't use as often.  For letters that don't occur at
> all, we don't even need to give them a code.  Huffman's encoding gives us
> a way of figuring out really good codes that do this for us.  What this
> means is that, instead of representing 'hello world' as:
> 
> ###
> >>> string.join(map(toBinaryString, map(ord, 'hello world')), '')
> '0110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
> ###
> 
> we can do it like this:
> 
> ###
> >>> code = {'h' : '0000', 'w' : '0001', 'd' : '0010', 'e' : '0011',
> ...         'o' : '01',   'r' : '100',  ' ' : '101',  'l' : '11'}
> >>> string.join(map(lambda c: code[c], 'hello world'), '')
> '00000011111101101000101100110010'
> ###
> 
> which, at 32 bits instead of 88, is a heck of a lot shorter than what
> we had previously.  This is the heart of Huffman encoding: to find a
> binary representation for each character so that a message doesn't take
> so much space to store.
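> 
> For the curious, here's a quick sketch of the classic construction
> (just the textbook idea, not the huffman.py attached below): start with
> one tree per character, weighted by frequency, and repeatedly merge the
> two lightest trees until only one is left.  Reading a left branch as
> '0' and a right branch as '1' then gives each character its code:
> 
> ###
> >>> def buildTree(freqs):
> ...     # freqs is a list of (count, symbol) pairs.  Repeatedly merge
> ...     # the two lightest trees; sorting stands in for a priority queue.
> ...     trees = freqs[:]
> ...     while len(trees) > 1:
> ...         trees.sort()
> ...         (c1, t1), (c2, t2) = trees[0], trees[1]
> ...         trees = trees[2:] + [(c1 + c2, (t1, t2))]
> ...     return trees[0][1]
> ...
> >>> def collectCodes(tree, prefix, code):
> ...     # Walk the tree: internal nodes are 2-tuples, leaves are symbols.
> ...     if type(tree) == type(()):
> ...         collectCodes(tree[0], prefix + '0', code)
> ...         collectCodes(tree[1], prefix + '1', code)
> ...     else:
> ...         code[tree] = prefix
> ...
> >>> freqs = [(3, 'l'), (2, 'o'), (1, 'h'), (1, 'e'),
> ...          (1, ' '), (1, 'w'), (1, 'r'), (1, 'd')]
> >>> code2 = {}
> >>> collectCodes(buildTree(freqs), '', code2)
> >>> len(string.join(map(lambda c: code2[c], 'hello world'), ''))
> 32
> ###
> 
> The individual codes can come out differently depending on how ties get
> broken, but the total, 32 bits, will be the same.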
> 
> Of course this is handwavy, because whoever wants to read our message
> needs to have a copy of the dictionary!  They need to be able to convert
> our binary message back into something readable, so we usually send off
> the dictionary alongside the encoded text.  When a message gets large
> enough, though, that overhead becomes negligible compared to the amount of
> space we save.
> 
> Finally, the code itself needs to guarantee that, given a bit string,
> there's only one way to rehydrate it back to its nice spongy, wordy form.
> Not coincidentally, Huffman encoding guarantees this for us.  Really smart
> guy, and really neat algorithm.
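> 
> To see that guarantee in action, here's a tiny decoder that peels bits
> off the front of the string until it has matched a complete codeword.
> The trick is that no code in our table is a prefix of another code, so
> there's never any ambiguity about where one character ends and the next
> begins:
> 
> ###
> >>> def decode(bits, code):
> ...     # Invert the code table, then match codewords greedily.
> ...     decoder = {}
> ...     for (char, word) in code.items():
> ...         decoder[word] = char
> ...     result, word = '', ''
> ...     for bit in bits:
> ...         word = word + bit
> ...         if decoder.has_key(word):
> ...             result = result + decoder[word]
> ...             word = ''
> ...     return result
> ...
> >>> decode('00000011111101101000101100110010', code)
> 'hello world'
> ###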
> 
> Does this make sense?  I get the feeling I talked too much on this one
> already.  *grin*
> 
> The incomplete code to do Huffman encoding should be attached to this
> message.  As soon as I get it in a nicer form, I'll send it off to Useless
> Python.
> 
>   ------------------------------------------------------------------------
>   [Attachment: huffman.py, Plain Text (TEXT/PLAIN), BASE64 encoded]

-- 

Useless Python!
If your Python is this useless, we need you.
http://www.lowerstandard.com/python/pythonsource.html