[Tutor] Possible Useless Python Challenge. [huffman encoding]

steve lonetwin@yahoo.com
Tue, 8 May 2001 14:20:43 +0530


Hey Dan,
 Thanks! Every day I learn something new, and I just finished today's quota!!!

Peace
Steve

P.S: You're pretty good at explaining stuff, did anybody ever tell you that???


On Tuesday 08 May 2001 10:35, you wrote:

> > On Mon, 7 May 2001, Rob Andrews wrote:
> > Sometimes I barely understand a word of what's currently being discussed
> > and still think it sounds nifty. This is such a time.
> >
> > Deirdre Saoirse Moen wrote:
> > > >On Sun, 6 May 2001, Tesla Coil wrote:
> > > >>  KiSS data sets are bundled using LZH compression,
> > > >>  if there's a module for that, I haven't located it.
> > > >
> > > >I haven't found an LZH decompressor for Python.  In fact, the
> > > > GnomeKISS people are depending on a separate LHA extraction program,
> > > > so perhaps a Python KISS program should delegate the handling of LZH
> > > > compression to the LHA utility.
> > >
> > > I've written both Lempel-Ziv and Huffman compressors and
> > > decompressors, but not LZH per se and not in Python. Given that
> > > Python has all the requisite bitwise operators, it would be possible,
> > > but it'd probably be better to write a wrapper for an existing set of
> > > C functions.
>
> *laugh* I have a little bit of code that almost implements Huffman
> encoding.  (VanL, this uses some of the tree code I mentioned earlier.)
> I was writing this during the afternoon.  Coincidence, huh?
>
> I'll try to give the flavor of what Huffman encoding is.  We need to give
> a little background though.  [note, this is LONG.]
>
> When we represent letters --- one character strings --- Python (and mostly
> every other computer language) internally uses a numeric code called
> ASCII: American Standard Code for Information Interchange, to encode each
> character.
>
> Our computers are bit crunchers, so this means that every character we
> store needs to be saved as a number.  For example, we could look at the
> string:
>
>     "hello world"
>
> as a list of ASCII numbers:
>
> ###
>
> >>> map(ord, 'hello world')
> [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
> ###
>
> but this is still one step removed from how computers REALLY represent
> characters; they represent them, not just in numeric form, but in crude,
> binary, base-two!  Ones and zeros, and that's really weird when we first
> see it.  Here's a small program that will try to reveal what's underneath:
>
> ###
>
> >>> import string, pprint
> >>> def binaryStringHelper(n):
> ...     if n == 0: return ''
> ...     else: return binaryStringHelper(n >> 1) + str(n % 2)
> ...
> >>> def toBinaryString(n):
> ...     if n == 0: return string.zfill(0, 8)
> ...     return string.zfill(int(binaryStringHelper(n)), 8)
> ...
> >>> pprint.pprint(map(toBinaryString, range(10)))
>
> ['00000000',          ## ...stands for 0
>  '00000001',          ## ...stands for 1
>  '00000010',          ##               2
>  '00000011',          ##               3
>  '00000100',          ##             ...
>  '00000101',
>  '00000110',
>  '00000111',
>  '00001000',          ##               8
>  '00001001']          ##               9
> ###
>
> This is doing some hacky bit manipulation which isn't really important:
> the important thing is that this program above allows us to visualize all
> those ones and zeros that the computer actually uses to represent a
> character.  We can apply this on our string:
>
>
> ###
>
> >>> pprint.pprint(map(toBinaryString, map(ord, 'hello world')))
> ['01101000',
>  '01100101',
>  '01101100',
>  '01101100',
>  '01101111',
>  '00100000',
>  '01110111',
>  '01101111',
>  '01110010',
>  '01101100',
>  '01100100']
> ###
>
> and this is how computers today truly store 'hello world' in memory,
> as streams of ones and zeros.  Every character is eight bits, so any
> string of length 'n' will take up '8*n' bits.  Our small sentence, 'hello
> world', takes up 88 bits!
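>
> A quick check of that arithmetic:
>
> ###
> >>> 8 * len('hello world')
> 88
> ###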
>
>
> Ok, this doesn't sound as world-threatening as we're making it out to be,
> but what is neat about Huffman encoding is that we can usually do much
> better than 88 bits.
>
> What Huffman realized is that we're not taking advantage of a certain
> curious property of human language: we use certain words over and over,
> certain letters over and over, and our language is fraught with
> repetitious repetition.  What if we don't always represent a character by 8
> bits, but by something that depends on how frequently the letter occurs in
> our language?  Instead of having every letter be represented by eight
> bits, we can make a "variable length" encoding.  That is, perhaps we can
> use the following code:
>
>     'h' ===> 0000
>     'w' ===> 0001
>     'd' ===> 0010
>     'e' ===> 0011
>     'o' ===> 01
>     'r' ===> 100
>     ' ' ===> 101
>     'l' ===> 11
>
> That is, instead of using all eight bits on every letter, we can use a
> shorter binary sequence for really frequent characters, and longer codes
> for letters that we don't use as often.  For letters that don't occur at
> all, we don't even need to give them a code.  Huffman's encoding gives us
> a way of figuring out really good codes that do this for us.  What this
> means is that, instead of representing 'hello world' as:
>
> ###
>
> >>> string.join(map(toBinaryString, map(ord, 'hello world')), '')
> '0110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
> ###
>
>
> we can do it like this:
>
> ###
>
> >>> code = {'h' : '0000', 'w' : '0001', 'd' : '0010', 'e' : '0011',
> ...         'o' : '01',   'r' : '100',  ' ' : '101',  'l' : '11'}
> >>> string.join(map(lambda c: code[c], 'hello world'), '')
> '00000011111101101000101100110010'
> ###
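>
> Counting the bits:
>
> ###
> >>> len(string.join(map(lambda c: code[c], 'hello world'), ''))
> 32
> ###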
>
> which is a heck of a lot shorter than what we had previously.  This is the
> heart of Huffman encoding: to find a binary representation for each
> character so that a message doesn't take so much space to store.
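>
> Just to give a flavor of how those codes get found, here is a toy
> version of Huffman's algorithm (only a sketch, and not the attached
> huffman.py): count how often each character occurs, repeatedly merge the
> two lightest trees, and then read each character's code off the merges,
> '0' for a left branch and '1' for a right branch:
>
> ###
> def huffman_code(text):
>     # Count character frequencies.
>     freq = {}
>     for ch in text:
>         freq[ch] = freq.get(ch, 0) + 1
>     # Work list of (weight, tiebreak, tree) entries; a tree is either a
>     # single character or a (left, right) pair of subtrees.  The unique
>     # tiebreak number keeps sort() from ever comparing two trees.
>     trees = []
>     tiebreak = 0
>     for ch in freq.keys():
>         trees.append((freq[ch], tiebreak, ch))
>         tiebreak = tiebreak + 1
>     while len(trees) > 1:
>         trees.sort()
>         (w1, tb1, left), (w2, tb2, right) = trees[0], trees[1]
>         trees = trees[2:] + [(w1 + w2, tiebreak, (left, right))]
>         tiebreak = tiebreak + 1
>     # Walk the finished tree, handing out bits along the way.
>     code = {}
>     def walk(tree, path):
>         if isinstance(tree, tuple):
>             walk(tree[0], path + '0')
>             walk(tree[1], path + '1')
>         else:
>             code[tree] = path or '0'   # a lone symbol still gets one bit
>     walk(trees[0][2], '')
>     return code
> ###
>
> Running it on our sentence gives a code in the same spirit as the
> hand-made table above; the exact zeros and ones can come out differently
> depending on how ties get broken, but the total is the same 32 bits:
>
> ###
> >>> found = huffman_code('hello world')
> >>> len(string.join(map(lambda c: found[c], 'hello world'), ''))
> 32
> ###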
>
> Of course this is handwavy, because whoever wants to read our message
> needs to have a copy of the dictionary!  They need to be able to convert
> our binary message back into something readable, so we usually send off
> the dictionary alongside the encoded text.  When a message gets large
> enough, though, that overhead becomes negligible compared to the amount of
> space we save.
>
> Finally, the code itself needs to guarantee that, given a bit string,
> there's only one way to rehydrate it back to its nice spongy, wordy form.
> Not coincidentally, Huffman encoding guarantees this for us.  Really smart
> guy, and really neat algorithm.
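>
> Here's a sketch of why that works (again, just a sketch; it assumes the
> 'code' dictionary from the example above): since no codeword is a prefix
> of another, we can read bits until they match a codeword, spit out that
> character, and start over:
>
> ###
> def huffman_decode(bits, code):
>     # Invert the code table: bit string -> character.
>     decode = {}
>     for ch in code:
>         decode[code[ch]] = ch
>     result, buffer = '', ''
>     for bit in bits:
>         buffer = buffer + bit
>         # The prefix property guarantees that the first codeword we
>         # match is the only one we ever could have matched.
>         if buffer in decode:
>             result = result + decode[buffer]
>             buffer = ''
>     return result
> ###
>
> For example:
>
> ###
> >>> huffman_decode('00000011111101101000101100110010', code)
> 'hello world'
> ###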
>
> Does this make sense?  I get the feeling I talked too much on this one
> already.  *grin*
>
>
> The incomplete code to do Huffman encoding should be attached to this
> message.  As soon as I get it in a nicer form, I'll send it off to Useless
> Python.

[Attachment: huffman.py]

-- 
||||||||||||||||||||||
|||||||||#####||||||||
||||||||#######|||||||
||||||||# O O #|||||||
||||||||#\ ~ /#|||||||
||||||##||\_/||##|||||
|||||#||||||||||##||||
||||#||||||||||||##|||
||||#|||||||||||||##||
|||/\##|||||||||##/\||
|/    \#########/    \
|\     \#######/     /
||\____/#######\____/|
========================================
"Unfortunately, those people who have nothing better to do than post on the
Internet all day long are rarely the ones who have the most insights."
========================================