[Tutor] Possible Useless Python Challenge. [huffman encoding]
steve
lonetwin@yahoo.com
Tue, 8 May 2001 14:20:43 +0530
Hey Dan,
Thanx, Every day I learn something new, an' I just finished today's quota !!!
Peace
Steve
P.S: U pretty good at xplainin' stuff, did n e body ever tell u that ???
On Tuesday 08 May 2001 10:35, you wrote:
> > On Mon, 7 May 2001, Rob Andrews wrote:
> > Sometimes I barely understand a word of what's currently being discussed
> > and still think it sounds nifty. This is such a time.
> >
> > Deirdre Saoirse Moen wrote:
> > > >On Sun, 6 May 2001, Tesla Coil wrote:
> > > >> KiSS data sets are bundled using LZH compression,
> > > >> if there's a module for that, I haven't located it.
> > > >
> > > >I haven't found an LZH decompressor for Python. In fact, the
> > > > GnomeKISS people are depending on a separate LHA extraction program,
> > > > so perhaps a Python KISS program should delegate the handling of LZH
> > > > compression to the LHA utility.
> > >
> > > I've written both Lempel-Ziv and Huffman compressors and
> > > decompressors, but not LZH per se and not in Python. Given that
> > > Python has all the requisite bitwise operators, it would be possible,
> > > but it'd probably be better to write a wrapper for an existing set of
> > > C functions.
>
> *laugh* I have a little bit of code that almost implements Huffman
> encoding. (VanL, this uses some of the tree code I mentioned earlier.)
> I was writing this during the afternoon. Coincidence, huh?
>
> I'll try to give the flavor of what Huffman encoding is. We need to give
> a little background though. [note, this is LONG.]
>
> When we represent letters --- one character strings --- Python (and mostly
> every other computer language) internally uses a numeric code called
> ASCII: American Standard Code for Information Interchange, to encode each
> character.
>
> Our computers are bit crunchers, so this means that every character we
> store needs to be saved as a number. For example, we could look at the
> string:
>
> "hello world"
>
> as a list of ASCII numbers:
>
> ###
>
> >>> map(ord, 'hello world')
>
> [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
> ###
>
> but this is still one step removed from how computers REALLY represent
> characters; they represent them, not just in numeric form, but in crude,
> binary, base-two! Ones and zeros, and that's really weird when we first
> see it. Here's a small program that will try to reveal what's underneath:
>
> ###
>
> >>> import string, pprint
>
> >>> def binaryStringHelper(n):
>
> ... if n == 0: return ''
> ... else: return binaryStringHelper(n >> 1) + str(n % 2)
> ...
>
> >>> def toBinaryString(n):
>
> ... if n == 0: return string.zfill(0, 8)
> ... return string.zfill(int(binaryStringHelper(n)), 8)
> ...
>
> >>> pprint.pprint(map(toBinaryString, range(10)))
>
> ['00000000', ## ...stands for 0
> '00000001', ## ...stands for 1
> '00000010', ## 2
> '00000011', ## 3
> '00000100', ## ...
> '00000101',
> '00000110',
> '00000111',
> '00001000', ## 8
> '00001001'] ## 9
> ###
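> (If your Python has the built-in format() function, by the way, the same
> picture falls out without any helper functions at all --- '08b' means
> "base two, zero-padded to eight places." A quick sketch:)

```python
# format(n, '08b') renders n in base two, zero-padded to eight digits,
# so each character shows up as the eight bits the machine stores.
for ch in 'hello world':
    print(ch, '->', format(ord(ch), '08b'))
```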
>
> This is doing some hacky bit manipulation which isn't really important:
> the important thing is that this program above allows us to visualize all
> those ones and zeros that the computer actually uses to represent a
> character. We can apply this to our string:
>
>
> ###
>
> >>> pprint.pprint(map(toBinaryString, map(ord, 'hello world')))
>
> ['01101000',
> '01100101',
> '01101100',
> '01101100',
> '01101111',
> '00100000',
> '01110111',
> '01101111',
> '01110010',
> '01101100',
> '01100100']
> ###
>
> and this is how computers today truly store 'hello world' in memory,
> as streams of ones and zeros. Every character is eight bits, so any
> string of length 'n' will take up '8*n' bits. Our small sentence, 'hello
> world', takes up 88 bits!
>
>
> Ok, this doesn't sound as world-threatening as we're making it out to be,
> but what is neat about Huffman encoding is that we can usually do much
> better than 88 bits.
>
> What Huffman realized is that we're not taking advantage of a certain
> curious property about human language: we use certain words over and over,
> certain letters over and over, and our language is fraught with
> repetitious repetition. What if we don't always represent a character by 8
> bits, but by something that depends on how frequently the letter occurs in
> our language? Instead of having every letter be represented by eight
> bits, we can make a "variable length" encoding. That is, perhaps we can
> use the following code:
>
> 'h' ===> 0000
> 'w' ===> 0001
> 'd' ===> 0010
> 'e' ===> 0011
> 'o' ===> 01
> 'r' ===> 100
> ' ' ===> 101
> 'l' ===> 11
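> (One thing worth noticing about this table: no code word is the beginning
> of another --- '01' for 'o' never starts off any of the longer codes, for
> example. That property matters later for decoding. A quick check of the
> claim, with the table typed out as a dict just for illustration:)

```python
# A variable-length code can be decoded by simple left-to-right
# scanning only if it is "prefix-free": no code word starts another.
code = {'h': '0000', 'w': '0001', 'd': '0010', 'e': '0011',
        'o': '01', 'r': '100', ' ': '101', 'l': '11'}

# After sorting, any word that is a prefix of another would sit
# immediately before a word it prefixes, so checking adjacent
# pairs is enough.
words = sorted(code.values())
ok = all(not b.startswith(a) for a, b in zip(words, words[1:]))
print(ok)  # -> True
```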
>
> That is, instead of using all eight bits on every letter, we can use a
> shorter binary sequence for really frequent characters, and longer codes
> for letters that we don't use as often. For letters that don't occur at
> all, we don't even need to give them a code. Huffman's encoding gives us
> a way of figuring out really good codes that do this for us. What this
> means is that, instead of representing 'hello world' as:
>
> ###
>
> >>> string.join(map(toBinaryString, map(ord, 'hello world')), '')
>
> '0110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
> ###
>
>
> we can do it like this:
>
> ###
>
> >>> code = {'h' : '0000', 'w' : '0001', 'd' : '0010', 'e' : '0011',
>
> 'o' : '01', 'r' : '100', ' ' : '101', 'l' : '11'}
>
> >>> string.join(map(lambda c: code[c], 'hello world'), '')
>
> '00000011111101101000101100110010'
> ###
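> (To put actual numbers on the savings, retyping the table as a dict ---
> in newer Pythons, ''.join does the job that string.join does here:)

```python
code = {'h': '0000', 'w': '0001', 'd': '0010', 'e': '0011',
        'o': '01', 'r': '100', ' ': '101', 'l': '11'}
msg = 'hello world'
encoded = ''.join(code[c] for c in msg)

# 8 bits per character in plain ASCII vs the variable-length code
print(len(msg) * 8, 'bits down to', len(encoded), 'bits')  # -> 88 bits down to 32 bits
```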
>
> which is a heck of a lot shorter than what we had previously. This is the
> heart of Huffman encoding: to find a binary representation for each
> character so that a message doesn't take so much space to store.
>
> Of course this is handwavy, because whoever wants to read our message
> needs to have a copy of the dictionary! They need to be able to convert
> our binary message back into something readable, so we usually send off
> the dictionary alongside the encoded text. When a message gets large
> enough, though, that overhead becomes negligible compared to the amount of
> space we save.
>
> Finally, the code itself needs to guarantee that, given a bit string,
> there's only one way to rehydrate it back to its nice spongy, wordy form.
> Not coincidentally, Huffman encoding guarantees this for us. Really smart
> guy, and really neat algorithm.
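> (In case anyone wants to peek ahead, here's one way the whole pipeline
> might be sketched end to end --- count the letters, build the code
> greedily with a priority queue, then encode and decode. This is my own
> stand-in sketch with made-up names, not the attached huffman.py; it
> leans on the heapq and collections modules from more recent Pythons,
> and it doesn't bother with the one-character-message corner case:)

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code: rarer characters get longer bit strings."""
    # Heap entries are (frequency, tiebreaker, {char: code-so-far});
    # the unique tiebreaker keeps the dicts from ever being compared.
    heap = [(freq, i, {ch: ''})
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing a '0' bit
        # on one side and a '1' bit on the other.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: '0' + bits for ch, bits in left.items()}
        merged.update({ch: '1' + bits for ch, bits in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text, code):
    return ''.join(code[ch] for ch in text)

def decode(bits, code):
    # Simple left-to-right scanning, which is safe exactly because
    # no code word is a prefix of another.
    rev = {v: k for k, v in code.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in rev:
            out.append(rev[cur])
            cur = ''
    return ''.join(out)

code = huffman_code('hello world')
bits = encode('hello world', code)
print(len(bits), decode(bits, code) == 'hello world')  # -> 32 True
```

> (The exact code words can come out differently depending on how ties
> break, but the total encoded length is the same for any Huffman tree.)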
>
> Does this make sense? I get the feeling I talked too much on this one
> already. *grin*
>
>
> The incomplete code to do Huffman encoding should be attached to this
> message. As soon as I get it in a nicer form, I'll send it off to Useless
> Python.
[Attachment: huffman.py (TEXT/PLAIN, US-ASCII, base64)]
-- 
||||||||||||||||||||||
|||||||||#####||||||||
||||||||#######|||||||
||||||||# O O #|||||||
||||||||#\ ~ /#|||||||
||||||##||\_/||##|||||
|||||#||||||||||##||||
||||#||||||||||||##|||
||||#|||||||||||||##||
|||/\##|||||||||##/\||
|/ \#########/ \
|\ \#######/ /
||\____/#######\____/|
========================================
"Unfortunately, those people who have nothing better to do than post on the
Internet all day long are rarely the ones who have the most insights."
========================================