[Tutor] Re:Base 207 compression algorithm
Jeff Shannon
jeff@ccvcorp.com
Thu Jun 26 16:39:01 2003
cino hilliard wrote:
> To:Jeff Shannon and all other Python users
>
> Cino Hilliard wrote:
>
>>> # we can determine the compression ratio for various bases. # Eg.,
>>> base 2 = 332%,
>>> # base 8 =111%, base 10 =100%, base 100 = 50%, base 207 = 43.2%.
>>
>>
>>
> Jeff Shannon wrote:
>
>> This is mistaken, because you're only changing how many characters
>> are used to display it on the screen. No matter what base it's
>> displayed in, it is still *stored* in binary, and this will not
>> compress anything. Whether you see 'FF' or '255' or '@' (or whatever
>> other character might be used to represent that number in whatever
>> base you try to use), it still must occupy one byte of memory / hard
>> drive space. Once again, you're confusing the display with
>> internals. Changing the base only affects display.
>
>
> Does anyone else in the python community agree with this?
>
>
> Attached is a script that can be used to zip numbers. It is a base
> 2-207 compression algorithm that I
> developed using python's arbitrary integer precision feature. Base 207
> was used for the output in this
> example.
>
> Here is the output for 5000 digits of Pi stored to two files pib10.txt
> for the decimal expansion and
> pib207.txt for the base 207 conversion. I included Win Zip and pkzip
> files also. You will notice that the
> compression in base 207 gives a better ratio than the Zip file
> compression. The compression ratio
> improves logorithmically with increasing bases.
That is because you are comparing the size of strings used to represent
a number, rather than demonstrating any real compression. However,
you're not going to have any luck in actually doing any math with either
of these strings. In either case, you'd have better "compression" by
simply expressing Pi as a large binary float -- you could probably get
that much precision in less than a hundred bytes (probably *much* less),
compared to your 2100. (I'm not about to try to do the math to
determine how many floating-point bits would be needed to achieve 5000
digits of precision, but I'm quite confident that it's at *least* an
order of magnitude less than your string representation. Of course,
note that '5000 digits of precision' inherently implies that one is
speaking of a particular base, which we presume to be decimal...)
The real issue here, of course, is that this is not truly compression in
the general sense, because it only applies to string representations of
numbers. If you really want to show compression, then take an arbitrary
string (say, the contents of your Windows' autoexec.bat or the contents
of your *nix /etc/passwd, or any generic logfile) and show how that can
be expressed in fewer characters by converting to a higher-radix number.
Don't forget to show how to re-create the original file afterwards.
And of course, even as far as representing numbers, this is not good
compression, because expressing a binary integer or float will take much
less space than representing a string of characters in any base. For a
computer (where each character is represented by a 1-byte value), it is
not possible to have more characters that can be represented by a single
byte than there are integers that can be expressed by that single byte.
If you look at sys.maxint as an example, this is the largest number
that can be expressed in a (four-byte) integer. But it takes ten
characters to represent it in decimal, and it'd take 6 or 7 in base 207.
You could claim that it only takes four bytes in base 256 -- but at
that point, the computer representation is *exactly* the same as for a
binary integer. And you cannot improve this on a standard computer by
using a base higher than 256, because each character will require
multiple bytes, so your 'three character' string in base 1000 (and who
could keep track of a thousand different symbols, anyhow?) will require
at least six bytes to represent it.
Jeff Shannon
Technician/Programmer
Credit International