[Tutor] Re: Base 207 compression algorithm

Jeff Shannon jeff@ccvcorp.com
Thu Jun 26 16:39:01 2003

cino hilliard wrote:

> To:Jeff Shannon and all other Python users
> Cino Hilliard wrote:
>>> # We can determine the compression ratio for various bases.
>>> # E.g., base 2 = 332%, base 8 = 111%, base 10 = 100%,
>>> # base 100 = 50%, base 207 = 43.2%.
> Jeff Shannon wrote:
>> This is mistaken, because you're only changing how many characters 
>> are used to display it on the screen.  No matter what base it's 
>> displayed in, it is still *stored* in binary, and this will not 
>> compress anything. Whether you see 'FF' or '255' or '@' (or whatever 
>> other character might be used to represent that number in whatever 
>> base you try to use), it still must occupy one byte of memory / hard 
>> drive space.  Once again, you're confusing the display with 
>> internals.  Changing the base only affects display.
> Does anyone else in the python community agree with this?
> Attached is a script that can be used to zip numbers. It is a base 
> 2-207 compression algorithm that I
> developed using python's arbitrary integer precision feature. Base 207 
> was used for the output in this
> example.
> Here is the output for 5000 digits of Pi stored to two files pib10.txt 
> for the decimal expansion and
> pib207.txt for the base 207 conversion. I included Win Zip and pkzip 
> files also. You will notice that the
> compression in base 207 gives a better ratio than the Zip file 
> compression. The compression ratio
improves logarithmically with increasing bases.

That is because you are comparing the sizes of the strings used to 
represent a number, rather than demonstrating any real compression.  And 
you're not going to have much luck actually doing math with either of 
these strings.  In either case, you'd get better "compression" by simply 
storing Pi as a binary number: 5000 decimal digits of precision works 
out to about 5000 * log2(10), or roughly 16,600 bits -- a bit over 2,000 
bytes.  That is well under your original 5,000-character decimal string, 
and even slightly under your 2,100-character base-207 string, because 
each stored character occupies a full byte while carrying only log2(207) 
-- just under eight -- bits of information.  (Of course, note that '5000 
digits of precision' inherently implies that one is speaking of a 
particular base, which we presume to be decimal...)
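The arithmetic above is easy to check for yourself; this little sketch 
(my own illustration, not anything from the attached script) compares 
the information content of 5000 decimal digits with the two string 
representations, assuming one byte per stored character:

```python
import math

# How many bits does a number with 5000 significant decimal digits need?
digits = 5000
bits_needed = digits * math.log2(10)         # about 16,610 bits
bytes_needed = math.ceil(bits_needed / 8)    # about 2,077 bytes

# String representations, at one byte per character:
decimal_chars = digits                       # 5,000 bytes
base207_chars = math.ceil(digits * math.log(10) / math.log(207))  # ~2,159

print(bytes_needed, decimal_chars, base207_chars)
```

So the base-207 string is close to the best you can do with one symbol 
per byte, yet it still never beats plain binary -- which is the form the 
data was in to begin with.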

The real issue here, of course, is that this is not truly compression in 
the general sense, because it only applies to string representations of 
numbers.  If you really want to show compression, then take an arbitrary 
string (say, the contents of your Windows' autoexec.bat or the contents 
of your *nix /etc/passwd, or any generic logfile) and show how that can 
be expressed in fewer characters by converting to a higher-radix number. 
 Don't forget to show how to re-create the original file afterwards.  
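To make the point concrete, here is a hypothetical round-trip (the 
function names and sample data are mine, not from the attached script): 
treat an arbitrary byte string as one big base-256 integer, re-express 
it with 207 distinct symbols, and compare lengths.  The "compressed" 
output is actually longer, because each stored symbol still occupies a 
whole byte but carries only log2(207) < 8 bits of information:

```python
def encode_base207(data: bytes) -> bytes:
    # Interpret the input as a big base-256 integer, then emit its
    # base-207 digits, one digit per output byte.
    n = int.from_bytes(data, "big")
    out = bytearray()
    while n:
        n, r = divmod(n, 207)
        out.append(r)
    out.reverse()
    return bytes(out)

def decode_base207(enc: bytes) -> bytes:
    # Rebuild the integer from its base-207 digits, then re-emit bytes.
    n = 0
    for b in enc:
        n = n * 207 + b
    return n.to_bytes((n.bit_length() + 7) // 8, "big")

sample = b"root:x:0:0:root:/root:/bin/bash\n" * 20
enc = encode_base207(sample)
assert decode_base207(enc) == sample   # lossless round trip
print(len(sample), len(enc))           # the encoded form is LONGER
```

(Leading zero bytes would be dropped by this naive round trip; a real 
codec would have to record the original length.  That detail only makes 
the "compression" worse, not better.)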

And of course, even as far as representing numbers, this is not good 
compression, because expressing a binary integer or float will take much 
less space than representing a string of characters in any base.  On a 
standard computer, where each character occupies one byte, a single byte 
can distinguish at most 256 characters -- exactly as many values as 
that same byte can hold as a binary integer.  Take sys.maxint as an 
example: it is the largest number that fits in a (four-byte) integer, 
yet it takes ten characters to write in decimal, and five even in base 
207.
 You could claim that it only takes four bytes in base 256 -- but at 
that point, the computer representation is *exactly* the same as for a 
binary integer.  And you cannot improve this on a standard computer by 
using a base higher than 256, because each character will require 
multiple bytes, so your 'three character' string in base 1000 (and who 
could keep track of a thousand different symbols, anyhow?) will require 
at least six bytes to represent it.
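Those digit counts are easy to verify.  A minimal sketch (using 2**31 - 
1 directly, since that's the value of sys.maxint on a 32-bit machine):

```python
def ndigits(n, base):
    # Count how many base-`base` digits it takes to write n.
    count = 0
    while n:
        n //= base
        count += 1
    return count

n = 2**31 - 1            # sys.maxint on a 32-bit machine; 4 bytes in binary
print(ndigits(n, 10))    # 10 decimal characters
print(ndigits(n, 207))   # 5 base-207 characters, each still one byte
print(ndigits(n, 256))   # 4 base-256 "characters" -- i.e. the raw bytes
```

Base 256 is the break-even point: one digit per byte, which is just the 
binary representation itself.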

Jeff Shannon
Credit International