Algorithm that makes maximum compression of completly diffused data.
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Nov 8 05:47:46 CET 2013
On Thu, 07 Nov 2013 18:43:17 -0800, Mark Janssen wrote:
>>> I am not sure if it is just stupidness or laziness that prevent you
>>> from seeing that 4^8=65536.
>>
>> I can see that 4^8 = 65536. Now how are you going to render 65537? You
>> claimed that you could render *any* number efficiently. What you've
>> proven is that a small subset of numbers can be rendered efficiently.
>
> I think the idea would be to find the prime factorization for a given
> number, which has been proven to be available (and unique) for any and
> every number. Most numbers can compress given this technique.
Sadly, they would not.
Let's suppose you wish to compress the number 143. So you find the prime
factorization, 11*13=143. Have you compressed anything? No you have not
-- the original number, 143, fits in a single byte, while it requires
*two* bytes to deal with 11 and 13 (one byte for 11, and a second byte
for 13).
We can try a different strategy: while 143 requires a full eight bits
(one byte), both 11 and 13 require only four bits (one nybble) each. So
by hard coding our expectation that the result will be two nybbles, we
can avoid expanding the data and end up with exactly zero compression:
143 = binary 10001110 or eight bits in total
11, 13 = binary 1011 1101 or eight bits in total
But while that trick works for 143, it doesn't work for 154 (one byte,
binary 10011010) which has prime factorization 2*7*11 (three nybbles 0010
0111 1011). Even if you can somehow drop the leading zeroes, it requires
nine bits versus eight.
Suppose instead of using bytes, you use decimal digits. 11 and 13 use
four digits, while 143 uses only three. Likewise, 154 has three decimal
digits while 2*7*11 has four digits. In both cases, there is no
compression.
How about recording which prime number it is, instead of the actual value
of the prime? So 154=2*7*11 expands to the 1st, 4th and 5th prime, so
maybe we can record 1 4 5 instead of 2 7 11 and save some bits. (We'll
save more bits the larger the prime.) Of course, to reverse the
compression you need to keep a table of the prime numbers somewhere, and
that's going to be a lot bigger than the space you save...
Now, I accept that, possibly, with sufficient work we might be able to
find some cunning mechanism by which we can compress *any* integer value.
But it won't be the same mechanism for every value! For example, we might
compress (2**1000+1) using a run-length encoding of the binary bits,
while compressing 14629839399572435720381495231 as its prime
factorization (the 319th prime number, the 479th prime, the 499th prime
six times), and 10**1000 as an encoding of the decimal digits. But we
still need some sort of tag to distinguish between these different
compression schemes, and once you take into account the extra information
in the tag, there will be cases where some numbers aren't compressed at
all.
The ability to losslessly compress *everything* is sheer pseudo-
mathematics, up there with finding an exact rational value for pi, or
squaring the circle using only a compass and straight-edge. But the
ability to losslessly compress *useful data* is real, and such algorithms
operate by finding non-random patterns and redundant information in the
data.
--
Steven
More information about the Python-list
mailing list