Compression
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Jul 14 04:16:44 EDT 2016
I thought I'd experiment with some of Python's compression utilities. First I
thought I'd try compressing some extremely non-random data:
py> import codecs
py> data = "something non-random."*1000
py> len(data)
21000
py> len(codecs.encode(data, 'bz2'))
93
py> len(codecs.encode(data, 'zip'))
99
Those are really good results. Both the bz2 and zlib compressors have squeezed
out nearly all of the redundancy in the data.
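(A side note for anyone trying this on Python 3: the 'bz2' and 'zip' codec
shortcuts are a Python 2 convenience; on Python 3 it's simplest to encode the
text to bytes and call the bz2 and zlib modules directly. A rough equivalent,
which should give numbers in the same ballpark:

import bz2
import zlib

data = ("something non-random." * 1000).encode("ascii")
print(len(data))                  # 21000, as above
print(len(bz2.compress(data)))    # should be in the same ballpark as the 93 above
print(len(zlib.compress(data)))   # likewise close to the 99 above

)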
What if we shuffle the data so it is more random?
py> import random
py> data = list(data)
py> random.shuffle(data)
py> data = ''.join(data)
py> len(data); len(codecs.encode(data, 'bz2'))
21000
10494
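So shuffling destroys the repeating 21-character phrase, but it doesn't touch
the character frequencies: the data still contains only fifteen distinct
characters, some of them much more common than others, and bz2's Huffman
entropy-coding stage can still exploit that. A rough order-0 entropy estimate
(just Counter and math, nothing compression-specific) puts the floor at a bit
under 10 KB, which is why bz2 can still roughly halve the data:

import math
from collections import Counter

data = "something non-random." * 1000   # same text as above; shuffling doesn't change the counts
counts = Counter(data)
total = float(len(data))
bits_per_char = -sum((n / total) * math.log(n / total, 2) for n in counts.values())
print(bits_per_char)                    # roughly 3.7 bits per character
print(total * bits_per_char / 8)        # roughly 9700 bytes: the order-0 floor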
How about some really random data?
py> import string
py> data = ''.join(random.choice(string.ascii_letters) for i in range(21000))
py> len(codecs.encode(data, 'bz2'))
15220
That's actually better than I expected: it's found some redundancy and saved
about a quarter of the space, which (as the quick calculation below shows) is
just about the theoretical limit for uniformly random letters. What if we try
compressing data which has already been compressed?
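Here's that quick calculation. A uniformly random choice from 52 letters
carries log2(52) ≈ 5.7 bits of information, but each character is stored in an
8-bit byte, so a bit over a quarter of the space is pure redundancy:

import math

bits_per_letter = math.log(52, 2)       # ≈ 5.7 bits of information per random letter
ideal = 21000 * bits_per_letter / 8     # ≈ 15000 bytes: roughly the best any compressor can do
print(bits_per_letter, ideal)           # bz2's 15220 is within a couple of percent of that

Back to compressing the already-compressed data: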
py> cdata = codecs.encode(data, 'bz2')
py> len(cdata); len(codecs.encode(cdata, 'bz2'))
15220
15688
There's no shrinkage at all; compression has actually increased the size. That
is to be expected: the compressed stream is effectively random bytes with no
redundancy left, so a second pass has nothing to work with and just adds its
own header and bookkeeping overhead.
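The same thing happens with any stream that is already effectively random; for
instance, feeding raw os.urandom() bytes straight to bz2 typically gives back
something slightly bigger than what went in (a quick sketch):

import bz2
import os

blob = os.urandom(15220)          # incompressible random bytes, same size as cdata
print(len(bz2.compress(blob)))    # typically a little larger than 15220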
What if we use some data which is random, but heavily biased?
py> values = string.ascii_letters + ("AAAAAABB")*100
py> data = ''.join(random.choice(values) for i in range(21000))
py> len(data); len(codecs.encode(data, 'bz2'))
21000
5034
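For comparison with the ideal: with that biased alphabet each character carries
roughly 1.4 bits of information, so a perfect coder would need about 3.7 KB;
bz2's 5034 bytes is in the right neighbourhood, though with noticeable
overhead. A rough check, using the same kind of order-0 entropy estimate as
before:

import math
import string
from collections import Counter

values = string.ascii_letters + ("AAAAAABB") * 100
total = float(len(values))
probs = [n / total for n in Counter(values).values()]
bits_per_char = -sum(p * math.log(p, 2) for p in probs)
print(bits_per_char)                  # roughly 1.4 bits per character
print(21000 * bits_per_char / 8)      # roughly 3700 bytes: the ideal size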
So we can see that the bz2 compressor is capable of making use of deviations
from uniformity, but the more random the initial data is, the less effective it
will be.
--
Steve