
Scott Dial wrote:
On 6/30/2010 2:53 PM, Barry Warsaw wrote:
It might be amazing, but it's still a significant overhead. As I've described, multiply that by all the py files in all the distro packages containing Python source code, and then still try to fit it on a CDROM.
I decided to prove to myself that it was not a significant issue to have parallel directory structures in a .tar.bz2, and I was surprised to find it much worse at that than I had imagined. For example,
# cd /usr/lib/python2.6/site-packages
# tar --exclude="*.pyc" --exclude="*.pyo" \
      -cjf mercurial.tar.bz2 mercurial
# du -h mercurial.tar.bz2
640K mercurial.tar.bz2
# cp -a mercurial mercurial2
# tar --exclude="*.pyc" --exclude="*.pyo" \
      -cjf mercurial2.tar.bz2 mercurial mercurial2
# du -h mercurial2.tar.bz2
1.3M mercurial2.tar.bz2
I believe the standard (and largest) block size for .bz2 is 900kB, and I *think* that is uncompressed. Though I know that bz2 can chain, since it can compress all NULL bytes extremely well (multiple GB down to kB, IIRC).

There was a question as to whether LZMA would do better here. I'm using 7zip, but .xz should perform similarly.

$ du -sh mercurial*
2.6M mercurial
2.6M mercurial2
366K mercurial.tar.bz2
734K mercurial2.tar.bz2
303K mercurial.7z
310K mercurial2.7z

So LZMA with the 'normal' compression has a big enough window to find almost all of the redundancy, and 310kB is certainly a very small increase over the 303kB. And clearly bz2 does not, since 734kB is actually slightly more than 2x 366kB.

John =:->
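
For anyone who wants to see the window-size effect without a Mercurial install at hand, here is a minimal Python sketch. It is not the measurement above: it assumes Python 3's standard bz2 and lzma modules rather than tar and 7zip, it uses random bytes as a stand-in for a source tree (my assumption, chosen so that the duplicate copy is the only redundancy available), and the helper names make_payload and duplication_ratio are made up for illustration.

```python
import bz2
import lzma
import os


def make_payload(size):
    """Return `size` random bytes: a hypothetical stand-in for a package tree.

    Random data is incompressible on its own, so any saving on the doubled
    input has to come from the compressor noticing the second copy.
    """
    return os.urandom(size)


def duplication_ratio(compress, payload):
    """Compressed size of two back-to-back copies divided by that of one.

    A ratio near 2.0 means the second copy was stored again in full;
    a ratio near 1.0 means the compressor recognised it as a repeat.
    """
    single = len(compress(payload))
    doubled = len(compress(payload + payload))
    return doubled / single


if __name__ == "__main__":
    # The payload is deliberately larger than bz2's 900 kB block, so the
    # second copy always starts in a different block from the first.
    payload = make_payload(1_500_000)
    print("bz2  doubled/single: %.2f" % duplication_ratio(bz2.compress, payload))
    print("lzma doubled/single: %.2f" % duplication_ratio(lzma.compress, payload))
```

With a payload bigger than one bz2 block, I would expect the bz2 ratio to come out close to 2 and the lzma ratio close to 1, since the default xz preset uses a dictionary of several MB. Real source code compresses far better than random bytes, so the absolute sizes will look nothing like the du numbers above; only the ratios are comparable.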