Tar for python? Better compressed file archives 'r us?

Drew Csillag drew_csillag at geocities.com
Mon Dec 17 11:43:20 EST 2001


On Mon, Dec 17, 2001 at 05:40:58PM +1100, Richard Jones wrote:
> Does anyone have tar written in python? I've just compared the difference 
> between a .zip and .tgz of the same directory structure, and the sizes are:
> 
> -rw-rw-r--    1 builder  builder   3796376 Dec 17 15:39 zope.zip
> -rw-rw-r--    1 builder  builder   2270562 Dec 17 15:55 zope.tgz
> 
> (the zip is a zope source tree with the C objects built)
> 
> Having looked at the ZipFile source, I gather that zip compresses the stored 
> files individually, whereas gzip'ing tar files will take advantage of the 
> large amount of similarity between the files in the archive. The result being 
> a loss of 1.5Mb of extraneous download :)
> 
> In the meantime, I'm creating the zip file with ZIP_STORED and compressing 
> the result...
> 
> -rw-rw-r--    1 builder  builder   2635321 Dec 17 16:20 zope.zip.gz
> 
> ... strange, it's still bigger than zope.tgz... but it's still much better 
> than the zip file. Can't be read by unzip, but I don't care in this instance.
> 
> Anyone else had any fun in this area? Any ideas why .zip.gz is so much bigger 
> than .tgz?
> 
> 
>     Richard


I've got some code that will read a tar file (it's not pretty, but
it's enough to step through it for now).  It's at the end of this
message.

As to why zip files tend to be larger than tar.gz files: each file in
a zip archive is compressed separately, whereas in a tar.gz the whole
tar file is compressed as a single unit.  Thus the compression in a
tar.gz can exploit redundancies across files in the tarball (at least
up to the size of deflate's 32K sliding window, but I don't want to
get too deep), whereas zip files cannot take advantage of this.
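
A rough sketch of the effect, not a benchmark.  The "files" here are
made up (a shared boilerplate header plus random words from a common
vocabulary), and zlib.compress stands in for what zip does per member
and what gzip does to the whole tarball:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
letters = "abcdefghijklmnopqrstuvwxyz"

# Made-up vocabulary shared by every "file" (an assumption for the demo).
vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 8)))
         for _ in range(200)]

# 50 small files: common boilerplate plus a body drawn from the vocabulary.
boilerplate = b"# Copyright notice and licence text repeated in every file\n" * 4
files = [boilerplate
         + " ".join(rng.choice(vocab) for _ in range(60)).encode("ascii")
         for _ in range(50)]

# zip-style: every file is deflated on its own.
separate = sum(len(zlib.compress(f)) for f in files)

# tar.gz-style: the files are concatenated and deflated as one stream.
together = len(zlib.compress(b"".join(files)))

print("compressed separately:", separate)
print("compressed together:  ", together)
```

The single stream should come out smaller, because each file after the
first can refer back to text the compressor has already seen.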

As to why the .zip.gz is still considerably larger than the tar.gz:
compressed zip files (this holds for .gz files too) don't compress
well, if at all, because *most* of the redundancy has already been
eliminated.  The encoding also gets in the way.  The deflate algorithm
in zlib is a mixture of LZ77 and Huffman coding, and Huffman coding
doesn't generally use full bytes, i.e. it often encodes 1 byte in less
than 8 bits (or else it wouldn't be compressing -- duh).  So the byte
boundaries in the output don't necessarily line up, and IIRC neither
LZ77 nor Huffman coding tries to find redundancies on a sub-byte
level.
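
To see how little a second pass buys, here is a minimal sketch that
deflates some made-up text and then deflates the result again.  The
input (random words from an invented vocabulary) is only an
assumption for the demo; real data will give different numbers:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
letters = "abcdefghijklmnopqrstuvwxyz"

# Invented vocabulary; the text is 20000 words picked from it at random.
vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 8)))
         for _ in range(200)]
raw = " ".join(rng.choice(vocab) for _ in range(20000)).encode("ascii")

once = zlib.compress(raw)    # first pass removes most of the redundancy
twice = zlib.compress(once)  # second pass has almost nothing left to find

print("raw:             ", len(raw))
print("compressed once: ", len(once))
print("compressed twice:", len(twice))
```

The first pass should shrink the text a lot, while the second pass
should leave the size roughly where it was.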


Cheers,
Drew

#------------cut here-----------
import struct

# ustar header layout: 500 bytes of fields, padded to a 512-byte block.
# The version field is 2 bytes, so the format totals 500, not 504.
HEADER_FORMAT = '100s8s8s8s12s12s8s1s100s6s2s32s32s8s8s155s'
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def cvtnulloctal(f):
    # numeric fields are NUL-terminated octal text; return 0 if unparsable
    zi = f.find(b'\0')
    if zi > -1:
        f = f[:zi]
    try:
        return int(f.decode('ascii').strip(), 8)
    except ValueError:
        return 0

def tarstr(f):
    # string fields are NUL-padded; keep only the part before the padding
    zi = f.find(b'\0')
    if zi > -1:
        f = f[:zi]
    return f.decode('utf-8', 'replace')

def readTar(fileObj):
    ecount = 0  # consecutive empty headers seen so far
    while True:
        header = fileObj.read(512)
        if len(header) != 512:
            raise EOFError('Unexpected end of tar stream')

        (name, mode, uid, gid, size, mtime, cksum, typeflag,
         linkname, ustar_p, ustar_vsn, uname, gname, devmaj,
         devmin, prefix) = struct.unpack(HEADER_FORMAT, header[:HEADER_SIZE])

        name, linkname, uname, gname, prefix = map(tarstr, (
            name, linkname, uname, gname, prefix))

        mode, uid, gid, size, mtime, cksum, devmaj, devmin = map(
            cvtnulloctal, (mode, uid, gid, size, mtime, cksum, devmaj, devmin))

        # file data is padded out to a whole number of 512-byte blocks
        blocks_to_read = (size + 511) // 512
        contents = fileObj.read(blocks_to_read * 512)
        contents = contents[:size]

        # two empty headers in a row mark the end of the archive
        if name:
            ecount = 0
        else:
            ecount += 1
        if ecount == 2:
            break

        if name: #null name fields are normal in tar files, so have to check

            #here you would do whatever you wanted with the information
            #in: name, linkname, uname, gname, mode, uid,gid,size,mtime,devmaj
            #devmin, contents
            print(name, size)


if __name__ == '__main__':
    import gzip, sys
    readTar(gzip.GzipFile(sys.argv[1]))
#------------cut here-----------
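
For anyone who wants to check the header layout without a real
archive handy, here is a small sketch.  It assumes a current Python
with the standard library's tarfile module, which is used only to
build a one-member ustar archive in memory.  The member name and
payload are made up.  The header is then picked apart by hand with
struct, using the 500-byte ustar field layout:

```python
import io
import struct
import tarfile

# Build a tiny ustar archive in memory (hypothetical member and payload).
payload = b"hello, world"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

archive = buf.getvalue()
header = archive[:512]  # the first 512-byte block is the member's header

# ustar fields occupy the first 500 bytes of the block.
fields = struct.unpack("100s8s8s8s12s12s8s1s100s6s2s32s32s8s8s155s",
                       header[:500])

name = fields[0].split(b"\0", 1)[0].decode("ascii")           # NUL-padded
size = int(fields[4].split(b"\0", 1)[0].decode("ascii"), 8)   # octal text
magic = fields[9]                                             # "ustar" + NUL

# The file data starts in the block right after the header.
contents = archive[512:512 + size]

print(name, size, magic, contents)
```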




