Problem with tarfile module to open *.tar.gz files - unreliable ?

Peter Otten __peter__ at web.de
Fri Aug 20 19:55:54 CEST 2010


m_ahlenius wrote:

> I am using Python 2.6.5.
> 
> Unfortunately I don't have other versions installed so it's hard to
> test with a different version.
> 
> As for the log compression, it's a bit hard to test.  Right now I may
> process 100+ of these logs per night, and will get maybe 5 which are
> reported as corrupt (typically a bad CRC) and 2 which it reported as a
> bad tar archive.  This morning I checked each of the 7 reported
> problem files by manually opening them with "tar -xzvof" and they were
> all indeed corrupt. Sigh.

So many corrupted files? I'd say you have to address the problem with your 
infrastructure first.
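
To tell the two failure modes you mention apart (bad CRC versus bad tar
archive), one option is to check the gzip layer on its own before handing the
file to tarfile. A minimal sketch, assuming Python 3 rather than the 2.6.5
used in this thread; the name gzip_layer_ok and the chunk size are made up
for illustration:

```python
import gzip
import zlib


def gzip_layer_ok(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses cleanly.

    Hypothetical helper: it only exercises the compression layer and
    never looks at the tar structure inside.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read to the end so the trailing CRC32 and length get checked.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch (and a missing file),
        # EOFError a truncated stream, zlib.error damaged deflate data.
        return False
    return True
```

If this returns False the file was damaged at or after compression, and
neither tar nor tarfile can be expected to read it reliably.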
 
> Unfortunately due to the nature of our business, I can't post the data
> files online, I hope you can understand.  But I really appreciate your
> suggestions.
> 
> The thing that gets me is that it seems to work just fine for most
> files, but then not others.  Labeling normal files as corrupt hurts us
> as we then skip getting any log data from those files.
> 
> appreciate all your help.

I've written an autocorruption script, 

import sys
import subprocess
import tarfile

def process(source, dest, data):
    # Flip every single bit of the archive in turn and write the result to dest.
    for pos in range(len(data)):
        for bit in range(8):
            new_data = (data[:pos] + chr(ord(data[pos]) ^ (1 << bit))
                        + data[pos+1:])
            assert len(data) == len(new_data)
            out = open(dest, "wb")  # binary mode: the archive is not text
            out.write(new_data)
            out.close()
            try:
                t = tarfile.open(dest)
                for f in t:
                    t.extractfile(f)
            except Exception:
                # tarfile rejected the file; report the flip if tar accepts it.
                if 0 == subprocess.call(["tar", "-xf", dest]):
                    return pos, bit

if __name__ == "__main__":
    source, dest = sys.argv[1:]
    data = open(source, "rb").read()
    print process(source, dest, data)

and I can indeed construct an archive that is rejected by tarfile, but not
by tar. My working hypothesis is that the Python library is a bit stricter
in what it accepts...
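
For anyone who wants to try the same experiment without a real log archive,
here is a rough Python 3 sketch of the bit-flipping idea that works entirely
in memory. It only records which flips tarfile rejects; comparing against the
tar command would still need the subprocess call from the script above. The
helper names (make_tar, tarfile_accepts, rejected_flips) and the sample
payload are invented for the example:

```python
import io
import tarfile


def make_tar(name="hello.txt", payload=b"hello world\n"):
    """Build a small uncompressed tar archive in memory (sample data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as t:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        t.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def tarfile_accepts(data):
    """Return True if tarfile can list and read every member of `data`."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as t:
            for member in t:
                f = t.extractfile(member)
                if f is not None:
                    f.read()
    except Exception:
        # Same broad catch as the original script: any failure is a rejection.
        return False
    return True


def rejected_flips(data, limit=512):
    """Yield (pos, bit) for single-bit flips in the first `limit` bytes
    that make tarfile reject the archive."""
    for pos in range(min(limit, len(data))):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[pos] ^= 1 << bit
            if not tarfile_accepts(bytes(mutated)):
                yield pos, bit


if __name__ == "__main__":
    archive = make_tar()
    print(len(list(rejected_flips(archive))), "flips rejected in the header")
```

The tar header checksum covers only the 512-byte header block, so a flipped
bit in the member data goes unnoticed at the tar level; in a .tar.gz it is
the gzip CRC that catches it.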

Peter