Problem with tarfile module to open *.tar.gz files - unreliable ?

m_ahlenius ahleniusm at gmail.com
Sat Aug 21 05:07:00 CEST 2010


On Aug 20, 12:55 pm, Peter Otten <__pete... at web.de> wrote:
> m_ahlenius wrote:
> > I am using Python 2.6.5.
>
> > Unfortunately I don't have other versions installed so it's hard to
> > test with a different version.
>
> > As for the log compression, it's a bit hard to test.  Right now I may
> > process 100+ of these logs per night, and will get maybe 5 which are
> > reported as corrupt (typically a bad CRC) and 2 which it reported as a
> > bad tar archive.  This morning I checked each of the 7 reported
> > problem files by manually opening them with "tar -xzvof" and they were
> > all indeed corrupt. Sigh.
>
> So many corrupted files? I'd say you have to address the problem with your
> infrastructure first.
>
> > Unfortunately due to the nature of our business, I can't post the data
> > files online, I hope you can understand.  But I really appreciate your
> > suggestions.
>
> > The thing that gets me is that it seems to work just fine for most
> > files, but then not others.  Labeling normal files as corrupt hurts us
> > as we then skip getting any log data from those files.
>
> > appreciate all your help.
>
> I've written an autocorruption script,
>
> import sys
> import subprocess
> import tarfile
>
> def process(source, dest, data):
>     for pos in range(len(data)):
>         for bit in range(8):
>             # flip a single bit at byte offset pos
>             new_data = (data[:pos] + chr(ord(data[pos]) ^ (1<<bit)) +
>                         data[pos+1:])
>             assert len(data) == len(new_data)
>             # binary mode: the archive is not text
>             out = open(dest, "wb")
>             out.write(new_data)
>             out.close()
>             try:
>                 t = tarfile.open(dest)
>                 for f in t:
>                     t.extractfile(f)
>             except Exception, e:
>                 if 0 == subprocess.call(["tar", "-xf", dest]):
>                     return pos, bit
>
> if __name__ == "__main__":
>     source, dest = sys.argv[1:]
>     data = open(source, "rb").read()
>     print process(source, dest, data)
>
> and I can indeed construct an archive that is rejected by tarfile, but not
> by tar. My working hypothesis is that the python library is a bit stricter
> in what it accepts...
>
> Peter

Thanks - that's cool.

A friend of mine suggested that he's seen similar behaviour when
he uses Perl on these types of files when the OS (Unix) has not
finished writing them.  We have an rsync process which syncs our
servers for these files, and they come down somewhat randomly.  So it's
conceivable, I think, that this process could be trying to open a file
as it's being written.  I know it sounds like a stretch, but my guess is
that it's a possibility.  I could verify that by comparing the timestamps
of the errors in my log with the mod time on the original file.
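
Something like the following is roughly what I have in mind for that check.
It is only a sketch: the helper names (is_stable, written_after,
can_read_archive) and the 2 second wait are my own placeholders, and it is
written against Python 3 rather than the 2.6.5 I'm running, so treat it as an
illustration, not tested production code.

```python
import os
import tarfile
import time
import zlib


def is_stable(path, wait=2.0):
    """Return True if size and mtime did not change during `wait` seconds.

    A file that rsync is still writing should normally change in that window.
    The 2 second default is an arbitrary guess.
    """
    before = os.stat(path)
    time.sleep(wait)
    after = os.stat(path)
    return (before.st_size, before.st_mtime) == (after.st_size, after.st_mtime)


def written_after(path, error_time):
    """Return True if the file was last modified at or after `error_time`.

    `error_time` is the epoch timestamp of the error in the log. True means
    the file was touched after the failed read, i.e. it may have been
    incomplete when it was opened.
    """
    return os.stat(path).st_mtime >= error_time


def can_read_archive(path):
    """Read every regular member to the end so truncation or CRC errors surface."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if member.isfile():
                    f = archive.extractfile(member)
                    while f.read(1 << 16):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

The idea would be to skip (and retry later) any file for which is_stable()
returns False, and to use written_after() on the files already reported as
corrupt to see whether they were still being written at the time.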

'mark
