Problem with tarfile module to open *.tar.gz files - unreliable?

m_ahlenius ahleniusm at gmail.com
Fri Aug 20 12:44:29 EDT 2010


On Aug 20, 9:25 am, Peter Otten <__pete... at web.de> wrote:
> m_ahlenius wrote:
> > On Aug 20, 6:57 am, m_ahlenius <ahleni... at gmail.com> wrote:
> >> On Aug 20, 5:34 am, Dave Angel <da... at ieee.org> wrote:
>
> >> > m_ahlenius wrote:
> >> > > Hi,
>
> >> > > I am relatively new to doing serious work in python.  I am using it
> >> > > to access a large number of log files.  Some of the logs get
> >> > > corrupted and I need to detect that when processing them.  This code
> >> > > seems to work for quite a few of the logs (all same structure)  It
> >> > > also correctly identifies some corrupt logs but then it identifies
> >> > > others as being corrupt when they are not.
>
> >> > > example error msg from below code:
>
> >> > > Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
> >> > > Exception: CRC check failed 0x8967e931 != 0x4e5f1036L
>
> >> > > When I manually examine the supposed corrupt log file and use
> >> > > "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz "  on it, it opens
> >> > > just fine.
>
> >> > > Is there anything wrong with how I am using this module?  (extra code
> >> > > removed for clarity)
>
> >> > > if tarfile.is_tarfile( file ):
> >> > >     try:
> >> > >         xf = tarfile.open( file, "r:gz" )
> >> > >         for locFile in xf:
> >> > >             logfile = xf.extractfile( locFile )
> >> > >             validFileFlag = True
> >> > >             # iterate through each log file, grab the first and the last lines
> >> > >             lines = iter( logfile )
> >> > >             firstLine = lines.next()
> >> > >             for nextLine in lines:
> >> > >                 ....
> >> > >                 continue
>
> >> > >             logfile.close()
> >> > >             ...
> >> > >         xf.close()
> >> > >     except Exception, e:
> >> > >         validFileFlag = False
> >> > >         msg = "\nCould not open the log file: " + repr(file) + " Exception: " + str(e) + "\n"
> >> > > else:
> >> > >     validFileFlag = False
> >> > >     lTime = extractFileNameTime( file )
> >> > >     msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive\n"
> >> > >     print msg
>
> >> > I haven't used tarfile, but this feels like a problem with the Win/Unix
> >> > line endings.  I'm going to assume you're running on Windows, which
> >> > could trigger the problem I'm going to describe.
>
> >> > You use 'file' to hold something, but don't show us what.  In fact,
> >> > it's a lousy name, since it's already a Python builtin.  But if it's
> >> > holding a fileobj that you've separately opened, then you need to change
> >> > that open to use mode 'rb'.
>
> >> > The problem, if I've guessed right, is that occasionally you'll
> >> > accidentally encounter a 0d0a sequence in the middle of the (binary)
> >> > compressed data.  If you're on Windows, and use the default 'r' mode,
> >> > it'll be changed into a 0a byte.  Thus corrupting the checksum, and
> >> > eventually the contents.
>
> >> > DaveA
>
> >> Hi,
>
> >> thanks for the comments - I'll change the variable name.
>
> >> I am running this on Linux, so I don't think it's a Windows issue.  If
> >> that's the case, is the 0d0a sequence still an issue?
>
> >> 'mark
>
> > Oh, and as for what's currently stored in the file var: it's just the
> > unopened pathname to the target file I want to open.
>
> Random questions:
>
> What python version are you using?
> If you have other python versions around, do they exhibit the same problem?
> If you extract and compress your data using the external tool, does the
> resulting file make problems in Python, too?
> If so, can you reduce data size and put a small demo online for others to
> experiment with?
>
> Peter

Hi,

I am using Python 2.6.5.

Unfortunately I don't have other versions installed, so it's hard to
test with a different version.
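
Also, just to rule out the text-mode theory from earlier in the thread
(I'm on Linux, where 'r' and 'rb' should read the same bytes), here is a
minimal sketch of handing tarfile an explicitly binary file object
(untested against the real data; the path is just one of the flagged
files from above):

import tarfile

path = "/disk/7-29-04-02-01.console.log.tar.gz"  # one of the flagged logs
fobj = open(path, "rb")                          # binary mode, no newline translation
try:
    tf = tarfile.open(fileobj=fobj, mode="r:gz")
    try:
        print tf.getnames()                      # just list the members
    finally:
        tf.close()
finally:
    fobj.close()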

As for the log compression, it's a bit hard to test.  Right now I may
process 100+ of these logs per night and will get maybe 5 which are
reported as corrupt (typically a bad CRC) and 2 which are reported as
bad tar archives.  This morning I checked each of the 7 reported
problem files by manually opening them with "tar -xzvof" and they were
all indeed corrupt.  Sigh.
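
In case it helps, below is a rough sketch of a Python-side re-check I
could script instead of running tar by hand (the helper name is made up
and I haven't run this against the real logs): it reads every member to
the end and separates a bad tar structure (tarfile.ReadError) from a
gzip-level failure like the CRC error above, which as far as I can tell
the 2.6 gzip module raises as an IOError.

import sys
import tarfile

def check_archive(path):
    """Re-read one .tar.gz completely and classify any failure."""
    try:
        tf = tarfile.open(path, "r:gz")
        try:
            for member in tf:
                data = tf.extractfile(member)
                if data is None:
                    continue          # directories, links etc. have no data
                # read each member to the end so the whole compressed
                # stream actually gets decompressed
                while data.read(64 * 1024):
                    pass
                data.close()
        finally:
            tf.close()
    except tarfile.ReadError, e:
        return "bad tar structure: %s" % e
    except IOError, e:
        return "gzip/IO failure (e.g. CRC): %s" % e
    return "ok"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print name, "->", check_archive(name)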

Unfortunately, due to the nature of our business, I can't post the data
files online; I hope you can understand.  But I really appreciate your
suggestions.

The thing that gets me is that it seems to work just fine for most
files, but not for others.  Labeling normal files as corrupt hurts us,
as we then skip getting any log data from those files.
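
Since the false positives are what hurt, one idea (again just a sketch,
assuming GNU tar is on the PATH; the function name is made up) is to
have the script cross-check any file that tarfile rejects against the
external tar, and only skip the file when both tools reject it:

import subprocess

def external_tar_agrees(path):
    """Return True if the system tar can list the archive without error."""
    devnull = open("/dev/null", "w")
    try:
        # 'tar -tzf' only lists the members, but that still pulls the whole
        # compressed stream through gzip, so a broken archive exits nonzero.
        status = subprocess.call(["tar", "-tzf", path],
                                 stdout=devnull, stderr=devnull)
    finally:
        devnull.close()
    return status == 0

A file that tarfile rejects but tar accepts could then be logged and kept
for closer inspection instead of being dropped from the nightly run.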

appreciate all your help.

'mark



