Problem with tarfile module to open *.tar.gz files - unreliable ?
ahleniusm at gmail.com
Fri Aug 20 13:57:36 CEST 2010
On Aug 20, 5:34 am, Dave Angel <da... at ieee.org> wrote:
> m_ahlenius wrote:
> > Hi,
> > I am relatively new to doing serious work in Python. I am using it to
> > access a large number of log files. Some of the logs get corrupted
> > and I need to detect that when processing them. This code seems to
> > work for quite a few of the logs (all with the same structure). It also
> > correctly identifies some corrupt logs, but then it identifies others
> > as being corrupt when they are not.
> > example error msg from below code:
> > Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
> > Exception: CRC check failed 0x8967e931 != 0x4e5f1036L
> > When I manually examine the supposed corrupt log file and use
> > "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz " on it, it opens
> > just fine.
> > Is there anything wrong with how I am using this module? (extra code
> > removed for clarity)
> > if tarfile.is_tarfile( file ):
> >     try:
> >         xf = tarfile.open( file, "r:gz" )
> >         for locFile in xf:
> >             logfile = xf.extractfile( locFile )
> >             validFileFlag = True
> >             # iterate through each log file, grab the first and the last lines
> >             lines = iter( logfile )
> >             firstLine = lines.next()
> >             for nextLine in lines:
> >                 ....
> >                 continue
> >             logfile.close()
> >             ...
> >         xf.close()
> >     except Exception, e:
> >         validFileFlag = False
> >         msg = "\nCould not open the log file: " + repr(file) + " Exception: " + str(e) + "\n"
> > else:
> >     validFileFlag = False
> >     lTime = extractFileNameTime( file )
> >     msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive \n"
> > print msg
> I haven't used tarfile, but this feels like a problem with the Win/Unix
> line endings. I'm going to assume you're running on Windows, which
> could trigger the problem I'm going to describe.
> You use 'file' to hold something, but don't show us what. In fact, it's
> a lousy name, since it's already a Python builtin. But if it's holding
> a file object that you've separately opened, then you need to change that
> open to use mode 'rb'.
> The problem, if I've guessed right, is that occasionally you'll
> accidentally encounter a 0d0a sequence in the middle of the (binary)
> compressed data. If you're on Windows, and use the default 'r' mode,
> it'll be changed into a 0a byte. Thus corrupting the checksum, and
> eventually the contents.
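
To make the failure mode described above concrete, here is a minimal sketch that simulates the 0d0a -> 0a translation on a gzip stream and then tries to decompress it. It is an illustration only, written for Python 3.8+ rather than the Python 2 used in the original code. The function name survives_newline_translation is made up for this example, and compresslevel=0 is chosen purely so that the payload's 0d0a bytes are guaranteed to show up in the stream; in a normally compressed file they would only turn up by chance.

```python
import gzip
import zlib


def survives_newline_translation(payload):
    """Gzip `payload`, translate CRLF to LF in the raw bytes, try to decompress.

    Hypothetical helper for illustration; returns True only if the original
    payload comes back intact.
    """
    # compresslevel=0 stores the payload verbatim, so any 0d0a pair in it
    # appears unchanged in the stream. mtime=0 keeps the header bytes fixed.
    stream = gzip.compress(payload, compresslevel=0, mtime=0)

    # Simulate what a text-mode read on Windows would do to the raw bytes.
    mangled = stream.replace(b"\r\n", b"\n")

    try:
        return gzip.decompress(mangled) == payload
    except (OSError, EOFError, zlib.error):
        # Covers a CRC mismatch, a truncated stream, or broken deflate data.
        return False


if __name__ == "__main__":
    print(survives_newline_translation(b"first line\r\nlast line\r\n"))
```

The exact exception depends on where the dropped byte lands, which is why the sketch catches several error types instead of only the CRC failure.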
Thanks for the comments - I'll change the variable name.
I am running this on Linux, so I don't think it's a Windows issue. If
that's the case, is the 0d0a still an issue?
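
On Linux, Python 2's open() treats 'r' and 'rb' the same way, so the 0d0a translation should not come into play there. One way to take the open mode out of the picture entirely is to hand the module a file name and let it open the file in binary mode itself. The following is a rough sketch of such a check, assuming Python 3 and assuming each archive member is a plain text log; check_tar_gz is a hypothetical name, not something from the original code.

```python
import gzip
import tarfile
import zlib


def check_tar_gz(path):
    """Return (ok, message) for a .tar.gz file given by name (hypothetical helper)."""
    errors = (tarfile.TarError, OSError, EOFError, zlib.error)
    try:
        # Pass 1: decompress the whole gzip stream in binary mode. Reading
        # to the end makes the gzip module verify the CRC and size trailer.
        with gzip.open(path, "rb") as stream:
            while stream.read(1024 * 1024):
                pass

        # Pass 2: walk the tar members and grab the first and last line.
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                log = archive.extractfile(member)
                first_line = last_line = None
                for line in log:
                    if first_line is None:
                        first_line = line
                    last_line = line
                log.close()
                # first_line / last_line would be processed here.
    except errors as exc:
        return False, "Could not read %r: %s" % (path, exc)
    return True, "ok"
```

The separate first pass is there because, as far as I can tell, tarfile can stop reading once it reaches the end-of-archive marker, in which case the gzip trailer might never be examined. If this check and "tar -xzvof" still disagree about a file, comparing the two passes should at least show whether the complaint comes from the gzip layer or the tar layer.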