Problem with tarfile module to open *.tar.gz files - unreliable ?

Dave Angel davea at ieee.org
Fri Aug 20 12:34:27 CEST 2010


m_ahlenius wrote:
> Hi,
>
> I am relatively new to doing serious work in python.  I am using it to
> access a large number of log files.  Some of the logs get corrupted
> and I need to detect that when processing them.  This code seems to
> work for quite a few of the logs (all same structure)  It also
> correctly identifies some corrupt logs but then it identifies others
> as being corrupt when they are not.
>
> example error msg from below code:
>
> Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
> Exception: CRC check failed 0x8967e931 != 0x4e5f1036L
>
> When I manually examine the supposed corrupt log file and use
> "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz "  on it, it opens
> just fine.
>
> Is there anything wrong with how I am using this module?  (extra code
> removed for clarity)
>
>  if tarfile.is_tarfile( file ):
>         try:
>             xf = tarfile.open( file, "r:gz" )
>             for locFile in xf:
>                 logfile = xf.extractfile( locFile )
>                 validFileFlag = True
>                 # iterate through each log file, grab the first and
>                 # the last lines
>                 lines = iter( logfile )
>                 firstLine = lines.next()
>                 for nextLine in lines:
>                     ....
>                         continue
>
>                 logfile.close()
>                  ...
>             xf.close()
>         except Exception, e:
>             validFileFlag = False
>             msg = "\nCould not open the log file: " + repr(file) + \
>                 " Exception: " + str(e) + "\n"
>  else:
>         validFileFlag = False
>         lTime = extractFileNameTime( file )
>         msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive\n"
>         print msg
>
>   
I haven't used tarfile, but this feels like a problem with the Win/Unix 
line endings.  I'm going to assume you're running on Windows, which 
could trigger the problem I'm going to describe.

You use 'file' to hold something, but don't show us what.  In fact, it's 
a lousy name, since it shadows a Python builtin.  But if it's holding a 
file object that you've opened separately, then you need to change that 
open() call to use mode 'rb'.
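
A minimal sketch of that change, written for current Python 3 rather than 
the Python 2 of the quoted code.  The helper name first_and_last_lines and 
its path argument are made up for illustration; the only point is that the 
underlying file is opened with 'rb' before being handed to tarfile.open() 
through fileobj.

```python
import tarfile


def first_and_last_lines(path):
    """Return {member_name: (first_line, last_line)} for a .tar.gz archive.

    Hypothetical helper, not taken from the original poster's code.
    """
    results = {}
    # 'rb' is the important part: the compressed stream must be read as
    # raw bytes, with no newline translation applied by the platform.
    with open(path, "rb") as raw:
        with tarfile.open(fileobj=raw, mode="r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, and so on
                logfile = archive.extractfile(member)
                first = last = None
                for line in logfile:
                    if first is None:
                        first = line
                    last = line
                results[member.name] = (first, last)
    return results
```

The lines come back as bytes, line endings included.  If 'file' is only a 
path string, then as far as I know tarfile opens it in binary mode itself, 
and this explanation would not apply.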

The problem, if I've guessed right, is that you'll occasionally 
encounter a 0d0a sequence in the middle of the (binary) compressed 
data.  If you're on Windows and use the default 'r' mode, each such 
sequence is changed into a single 0a byte, corrupting the checksum and 
eventually the contents.
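
The effect can be imitated on any platform.  This is only a sketch: the 
function names are invented, the bytes.replace() call stands in for what 
Windows text mode does on read, and the pseudo-random payload is there 
solely so that the compressed stream is large enough to contain 0d0a pairs 
by chance.  It needs Python 3.8 or later for the mtime argument.

```python
import gzip
import random
import zlib


def simulate_text_mode(data):
    """Imitate Windows text-mode reading: every 0d0a pair becomes 0a."""
    return data.replace(b"\r\n", b"\n")


def survives_translation(compressed):
    """Return True if the gzip stream still decompresses after translation."""
    try:
        gzip.decompress(simulate_text_mode(compressed))
    except (OSError, EOFError, zlib.error):
        # A CRC or length mismatch, a truncated stream, or invalid deflate
        # data all mean the archive now looks corrupt.
        return False
    return True


# 1 MiB of seeded pseudo-random bytes.  It barely compresses, so the
# compressed stream is also about 1 MiB and will almost certainly contain
# the two-byte sequence 0d0a somewhere.
size = 1 << 20
payload = random.Random(0).getrandbits(8 * size).to_bytes(size, "little")
compressed = gzip.compress(payload, mtime=0)

if __name__ == "__main__":
    print("0d0a pairs in compressed data:", compressed.count(b"\r\n"))
    print("intact stream decompresses:", gzip.decompress(compressed) == payload)
    print("survives translation:", survives_translation(compressed))
```

A stream that happens to contain no 0d0a pair passes through unchanged, 
which would explain why only some of the logs are reported as corrupt.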

DaveA



