Problem with tarfile module to open *.tar.gz files - unreliable ?

Dave Angel davea at ieee.org
Fri Aug 20 10:10:13 EDT 2010


m_ahlenius wrote:
> On Aug 20, 6:57 am, m_ahlenius <ahleni... at gmail.com> wrote:
>   
>> On Aug 20, 5:34 am, Dave Angel <da... at ieee.org> wrote:
>>
>>> m_ahlenius wrote:
>>>       
>>>> Hi,
>>>>         
>>>> I am relatively new to doing serious work in python.  I am using it to
>>>> access a large number of log files.  Some of the logs get corrupted
>>>> and I need to detect that when processing them.  This code seems to
>>>> work for quite a few of the logs (all same structure)  It also
>>>> correctly identifies some corrupt logs but then it identifies others
>>>> as being corrupt when they are not.
>>>>         
>>>> example error msg from below code:
>>>>         
>>>> Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
>>>> Exception: CRC check\
>>>>  failed 0x8967e931 != 0x4e5f1036L
>>>>         
>>>> When I manually examine the supposed corrupt log file and use
>>>> "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz "  on it, it opens
>>>> just fine.
>>>>         
>>>> Is there anything wrong with how I am using this module?  (extra code
>>>> removed for clarity)
>>>>         
>>>>  if tarfile.is_tarfile( file ):
>>>>         try:
>>>>             xf = tarfile.open( file, "r:gz" )
>>>>             for locFile in xf:
>>>>                 logfile = xf.extractfile( locFile )
>>>>                 validFileFlag = True
>>>>                 # iterate through each log file, grab the first and the last lines
>>>>                 lines = iter( logfile )
>>>>                 firstLine = lines.next()
>>>>                 for nextLine in lines:
>>>>                     ....
>>>>                         continue
>>>>         
>>>>                 logfile.close()
>>>>                  ...
>>>>             xf.close()
>>>>         except Exception, e:
>>>>             validFileFlag = False
>>>>             msg = "\nCould not open the log file: " + repr(file) + " Exception: " + str(e) + "\n"
>>>>  else:
>>>>         validFileFlag = False
>>>>         lTime = extractFileNameTime( file )
>>>>         msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive \n"
>>>>         print msg
>>>>         
>>> I haven't used tarfile, but this feels like a problem with the Win/Unix
>>> line endings.  I'm going to assume you're running on Windows, which
>>> could trigger the problem I'm going to describe.
>>>       
>>> You use 'file' to hold something, but don't show us what.  In fact, it's
>>> a lousy name, since it's already a Python builtin.  But if it's holding a
>>> fileobj that you've separately opened, then you need to change that
>>> open to use mode 'rb'.
>>>       
>>> The problem, if I've guessed right, is that occasionally you'll
>>> accidentally encounter a 0d0a sequence in the middle of the (binary)
>>> compressed data.  If you're on Windows, and use the default 'r' mode,
>>> it'll be changed into a 0a byte.  Thus corrupting the checksum, and
>>> eventually the contents.
>>>       
>>> DaveA
>>>       
>> Hi,
>>
>> thanks for the comments - I'll change the variable name.
>>
>> I am running this on Linux so don't think it's a Windows issue.  So if
>> that's the case
>> is the 0d0a still an issue?
>>
>> 'mark
>>     
>
> Oh, and what's currently stored in
> the file var is just the unopened pathname to the
> target file I want to open.
>
>
>   
No, on Linux, there should be no such problem.  And I have to assume 
that if you pass the filename as a string, the library would use 'rb' 
anyway.  It's only a problem if you pass a fileobj AND are on Windows.
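
For the fileobj case, here is a minimal sketch of what opening the archive
in binary mode might look like.  It assumes Python 3 syntax rather than the
Python 2 used earlier in the thread.  The helper name read_tar_gz and the
sample file names are made up for illustration and are not from the
original code.

```python
import os
import tarfile
import tempfile


def read_tar_gz(path):
    """Return {member name: bytes} for every regular file in a .tar.gz.

    Hypothetical helper for illustration; not part of the original post.
    """
    contents = {}
    # Binary mode is required because the archive is compressed binary data.
    # (In Python 2 on Windows, text mode 'r' would silently turn 0d0a into
    # 0a inside the compressed stream, which breaks the gzip CRC check.)
    with open(path, "rb") as raw:
        with tarfile.open(fileobj=raw, mode="r:gz") as archive:
            for member in archive:
                if member.isfile():
                    contents[member.name] = archive.extractfile(member).read()
    return contents


# Build a small sample archive so the helper has something to read.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "sample.console.log")
with open(log_path, "wb") as f:
    f.write(b"first line\r\nlast line\r\n")

archive_path = os.path.join(workdir, "sample.console.log.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(log_path, arcname="sample.console.log")

result = read_tar_gz(archive_path)
print(result["sample.console.log"])
```

Passing the pathname string straight to tarfile.open(), as in the original
code, avoids the question entirely, since the library then opens the file
itself.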

Sorry I wasted your time, but nobody else had answered, and I hoped it 
might help.

DaveA



