[AstroPy] DATAMD5 calculation

Wed May 25 10:14:44 EDT 2011

On 05/25/2011 01:25 AM, Ole Streicher wrote:
> On 25.05.2011 00:10, Erik Bray wrote:
>> On 05/24/2011 05:51 PM, Ole Streicher wrote:
>>> On 24.05.2011 23:13, Erik Bray wrote:
>>>> I don't think pyfits has anything built in for handling MD5 sums.  Is
>>>> there some particular standard this relates to, such that it would be a
>>>> good feature to add?
>>>
>>> Maybe. At least, I know this from the ESO pipelines that they add a
>>> DATAMD5 keyword to the primary header. And since this keyword is not
>>> prefixed with "HIERARCH ESO", I guessed that it is some standard.
>>> However, Google did not point me to more information.
>>
>> I'll look into it...
>
> It seems to come from the "qfits" package. There is a program "fitsmd5"
> from this library which seems to compute it.
>>
>>>> At any rate, in the meantime you can use hashlib to generate a checksum
>>>> on hdu.data--no need to use any internal attributes:
>>>
>>> This is unfortunately not compatible with the files I already have (from
>>> an ESO pipeline).
>>
>> Strange...  could it be that they're using the header plus the data for
>> the MD5 sum?
>
> The code I found hashes, for every HDU, the bytes from hdu._datLoc to
> hdu._datLoc + hdu._datSpan. Since hdu._datSpan is different from
> hdu.data.nbytes (they differ by 832 bytes), the results must be different.
>
> Unfortunately, I don't know the meaning of _datLoc and _datSpan with
> respect to the data, so I don't know how to reconstruct the data without
> a file.
>
> Best
>
> Ole

I took a look at the fitsmd5 utility, and as you said it's the MD5 sum
of all the data sections.  Where PyFITS is concerned, _datLoc is just
the offset within the file where that HDU's data section begins, and
_datSpan is the length of the data section.  Since FITS files are
organized in 2880-byte blocks, there can be some padding at the end of
the data, which is why _datSpan may not be exactly the same as
hdu.data.nbytes.
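
To make the block arithmetic concrete, here is a small sketch in plain
Python.  The helper name padded_span and the 2048-byte array size are
made up for illustration (the thread does not say how large the actual
array was); the size is chosen only so that the difference comes out at
the 832 bytes mentioned above.

```python
FITS_BLOCK_SIZE = 2880  # FITS files are organized in 2880-byte blocks


def padded_span(nbytes, block_size=FITS_BLOCK_SIZE):
    """Return nbytes rounded up to a whole number of blocks.

    Hypothetical helper, not part of PyFITS.
    """
    # -nbytes % block_size is the number of fill bytes needed; it is 0
    # when nbytes is already a multiple of block_size.
    return nbytes + (-nbytes % block_size)


# Hypothetical array size, chosen so the padding is 832 bytes.
nbytes = 2048
print(padded_span(nbytes) - nbytes)  # prints 832
```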

But it looks like fitsmd5 includes that padding in the sum.  The padding 
is just null bytes, so you can still add that yourself.  For example:

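import hashlib
FITS_BLOCK_SIZE = 2880
md5sum = hashlib.md5()  # create once, before looping over the HDUs
md5sum.update(hdu.data)
# Null padding up to the next block boundary (none if already aligned)
pad = b'\0' * (-hdu.data.nbytes % FITS_BLOCK_SIZE)
md5sum.update(pad)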

And do that for each HDU.  I think that would do it.
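
Put together for several HDUs, the idea might look like the sketch
below.  It works on plain byte strings so that it runs without PyFITS;
the function name data_md5 is hypothetical, and I have not checked the
result against fitsmd5 or an ESO file, so treat it as an illustration of
the padding logic rather than a verified DATAMD5 implementation.

```python
import hashlib

FITS_BLOCK_SIZE = 2880


def data_md5(data_sections, block_size=FITS_BLOCK_SIZE):
    """MD5 over several data sections, each null-padded to a full block.

    data_sections is an iterable of bytes objects, one per HDU
    (hypothetical interface, not a PyFITS API).
    """
    md5sum = hashlib.md5()  # one running sum shared by all sections
    for raw in data_sections:
        md5sum.update(raw)
        # Fill up to the next block boundary; adds nothing if the
        # section already ends on a boundary (or is empty).
        md5sum.update(b"\0" * (-len(raw) % block_size))
    return md5sum.hexdigest()


# Example with made-up data: a short section and an exactly aligned one.
print(data_md5([b"abc", b"\x01" * FITS_BLOCK_SIZE]))
```

With PyFITS you would pass the raw bytes of each hdu.data array.  This
assumes the arrays are still in the byte order and unscaled form in
which they are stored on disk, otherwise the sum will not match the
file.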

Erik