[Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp

Joachim Wuttke j.wuttke at fz-juelich.de
Wed Apr 14 16:36:25 EDT 2021


If argument fname of savetxt(fname, X, ...) ends with ".gz" then
array X is not only converted to text, but also compressed using gzip.

The gzip format [1] stores a timestamp in its header (the MTIME
field). The Python module gzip [2] sets this timestamp from an
optional constructor argument "mtime". By default, the current time
is used.
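
To illustrate the effect of "mtime", here is a minimal sketch using only the standard library. The helper name gzip_bytes and the sample payload are made up for this example. It compresses the same bytes twice with mtime=0 and reads back the MTIME field, which RFC 1952 places at bytes 4..7 of the header in little-endian order.

```python
import gzip
import io


def gzip_bytes(payload: bytes, mtime=None) -> bytes:
    """Compress payload in memory; mtime=None means 'use the current time'."""
    buf = io.BytesIO()
    # A BytesIO object has no name, so no file name is stored in the header.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


data = b"1.0 2.0\n3.0 4.0\n"

# With a fixed mtime, two runs give byte-identical output.
a = gzip_bytes(data, mtime=0)
b = gzip_bytes(data, mtime=0)

# MTIME occupies bytes 4..7 of the gzip header (little-endian).
mtime_field = int.from_bytes(a[4:8], "little")

print(a == b, mtime_field)
```

With the default mtime=None, two outputs differ whenever the calls fall into different seconds, even though the decompressed content is the same.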

This makes the file written by savetxt(*.gz, ...) non-deterministic.
This is unexpected and confusing in a numerics context.

I let different versions of a program generate *.gz files, and ran
the "diff" util over pairs of output files to check whether any bit
had changed. To my surprise, confusion, and desperation, the output
had always changed, and kept changing when I ran unchanged versions
of my program over and over again. So I learned the hard way that
the *.gz files contain a timestamp.

Regarding the module gzip.py, I submitted a pull request to improve
the description of the optional argument mtime, and to hint at the
possible choice mtime = 0 that makes outputs deterministic [3].

Regarding numpy, I'd propose a bolder measure:
To let savetxt(fname, X, ...) store exactly the same information in
compressed and uncompressed files, always invoke gzip with mtime = 0.
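
Until something like this is done inside numpy, a workaround on the caller's side might look like the sketch below. It is not how savetxt invokes gzip internally; it only relies on savetxt accepting an open file handle. The function name savetxt_gz_deterministic is hypothetical, and the "latin1" encoding is chosen to match the default documented for savetxt.

```python
import gzip
import io

import numpy as np


def savetxt_gz_deterministic(path, X, **kwargs):
    """Write X as gzip-compressed text with the header timestamp fixed to 0."""
    with open(path, "wb") as raw:
        # filename="" keeps the file name out of the gzip header;
        # mtime=0 fixes the timestamp field.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            # savetxt writes text, so wrap the binary gzip stream.
            with io.TextIOWrapper(gz, encoding="latin1", newline="") as text:
                np.savetxt(text, X, **kwargs)
```

Two calls with the same array and the same formatting arguments should then produce byte-identical files, so "diff" reports a change only if the data changed.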

I would like to follow up with a pull request, but I am unable to
find out how numpy.savetxt is invoking gzip.

Joachim

[1] https://www.ietf.org/rfc/rfc1952.txt
[2] https://docs.python.org/3/library/gzip.html
[3] https://github.com/python/cpython/pull/25410
