gzip.py: allow deterministic compression (without time stamp)
gzip compression, using class GzipFile from gzip.py, by default inserts a timestamp into the compressed stream. If the optional argument `mtime` is absent or None, then the current time is used [1].

This makes outputs non-deterministic, which can badly confuse unsuspecting users: If you run "diff" over two outputs to see whether they are unaffected by changes in your application, then you would not expect that the *.gz binaries differ just because they were created at different times.

I'd propose to introduce a new constant `NO_TIMESTAMP` as a possible value of `mtime`.

Furthermore, if policy about API changes allows, I'd suggest that `NO_TIMESTAMP` become the new default value for `mtime`.

How to proceed from here? Is this the kind of proposal that has to go through a PEP?

- Joachim

[1] https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c...
Hi,

gzip.NO_TIMESTAMP sounds like a good idea. But I'm not sure about changing the default behavior. I would prefer to leave it unchanged.

I guess that your problem is that you don't access gzip directly, but use a higher-level API which doesn't give access to the timestamp parameter, like the tarfile module?

If your use case is reproducible builds, you may follow the py_compile behavior: the default depends on whether the SOURCE_DATE_EPOCH environment variable is set:

    def _get_default_invalidation_mode():
        if os.environ.get('SOURCE_DATE_EPOCH'):
            return PycInvalidationMode.CHECKED_HASH
        else:
            return PycInvalidationMode.TIMESTAMP

Victor

On Wed, Apr 14, 2021 at 6:34 PM Joachim Wuttke <j.wuttke@fz-juelich.de> wrote:
> gzip compression, using class GzipFile from gzip.py, by default inserts a timestamp into the compressed stream. If the optional argument `mtime` is absent or None, then the current time is used [1].
> This makes outputs non-deterministic, which can badly confuse unsuspecting users: If you run "diff" over two outputs to see whether they are unaffected by changes in your application, then you would not expect that the *.gz binaries differ just because they were created at different times.
> I'd propose to introduce a new constant `NO_TIMESTAMP` as a possible value of `mtime`.
> Furthermore, if policy about API changes allows, I'd suggest that `NO_TIMESTAMP` become the new default value for `mtime`.
> How to proceed from here? Is this the kind of proposal that has to go through a PEP?
> - Joachim
> [1] https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c...
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OTUGLATL... Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death.
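For the tarfile case raised in this message, one possible workaround is to create the GzipFile yourself and hand it to tarfile, so that `mtime` stays under the caller's control. The following is only a minimal sketch under stated assumptions: the helper name `build_tar_gz` and its arguments are made up for illustration, and it pins both the gzip header time and each member's `TarInfo.mtime` to one fixed value.

```python
import gzip
import io
import tarfile


def build_tar_gz(files, mtime=0):
    """Hypothetical helper: pack {name: bytes} into a .tar.gz held in memory.

    Every timestamp (gzip header and tar members) is pinned to `mtime`.
    """
    out = io.BytesIO()
    # filename="" keeps the optional file name field out of the gzip header.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", mtime=mtime) as gz:
        # mode="w" writes an uncompressed tar stream into the GzipFile.
        with tarfile.open(fileobj=gz, mode="w") as tar:
            # Sort names so the member order does not depend on dict order.
            for name in sorted(files):
                data = files[name]
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                info.mtime = mtime
                tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


archive = build_tar_gz({"a.txt": b"one", "b.txt": b"two"})
```

Building the members from explicit `TarInfo` objects, rather than with `tar.add()`, also avoids picking up owner, mode and modification time from the file system, which would be another source of differences between runs.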
The gzip specification [1] makes clear that the mtime field is always present. The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970. MTIME = 0 means no time stamp is available. Hence no need for a new constant NO_TIMESTAMP.

So this is primarily a documentation problem [2]. For this, I will create a pull request to gzip.py.

Joachim

[1] https://www.ietf.org/rfc/rfc1952.txt
[2] https://discuss.python.org/t/gzip-py-allow-deterministic-compression-without...
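To make the point about MTIME = 0 concrete, here is a small sketch that writes the same payload twice with `mtime=0` and reads the MTIME field back from the header. The helper names `gzip_bytes` and `header_mtime` are hypothetical; the header layout (bytes 4 to 7 hold MTIME, little-endian) is taken from RFC 1952.

```python
import gzip
import io
import struct


def gzip_bytes(payload, mtime=None):
    """Compress `payload` in memory; mtime=None means 'use the current time'."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(payload)
    return buf.getvalue()


def header_mtime(blob):
    """Return the MTIME field (bytes 4..7, little-endian) of a gzip stream."""
    return struct.unpack("<I", blob[4:8])[0]


payload = b"hello, gzip\n" * 100
first = gzip_bytes(payload, mtime=0)
second = gzip_bytes(payload, mtime=0)
stamped = gzip_bytes(payload, mtime=1618400000)

print(first == second)        # True: byte-identical output
print(header_mtime(first))    # 0, i.e. "no time stamp is available"
print(header_mtime(stamped))  # 1618400000
```

With everything else held equal, the two kinds of output differ only in those four header bytes, which is why a binary "diff" reports a change even though the compressed data is the same.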
On Wed, 2021-04-14 at 18:06 +0000, j.wuttke@fz-juelich.de wrote:
> The gzip specification [1] makes clear that the mtime field is always present. The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970. MTIME = 0 means no time stamp is available. Hence no need for a new constant NO_TIMESTAMP.
> So this is primarily a documentation problem [2]. For this, I will create a pull request to gzip.py.

I think having an extra constant (equal to 0) wouldn't hurt and could make the code a bit more explicit.

--
Best regards,
Michał Górny
If so, then a better name than NO_TIMESTAMP should be chosen, as the gzip specification does not allow for no timestamp.
DEFAULT_TIMESTAMP?

Kind regards,
Steve

On Wed, Apr 14, 2021 at 8:03 PM <j.wuttke@fz-juelich.de> wrote:
> If so, then a better name than NO_TIMESTAMP should be chosen, as the gzip specification does not allow for no timestamp.
On Wed, 14 Apr 2021 21:38:11 +0100 Steve Holden <steve@holdenweb.com> wrote:
> DEFAULT_TIMESTAMP?

It's not a default timestamp, it's a placeholder value meaning "no timestamp". The aforementioned RFC 1952 explicitly says: "MTIME = 0 means no time stamp is available". So yes, it really means "no timestamp", regardless of the fact that it's encoded as integer value 0.

Regards

Antoine.
If gzip is modified to use the SOURCE_DATE_EPOCH timestamp, you get a reproducible binary and you don't need to add a new constant ;-)

SOURCE_DATE_EPOCH is a timestamp: the number of seconds since the Unix epoch (January 1, 1970 at 00:00).

Victor

On Wed, Apr 14, 2021 at 8:15 PM <j.wuttke@fz-juelich.de> wrote:
> The gzip specification [1] makes clear that the mtime field is always present. The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970. MTIME = 0 means no time stamp is available. Hence no need for a new constant NO_TIMESTAMP.
> So this is primarily a documentation problem [2]. For this, I will create a pull request to gzip.py.
> Joachim
> [1] https://www.ietf.org/rfc/rfc1952.txt
> [2] https://discuss.python.org/t/gzip-py-allow-deterministic-compression-without...
On Thu, 15 Apr 2021 11:28:03 +0200 Victor Stinner <vstinner@python.org> wrote:
> If gzip is modified to use the SOURCE_DATE_EPOCH timestamp, you get a reproducible binary and you don't need to add a new constant ;-) SOURCE_DATE_EPOCH is a timestamp: the number of seconds since the Unix epoch (January 1, 1970 at 00:00).

Changing the behaviour of a stdlib module based on an environment variable sounds a bit undesirable. That behaviour can be implemented at a higher level in application code (for example the tarfile or zipfile command line).

Regards

Antoine.
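Handling this at a higher level, as suggested here, could look roughly like the following in application code. This is a sketch only: `mtime_from_environment` and `compress_for_build` are hypothetical names, not part of the stdlib, and it assumes Python 3.8 or later, where `gzip.compress` accepts an `mtime` keyword.

```python
import gzip
import os


def mtime_from_environment(environ=os.environ):
    """Hypothetical helper: SOURCE_DATE_EPOCH as an int, or None if unset."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if not value:
        return None  # gzip then falls back to the current time
    return int(value)


def compress_for_build(data, environ=os.environ):
    """Compress `data`, pinning the gzip timestamp when the build asks for it."""
    return gzip.compress(data, mtime=mtime_from_environment(environ))


# A build script would normally rely on the real environment; a plain dict
# is passed here only to keep the example self-contained.
blob = compress_for_build(b"payload", {"SOURCE_DATE_EPOCH": "1618400000"})
```

The gzip module itself stays unaware of the environment variable; only the build-related caller decides whether to honour it.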
SOURCE_DATE_EPOCH is not a random variable, but is a *standardised* environment variable:
https://reproducible-builds.org/docs/source-date-epoch/

This page explains the rationale. See the “Lying about the time” / “violates language spec” section ;-)

More and more projects have adopted it. As I wrote, the Python stdlib already uses it in the compileall and py_compile modules.

Victor

On Thu, Apr 15, 2021 at 12:34 PM Antoine Pitrou <antoine@python.org> wrote:
> On Thu, 15 Apr 2021 11:28:03 +0200 Victor Stinner <vstinner@python.org> wrote:
>> If gzip is modified to use the SOURCE_DATE_EPOCH timestamp, you get a reproducible binary and you don't need to add a new constant ;-) SOURCE_DATE_EPOCH is a timestamp: the number of seconds since the Unix epoch (January 1, 1970 at 00:00).
> Changing the behaviour of a stdlib module based on an environment variable sounds a bit undesirable. That behaviour can be implemented at a higher level in application code (for example the tarfile or zipfile command line).
> Regards
> Antoine.
On Thu, 15 Apr 2021 14:32:05 +0200 Victor Stinner <vstinner@python.org> wrote:
> SOURCE_DATE_EPOCH is not a random variable, but is a *standardised* environment variable: https://reproducible-builds.org/docs/source-date-epoch/

Standardized by whom? This is neither a POSIX nor a Windows standard, at least. Just because a Web page claims it is standardized doesn't mean that it is.

> More and more projects have adopted it. As I wrote, the Python stdlib already uses it in the compileall and py_compile modules.

Those are higher-level modules. Doing it in the gzip module directly sounds like the wrong place.

Regards

Antoine.
On 15 Apr 2021, at 14:48, Antoine Pitrou <antoine@python.org> wrote:
> On Thu, 15 Apr 2021 14:32:05 +0200 Victor Stinner <vstinner@python.org> wrote:
>> SOURCE_DATE_EPOCH is not a random variable, but is a *standardised* environment variable: https://reproducible-builds.org/docs/source-date-epoch/
> Standardized by whom? This is neither a POSIX nor a Windows standard, at least. Just because a Web page claims it is standardized doesn't mean that it is.
>> More and more projects have adopted it. As I wrote, the Python stdlib already uses it in the compileall and py_compile modules.
> Those are higher-level modules. Doing it in the gzip module directly sounds like the wrong place.

I agree. According to the documentation, this variable is meant to be used by build tools to accomplish reproducible builds. This should IMHO not affect lower-level APIs and libraries that aren’t build-related.

Ronald

--
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
On Wed, Apr 14, 2021 at 5:00 AM Joachim Wuttke <j.wuttke@fz-juelich.de> wrote:
> gzip compression, using class GzipFile from gzip.py, by default inserts a timestamp into the compressed stream. If the optional argument `mtime` is absent or None, then the current time is used [1].
> This makes outputs non-deterministic, which can badly confuse unsuspecting users: If you run "diff" over two outputs to see whether they are unaffected by changes in your application, then you would not expect that the *.gz binaries differ just because they were created at different times.
> I'd propose to introduce a new constant `NO_TIMESTAMP` as a possible value of `mtime`.
> Furthermore, if policy about API changes allows, I'd suggest that `NO_TIMESTAMP` become the new default value for `mtime`.
> How to proceed from here? Is this the kind of proposal that has to go through a PEP?

For something like this you would open an issue and see if a core developer is intrigued enough to work with you to see the change occur; no PEP is necessary.

-Brett

> - Joachim
> [1] https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c...
participants (9)
- Antoine Pitrou
- Brett Cannon
- j.wuttke@fz-juelich.de
- Joachim Wuttke
- Michał Górny
- Ronald Oussoren
- Steve Holden
- Terry Reedy
- Victor Stinner