Mailman 3 PEP 552: deterministic pycs - Python-Dev

PEP 552: deterministic pycs

Benjamin Peterson

7 Sep 2017

8:39 p.m.

Hello,

I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Thanks for reading,
Benjamin

P.S. I came up with the idea for this PEP while awake.

Antoine Pitrou

7 Sep

9 p.m.

On Thu, 07 Sep 2017 13:39:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...

Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Why isn't https://github.com/python/cpython/pull/296 a good enough solution to this problem? It has a simple implementation, and requires neither maintaining two different pyc formats nor reading the entire source file to check whether the pyc file is up to date. Regards Antoine.

Benjamin Peterson

9:08 p.m.

On Thu, Sep 7, 2017, at 14:00, Antoine Pitrou wrote:

...

On Thu, 07 Sep 2017 13:39:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Why isn't https://github.com/python/cpython/pull/296 a good enough solution to this problem? It has a simple implementation, and requires neither maintaining two different pyc formats nor reading the entire source file to check whether the pyc file is up to date.

The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees. This proposal also allows reproducible builds even if the files are being modified in an edit-run-tests cycle.

Freddy Rietdijk

9:19 p.m.

...

The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees.

Why not set the source timestamps of the source trees to, say, 1 first? That's what is done with the Nix package manager. The Python interpreter is patched (mostly similar to the referenced PR) and checks whether SOURCE_DATE_EPOCH is set, and if so, sets the mtime to 1.

On Thu, Sep 7, 2017 at 11:08 PM, Benjamin Peterson <benjamin@python.org> wrote:

...

On Thu, Sep 7, 2017, at 14:00, Antoine Pitrou wrote:

...
On Thu, 07 Sep 2017 13:39:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Why isn't https://github.com/python/cpython/pull/296 a good enough solution to this problem? It has a simple implementation, and requires neither maintaining two different pyc formats nor reading the entire source file to check whether the pyc file is up to date.

The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees. This proposal also allows reproducible builds even if the files are being modified in an edit-run-tests cycle. _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ freddyrietdijk%40fridh.nl
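A rough sketch of the approach Freddy describes, not the actual Nix patch: when `SOURCE_DATE_EPOCH` is present in the environment, the timestamp recorded for a source file is pinned to 1 instead of being read from the filesystem. The function name `recorded_mtime` and its `environ` parameter are hypothetical.

```python
import os


def recorded_mtime(source_path, environ=os.environ):
    """Return the timestamp a patched interpreter would store in the pyc.

    Hypothetical helper: if SOURCE_DATE_EPOCH is set, pin the value to 1,
    otherwise fall back to the real modification time of the source file.
    """
    if "SOURCE_DATE_EPOCH" in environ:
        return 1
    return int(os.stat(source_path).st_mtime)
```

For the stored value to match at import time, the source files themselves would also need an mtime of 1.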

Benjamin Peterson

9:25 p.m.

On Thu, Sep 7, 2017, at 14:19, Freddy Rietdijk wrote:

...

...
The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees.

Why not set the source timestamps of the source trees to say 1 first?

If the source tree is read-only (because you don't want your build system to modify source files on principle), you cannot do that.

Antoine Pitrou

9:21 p.m.

On Thu, 07 Sep 2017 14:08:58 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...

On Thu, Sep 7, 2017, at 14:00, Antoine Pitrou wrote:

...
On Thu, 07 Sep 2017 13:39:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Why isn't https://github.com/python/cpython/pull/296 a good enough solution to this problem? It has a simple implementation, and requires neither maintaining two different pyc formats nor reading the entire source file to check whether the pyc file is up to date.

The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees.

Not sure how common that situation is (certainly the source tree wasn't read-only when you checked it out or untar'ed it), but isn't it easily circumvented by copying the source tree before building?

...

This proposal also allows reproducible builds even if the files are being modified in an edit-run-tests cycle.

I don't follow you here. Could you elaborate? Thanks Antoine.

Benjamin Peterson

9:32 p.m.

On Thu, Sep 7, 2017, at 14:21, Antoine Pitrou wrote:

...

On Thu, 07 Sep 2017 14:08:58 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
On Thu, Sep 7, 2017, at 14:00, Antoine Pitrou wrote:

...
On Thu, 07 Sep 2017 13:39:21 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Why isn't https://github.com/python/cpython/pull/296 a good enough solution to this problem? It has a simple implementation, and requires neither maintaining two different pyc formats nor reading the entire source file to check whether the pyc file is up to date.

The main objection to that model is that it requires modifying source timestamps, which isn't possible for builds on read-only source trees.

Not sure how common that situation is (certainly the source tree wasn't read-only when you checked it out or untar'ed it), but isn't it easily circumvented by copying the source tree before building?

Well, yes, in these kinds of "batch" build situations, copying is probably fine. However, I want to be able to have pyc determinism even when developing. Copying the entire source every time I change something isn't nice.

...

...
This proposal also allows reproducible builds even if the files are being modified in an edit-run-tests cycle.

I don't follow you here. Could you elaborate?

If you require source timestamps to be fixed and deterministic, Python won't notice when a file is modified. The larger point is that while the SOURCE_DATE_EPOCH patch will likely work for Linux distributions, I'm interested in being able to have deterministic pycs in "normal" Python development workflows.
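To make the point about fixed timestamps concrete, here is a minimal sketch comparing the two freshness checks. The helpers `fresh_by_timestamp`, `fresh_by_hash` and `demo` are hypothetical names; `importlib.util.source_hash` is the function that eventually shipped with this PEP in Python 3.7 and is assumed to be available.

```python
import importlib.util
import os
import tempfile


def fresh_by_timestamp(path, recorded_mtime, recorded_size):
    """Timestamp-style check: only stat() the source file."""
    st = os.stat(path)
    return int(st.st_mtime) == recorded_mtime and st.st_size == recorded_size


def fresh_by_hash(path, recorded_hash):
    """Hash-style check: read the whole source file and hash it."""
    with open(path, "rb") as f:
        return importlib.util.source_hash(f.read()) == recorded_hash


def demo():
    """Edit a file whose mtime is pinned and report what each check sees."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "wb") as f:
            f.write(b"x = 1\n")
        os.utime(path, (1, 1))  # pin the mtime, as a fixed-timestamp build would
        recorded_hash = importlib.util.source_hash(b"x = 1\n")

        with open(path, "wb") as f:
            f.write(b"x = 2\n")  # same length, different content
        os.utime(path, (1, 1))  # the mtime is pinned again after the edit

        return fresh_by_timestamp(path, 1, 6), fresh_by_hash(path, recorded_hash)
```

Under these assumptions `demo()` returns `(True, False)`: the stat-based check still accepts the stale pyc, while the hash-based check notices the edit.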

Antoine Pitrou

9:54 p.m.

On Thu, 07 Sep 2017 14:32:19 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...

...
Not sure how common that situation is (certainly the source tree wasn't read-only when you checked it out or untar'ed it), but isn't it easily circumvented by copying the source tree before building?

Well, yes, in these kinds of "batch" build situations, copying is probably fine. However, I want to be able to have pyc determinism even when developing. Copying the entire source every time I change something isn't nice.

Hmm... Are you developing from a read-only source tree?

...

The larger point is that while the SOURCE_DATE_EPOCH patch will likely work for Linux distributions, I'm interested in being able to have deterministic pycs in "normal" Python development workflows.

That's an interesting idea, but is there a concrete motivation or is it platonical? After all, if you're changing something in the source tree it's expected that the overall "signature" of the build will be modified too. Regards Antoine.

Benjamin Peterson

10:44 p.m.

On Thu, Sep 7, 2017, at 14:54, Antoine Pitrou wrote:

...

On Thu, 07 Sep 2017 14:32:19 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...
...
Not sure how common that situation is (certainly the source tree wasn't read-only when you checked it out or untar'ed it), but isn't it easily circumvented by copying the source tree before building?

Well, yes, in these kinds of "batch" build situations, copying is probably fine. However, I want to be able to have pyc determinism even when developing. Copying the entire source every time I change something isn't nice.

Hmm... Are you developing from a read-only source tree?

No, but the build system is building from one (at least conceptually).

...

...
The larger point is that while the SOURCE_DATE_EPOCH patch will likely work for Linux distributions, I'm interested in being able to have deterministic pycs in "normal" Python development workflows.

That's an interesting idea, but is there a concrete motivation or is it platonical? After all, if you're changing something in the source tree it's expected that the overall "signature" of the build will be modified too.

Yes, I have used Bazel to build pycs. Having pycs be deterministic allows interesting build system optimizations like Bazel distributed caching to work well for Python.
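A toy illustration of why deterministic outputs help a content-addressed cache; this is not Bazel's API. The class `ActionCache` and the choice of keying on a SHA-256 digest of the inputs are assumptions made for the example.

```python
import hashlib


class ActionCache:
    """Toy content-addressed cache: outputs are looked up by a digest of inputs."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def build(self, source_bytes, compile_fn):
        key = hashlib.sha256(source_bytes).hexdigest()
        if key in self._store:
            self.hits += 1  # same inputs: reuse the stored output
        else:
            self._store[key] = compile_fn(source_bytes)
        return self._store[key]
```

Reusing a stored output is only sound if `compile_fn` would have produced the same bytes again. A pyc that embeds the source mtime breaks that property, because the output then depends on something other than the content of the inputs.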

Guido van Rossum

9:19 p.m.

Nice one.

It would be nice to specify the various APIs needed as well.

Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

On Thu, Sep 7, 2017 at 1:39 PM, Benjamin Peterson <benjamin@python.org> wrote:

...

Hello, I've written a short PEP about an import extension to allow pycs to be more deterministic by optionally replacing the timestamp with a hash of the source file: https://www.python.org/dev/peps/pep-0552/

Thanks for reading, Benjamin

P.S. I came up with the idea for this PEP while awake.

-- --Guido van Rossum (python.org/~guido)

Benjamin Peterson

9:40 p.m.

On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:

...

Nice one.

It would be nice to specify the various APIs needed as well.

The compileall and py_compile ones?

...

Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

I haven't actually measured anything, but statting a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.

My goal in this PEP was to preserve the current pyc invalidation behavior, which works well today for many use cases, as the default. The hash-based pycs are reserved for distribution and other power use cases.

Guido van Rossum

9:43 p.m.

On Thu, Sep 7, 2017 at 2:40 PM, Benjamin Peterson <benjamin@python.org> wrote:

...

On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:

...
Nice one.

It would be nice to specify the various APIs needed as well.

The compileall and py_compile ones?

Yes, and the SipHash mod to specify the key you mentioned.

...

...
Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

I haven't actually measured anything, but statting a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.

My goal in this PEP was to preserve the current pyc invalidation behavior, which works well today for many use cases, as the default. The hash-based pycs are reserved for distribution and other power use cases.

OK, maybe you can clarify that a bit in the PEP. -- --Guido van Rossum (python.org/~guido)

Benjamin Peterson

10:46 p.m.

On Thu, Sep 7, 2017, at 14:43, Guido van Rossum wrote:

...

On Thu, Sep 7, 2017 at 2:40 PM, Benjamin Peterson <benjamin@python.org> wrote:

...
On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:

...
Nice one.

It would be nice to specify the various APIs needed as well.

The compileall and py_compile ones?

Yes, and the SipHash mod to specify the key you mentioned.

Done.

...

...
...
Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

I haven't actually measured anything, but statting a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.

My goal in this PEP was to preserve the current pyc invalidation behavior, which works well today for many use cases, as the default. The hash-based pycs are reserved for distribution and other power use cases.

OK, maybe you can clarify that a bit in the PEP.

I've added a paragraph to the Rationale section.

Gregory P. Smith

11:58 p.m.

+1 on this PEP.

The TL;DR summary of this PEP: The pyc date+length metadata check was a convenient hack. It still works well for many people and use cases, it isn't going away. PEP 552 proposes a new alternate hack that relies on file contents instead of os and filesystem date metadata. Assumption: The hash function is significantly faster than re-parsing the source. (guaranteed to be true)

Questions:

Input from OS package distributors would be interesting. Would they use this? Which way would it impact their startup time (loading the .py file vs just statting it. does that even matter? source files are often eventually loaded for linecache use in tracebacks anyways)?

Would they benefit from a pyc that can contain _both_ timestamp+length, and the source_hash? if both were present, I assume that only one would be checked at startup. i'm not sure what would make the decision of what to check. one fails, check the other? i personally do not have a use for this case so i'd omit the complexity without a demonstrated need.

Something to also state in the PEP:

This is intentionally not a "secure" hash. Security is explicitly a non-goal.

Rationale behind my support:

We use a superset of Bazel at Google (unsurprising) and have had to jump through a lot of messy hoops to deal with timestamp metadata winding up in output files vs deterministic builds. What Benjamin describes here sounds exactly like what we would want. It allows deterministic builds in distributed build and cached operation systems where timestamps are never going to be guaranteed. It allows the check to work on filesystems which do not preserve timestamps. Also importantly, it allows the check to be disabled via the check_source bit. Today we use a modified importer at work that skips checking timestamps anyways as the way we ship applications where the entire set of dependencies present is already guaranteed at build time to be correct and being modified at runtime is not possible or not a concern. This PEP would avoid the need for an extra importer or modified interpreter logic to make this happen.

-G

On Thu, Sep 7, 2017 at 3:47 PM Benjamin Peterson <benjamin@python.org> wrote:

...

On Thu, Sep 7, 2017, at 14:43, Guido van Rossum wrote:

...
On Thu, Sep 7, 2017 at 2:40 PM, Benjamin Peterson <benjamin@python.org> wrote:

...
On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:

...
Nice one.

It would be nice to specify the various APIs needed as well.

The compileall and py_compile ones?

Yes, and the SipHash mod to specify the key you mentioned.

Done.

...
...
...
Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

I haven't actually measured anything, but statting a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.

My goal in this PEP was to preserve the current pyc invalidation behavior, which works well today for many use cases, as the default. The hash-based pycs are reserved for distribution and other power use cases.

OK, maybe you can clarify that a bit in the PEP.

I've added a paragraph to the Rationale section.

Nick Coghlan

8 Sep

1:47 a.m.

On 7 September 2017 at 16:58, Gregory P. Smith <greg@krypto.org> wrote:

...

+1 on this PEP.

The TL;DR summary of this PEP: The pyc date+length metadata check was a convenient hack. It still works well for many people and use cases, it isn't going away. PEP 552 proposes a new alternate hack that relies on file contents instead of os and filesystem date metadata. Assumption: The hash function is significantly faster than re-parsing the source. (guaranteed to be true)

Questions:

Input from OS package distributors would be interesting. Would they use this? Which way would it impact their startup time (loading the .py file vs just statting it. does that even matter? source files are often eventually loaded for linecache use in tracebacks anyways)?

Christian and I asked some of our security folks for their personal wishlists recently, and one of the items that came up was "The recompile is based on a timestamp. How do you know the pyc file on disk really is related to the py file that is human readable? Can it be based on a hash or something like that?"

This is a restating of the reproducible build use case: for a given version of Python, a given source file should always give the same source hash and marshaled code object, and once it does, it's easier to do an independent compilation from the source file and check you get the same answer.

While you can implement that for timestamp based formats by adjusting input file metadata (and that's exactly what distros do with SOURCE_DATE_EPOCH), it's still pretty annoying, and not particularly build cache friendly, since the same file in different source artifacts may produce different build outputs.
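As a rough illustration of the difference, the sketch below compiles the same source twice with different mtimes and compares the 16-byte pyc headers. It assumes Python 3.7 or later, where this PEP eventually shipped as `py_compile.PycInvalidationMode`; the helper `header_for` is a hypothetical name.

```python
import os
import py_compile
import tempfile


def header_for(mode, mtime):
    """Compile a tiny module with the given source mtime; return the pyc header."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        pyc = os.path.join(tmp, "mod.pyc")
        with open(src, "w") as f:
            f.write("x = 1\n")
        os.utime(src, (mtime, mtime))
        py_compile.compile(src, cfile=pyc, doraise=True, invalidation_mode=mode)
        with open(pyc, "rb") as f:
            return f.read(16)


Mode = py_compile.PycInvalidationMode
# Timestamp-based headers embed the mtime, so they differ between the two builds.
timestamp_differs = (
    header_for(Mode.TIMESTAMP, 1_000_000_000) != header_for(Mode.TIMESTAMP, 1_000_000_100)
)
# Hash-based headers depend only on the source bytes, so they match.
hash_matches = (
    header_for(Mode.CHECKED_HASH, 1_000_000_000) == header_for(Mode.CHECKED_HASH, 1_000_000_100)
)
```

Only the headers are compared here; whether the marshaled code objects are byte-identical is a separate question this sketch does not address.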

...

Would they benefit from a pyc that can contain _both_ timestamp+length, and the source_hash? if both were present, I assume that only one would be checked at startup. i'm not sure what would make the decision of what to check. one fails, check the other? i personally do not have a use for this case so i'd omit the complexity without a demonstrated need.

I don't see any way we'd benefit from having both items present.

However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

By default we'd still produce the checked-timestamp format, but managed build systems (including Linux distros) could opt-in to the unchecked-hash format.
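The masks above could be applied as in the following sketch. The message leaves the polarity of each bit open, so the fields are reported as raw bits; the names `MagicFields` and `split_magic` are made up for the example, and the sample values in any usage are arbitrary.

```python
from typing import NamedTuple

BASE_MASK = 0x1FFF      # the traditional magic number
KIND_BIT = 0x8000       # timestamp or hash?
CHECKED_BIT = 0x4000    # checked or not?
RESERVED_BIT = 0x2000   # reserved for future format changes


class MagicFields(NamedTuple):
    base: int
    kind_bit: bool
    checked_bit: bool
    reserved_bit: bool


def split_magic(number: int) -> MagicFields:
    """Split a 16-bit magic number into its base value and the three flag bits."""
    return MagicFields(
        base=number & BASE_MASK,
        kind_bit=bool(number & KIND_BIT),
        checked_bit=bool(number & CHECKED_BIT),
        reserved_bit=bool(number & RESERVED_BIT),
    )
```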

...

Something to also state in the PEP:

This is intentionally not a "secure" hash. Security is explicitly a non-goal.

I don't think it's so much that security is a non-goal, as that the (admittedly minor) security improvement comes from making it easier to reproduce the expected machine-readable output from a given human-readable input, rather than from the nature of the hashing function used.

...

Rationale behind my support:

+1 from me as well, for the reasons Greg gives (while Fedora doesn't currently do any per-file build artifact caching, I hope we will in the future, and output formats based on input artifact hashes will make that much easier than formats based on input timestamps). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Antoine Pitrou

10:04 a.m.

On Thu, 7 Sep 2017 18:47:20 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...

However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files. Regards Antoine.

Antoine Pitrou

10:38 a.m.

New subject: PEP 552: single magic number

On Fri, 8 Sep 2017 12:04:52 +0200 Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Thu, 7 Sep 2017 18:47:20 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

Let me expand a bit on this. Currently, the format is:

- bytes 0..3: magic number
- bytes 4..7: source file timestamp
- bytes 8..11: source file size
- bytes 12+: pyc file body (marshal format)

What I'm proposing is:

- bytes 0..3: magic number
- bytes 4..7: header options (bitfield)
- bytes 8..15: header contents
  Depending on header options:
  - bytes 8..11: source file timestamp
  - bytes 12..15: source file size
  or:
  - bytes 8..15: 64-bit source file hash
- bytes 16+: pyc file body (marshal format)

This way, we keep a single magic number, a single header size, and there's only a per-build variation in the middle of the header.

Of course, there are other possible ways to encode information. For example, the header could be a sequence of Type-Length-Value triplets, perhaps prefixed with header size or body offset for easy seeking.

My whole point here is that we can easily avoid the annoyance of dual magic numbers and encodings which must be maintained in parallel.

Regards

Antoine.
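The proposed 16-byte layout can be sketched with the `struct` module. Several details here are assumptions rather than anything stated in the message: the magic number is treated as a plain 32-bit integer, all fields are little-endian, and bit 0 of the options word is taken to mean "hash-based". The function names are hypothetical.

```python
import struct

HEADER = struct.Struct("<II8s")  # magic, options bitfield, 8 bytes of contents
HASH_BASED = 0x1                 # assumed bit assignment, not from the thread


def pack_timestamp_header(magic, mtime, size):
    """Bytes 8..11 hold the source timestamp, bytes 12..15 the source size."""
    contents = struct.pack("<II", mtime & 0xFFFFFFFF, size & 0xFFFFFFFF)
    return HEADER.pack(magic, 0, contents)


def pack_hash_header(magic, source_hash):
    """Bytes 8..15 hold a 64-bit source hash."""
    if len(source_hash) != 8:
        raise ValueError("source hash must be exactly 8 bytes")
    return HEADER.pack(magic, HASH_BASED, source_hash)


def parse_header(data):
    """Return (magic, options, contents); contents depends on the options."""
    magic, options, contents = HEADER.unpack(data[:HEADER.size])
    if options & HASH_BASED:
        return magic, options, contents
    return magic, options, struct.unpack("<II", contents)
```

Both variants produce exactly 16 bytes, so the marshal body always starts at the same offset regardless of which kind of header was written.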

Guido van Rossum

2:40 p.m.

New subject: PEP 552: single magic number

I also like having the header fixed-size, so it might be possible to rewrite headers (e.g. to flip the source bit) without moving the rest of the file.

On Fri, Sep 8, 2017 at 3:38 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Fri, 8 Sep 2017 12:04:52 +0200 Antoine Pitrou <solipsis@pitrou.net> wrote:

...
On Thu, 7 Sep 2017 18:47:20 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

Let me expand a bit on this. Currently, the format is:

- bytes 0..3: magic number
- bytes 4..7: source file timestamp
- bytes 8..11: source file size
- bytes 12+: pyc file body (marshal format)

What I'm proposing is:

- bytes 0..3: magic number
- bytes 4..7: header options (bitfield)
- bytes 8..15: header contents
  Depending on header options:
  - bytes 8..11: source file timestamp
  - bytes 12..15: source file size
  or:
  - bytes 8..15: 64-bit source file hash
- bytes 16+: pyc file body (marshal format)

This way, we keep a single magic number, a single header size, and there's only a per-build variation in the middle of the header.

Of course, there are other possible ways to encode information. For example, the header could be a sequence of Type-Length-Value triplets, perhaps prefixed with header size or body offset for easy seeking.

My whole point here is that we can easily avoid the annoyance of dual magic numbers and encodings which must be maintained in parallel.

Regards

Antoine.

-- --Guido van Rossum (python.org/~guido)

Nick Coghlan

2:49 p.m.

On 8 September 2017 at 03:04, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Thu, 7 Sep 2017 18:47:20 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

One of Benjamin's goals was for the existing timestamp-based pyc format to remain completely unchanged, so we need some kind of marker in the magic number to indicate whether the file is using the new format or not.

I'd also be fine with using a single bit for that, such that the only bitmasking needed was:

* number & 0x8000 -> legacy format or new format?
* number & 0x7FFF -> the magic number itself

And any further flags would go in a separate field. That's essentially what PEP 552 already suggests; the only adjustment is the idea of specifically using the high order bit in the magic number field to indicate the pyc format in use rather than leaving the explanation of how the two magic numbers will differ unspecified.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Antoine Pitrou

2:55 p.m.

On Fri, 8 Sep 2017 07:49:46 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...

On 8 September 2017 at 03:04, Antoine Pitrou <solipsis@pitrou.net> wrote:

...
On Thu, 7 Sep 2017 18:47:20 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

One of Benjamin's goals was for the existing timestamp-based pyc format to remain completely unchanged, so we need some kind of marker in the magic number to indicate whether the file is using the new format or not.

I don't think that's a useful goal, as long as we bump the magic number. Note the header format was already changed in the past when we added a "size" field beside the "timestamp" field, to resolve collisions due to timestamp granularity. Regards Antoine.

Nick Coghlan

4:43 p.m.

On 8 September 2017 at 07:55, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Fri, 8 Sep 2017 07:49:46 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
...
I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

One of Benjamin's goals was for the existing timestamp-based pyc format to remain completely unchanged, so we need some kind of marker in the magic number to indicate whether the file is using the new format or not.

I don't think that's a useful goal, as long as we bump the magic number.

Yeah, we (me, Benjamin, Greg) discussed that here, and we agree - there isn't actually any benefit to keeping the timestamp based pyc's using the same layout, since the magic number is already going to change anyway. Given that, I think your suggested 16 byte header layout would be a good one: 4 byte magic number, 4 bytes reserved for format flags, 8 bytes with an interpretation that depends on the format flags. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Benjamin Peterson

5:52 p.m.

Thank you all for the feedback. I've now updated the PEP to specify a 4-word pyc header with a bit field in every case.

On Fri, Sep 8, 2017, at 09:43, Nick Coghlan wrote:

...

On 8 September 2017 at 07:55, Antoine Pitrou <solipsis@pitrou.net> wrote:

...
On Fri, 8 Sep 2017 07:49:46 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
...
I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

One of Benjamin's goals was for the existing timestamp-based pyc format to remain completely unchanged, so we need some kind of marker in the magic number to indicate whether the file is using the new format or not.

I don't think that's a useful goal, as long as we bump the magic number.

Yeah, we (me, Benjamin, Greg) discussed that here, and we agree - there isn't actually any benefit to keeping the timestamp based pyc's using the same layout, since the magic number is already going to change anyway.

Given that, I think your suggested 16 byte header layout would be a good one: 4 byte magic number, 4 bytes reserved for format flags, 8 bytes with an interpretation that depends on the format flags.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Brett Cannon

20 Sep

10:37 p.m.

On Fri, 8 Sep 2017 at 10:53 Benjamin Peterson <benjamin@python.org> wrote:

...

Thank you all for the feedback. I've now updated the PEP to specify a 4-word pyc header with a bit field in every case.

On Fri, Sep 8, 2017, at 09:43, Nick Coghlan wrote:

...
On 8 September 2017 at 07:55, Antoine Pitrou <solipsis@pitrou.net> wrote:

...
On Fri, 8 Sep 2017 07:49:46 -0700 Nick Coghlan <ncoghlan@gmail.com> wrote:

...
...
I'd rather a single magic number and a separate bitfield that tells what the header encodes exactly. We don't *have* to fight for a tiny size reduction of pyc files.

One of Benjamin's goals was for the existing timestamp-based pyc format to remain completely unchanged, so we need some kind of marker in the magic number to indicate whether the file is using the new format or not.

I don't think that's a useful goal, as long as we bump the magic number.

Yeah, we (me, Benjamin, Greg) discussed that here, and we agree - there isn't actually any benefit to keeping the timestamp based pyc's using the same layout, since the magic number is already going to change anyway.

Given that, I think your suggested 16 byte header layout would be a good one: 4 byte magic number, 4 bytes reserved for format flags, 8 bytes with an interpretation that depends on the format flags.

+1 from me!

Benjamin Peterson

8 Sep

1:54 a.m.

On Thu, Sep 7, 2017, at 16:58, Gregory P. Smith wrote:

...

+1 on this PEP.

Thanks!

...

Questions:

Input from OS package distributors would be interesting. Would they use this? Which way would it impact their startup time (loading the .py file vs just statting it. does that even matter? source files are often eventually loaded for linecache use in tracebacks anyways)?

I anticipate distributors will use the mode where the pyc is simply trusted and the source file isn't hashed. That would make the I/O overhead identical to today.
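For illustration, here is how the trusted-pyc mode can be requested with the API that eventually shipped in Python 3.7 (`py_compile.PycInvalidationMode.UNCHECKED_HASH`), and how the flags word of the resulting header can be read back. The helper name `pyc_flags` is hypothetical, and the bit meanings in the comments follow the final text of PEP 552 rather than anything fixed at this point in the thread.

```python
import os
import py_compile
import struct
import tempfile


def pyc_flags(mode):
    """Compile a tiny module in the given mode and return its header flags word."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        pyc = os.path.join(tmp, "mod.pyc")
        with open(src, "w") as f:
            f.write("x = 1\n")
        py_compile.compile(src, cfile=pyc, doraise=True, invalidation_mode=mode)
        with open(pyc, "rb") as f:
            header = f.read(16)
    # Bytes 4..7 hold the little-endian flags word.
    return struct.unpack("<I", header[4:8])[0]


Mode = py_compile.PycInvalidationMode
unchecked = pyc_flags(Mode.UNCHECKED_HASH)  # hash-based bit set, check_source clear
checked = pyc_flags(Mode.CHECKED_HASH)      # hash-based and check_source both set
```

With the unchecked variant the importer does not need to read the source file at all, which is why the I/O cost matches the case where the pyc is simply trusted.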

...

Would they benefit from a pyc that can contain _both_ timestamp+length, and the source_hash? if both were present, I assume that only one would be checked at startup. i'm not sure what would make the decision of what to check. one fails, check the other? i personally do not have a use for this case so i'd omit the complexity without a demonstrated need.

Yeah, it could act as a multi-tiered cache key. I agree with your conclusion to pass for now.

...

Something to also state in the PEP:

This is intentionally not a "secure" hash. Security is explicitly a non-goal.

Added a sentence.

Barry Warsaw

2:13 a.m.

On Sep 7, 2017, at 16:58, Gregory P. Smith <greg@krypto.org> wrote:

...

Input from OS package distributors would be interesting. Would they use this?

I suspect it won’t be that interesting to the Debian ecosystem, since we generate pyc files on package install. We do that because we can support multiple versions of Python installed simultaneously and we don’t know which versions are installed on the target machine. I suppose our stdlib package could ship pycs, but we don’t. Reproducible builds may still be interesting in other situations though, such as CI machines, but then SOURCE_DATE_EPOCH is probably good enough. -Barry

Antoine Pitrou

7 Sep

9:56 p.m.

On Thu, 07 Sep 2017 14:40:33 -0700 Benjamin Peterson <benjamin@python.org> wrote:

...

On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:

...
Nice one.

It would be nice to specify the various APIs needed as well.

The compileall and py_compile ones?

...
Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)

I haven't actually measured anything, but statting a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.

Also, while some/many of us have fast development machines with performant SSDs, Python can be used in situations where "disk" I/O is still slow (imagine a Raspberry Pi system or similar, grinding through an SD card or USB key to load py and pyc files).

Regards

Antoine.

participants (8)

Antoine Pitrou
Barry Warsaw
Benjamin Peterson
Brett Cannon
Freddy Rietdijk
Gregory P. Smith
Guido van Rossum
Nick Coghlan
