+1 on this PEP.

The TL;DR summary of this PEP:

The pyc date+length metadata check was a convenient hack. It still works well for many people and use cases, and it isn't going away.

PEP 552 proposes a new alternate hack that relies on file contents instead of OS and filesystem date metadata.

Assumption: The hash function is significantly faster than re-parsing the source. (guaranteed to be true)

Questions:

Input from OS package distributors would be interesting. Would they use this? Which way would it impact their startup time (loading the .py file vs. just stat'ing it)? Does that even matter? Source files are often eventually loaded for linecache use in tracebacks anyway.

Would they benefit from a pyc that can contain _both_ timestamp+length and the source_hash? If both were present, I assume that only one would be checked at startup. I'm not sure what would decide which one to check. One fails, check the other? I personally do not have a use for this case, so I'd omit the complexity without a demonstrated need.

Something to also state in the PEP:

This is intentionally not a "secure" hash. Security is explicitly a non-goal.

Rationale behind my support:

We use a superset of Bazel at Google (unsurprising) and have had to jump through a lot of messy hoops to deal with timestamp metadata winding up in output files vs. deterministic builds. What Benjamin describes here sounds exactly like what we would want.

It allows deterministic builds in distributed build and cached operation systems where timestamps are never going to be guaranteed.

It allows the check to work on filesystems which do not preserve timestamps.

Also importantly, it allows the check to be disabled via the check_source bit. Today we use a modified importer at work that skips checking timestamps anyway, because the way we ship applications means the entire set of dependencies present is already guaranteed at build time to be correct, and modification at runtime is either not possible or not a concern.
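To make the check_source point concrete, here is a rough sketch. It assumes the `py_compile.PycInvalidationMode` API and the 16-byte pyc header (magic, flags word, then 8 bytes of either mtime+size or source hash) as they shipped in Python 3.7 and later, which postdates this thread. The helper names `compile_hash_pyc` and `read_pyc_header` are made up for illustration and have nothing to do with our internal importer.

```python
import importlib.util
import py_compile
import struct


def compile_hash_pyc(source_path, pyc_path, check_source=True):
    """Write a hash-based pyc; check_source=False asks the importer to skip validation."""
    mode = (py_compile.PycInvalidationMode.CHECKED_HASH
            if check_source
            else py_compile.PycInvalidationMode.UNCHECKED_HASH)
    py_compile.compile(source_path, cfile=pyc_path, doraise=True,
                       invalidation_mode=mode)
    return pyc_path


def read_pyc_header(pyc_path):
    """Decode the 16-byte header: magic, little-endian flags, 8-byte payload."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    (flags,) = struct.unpack("<I", header[4:8])
    return {
        "magic_ok": header[:4] == importlib.util.MAGIC_NUMBER,
        "hash_based": bool(flags & 0b01),    # lowest bit: hash-based pyc
        "check_source": bool(flags & 0b10),  # second bit: validate against source
        "payload": header[8:16],             # source hash when hash_based is set
    }
```

With `check_source=False` the resulting pyc is loaded without ever touching the source file, which is the behavior we currently get from a custom importer.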
This PEP would avoid the need for an extra importer or modified interpreter logic to make this happen.

-G

On Thu, Sep 7, 2017 at 3:47 PM Benjamin Peterson <benjamin@python.org> wrote:
On Thu, Sep 7, 2017, at 14:43, Guido van Rossum wrote:
On Thu, Sep 7, 2017 at 2:40 PM, Benjamin Peterson <benjamin@python.org> wrote:
On Thu, Sep 7, 2017, at 14:19, Guido van Rossum wrote:
Nice one.
It would be nice to specify the various APIs needed as well.
The compileall and py_compile ones?
Yes, and the SipHash mod to specify the key you mentioned.
Done.
Why do you keep the mtime-based format as an option? (Maybe because it's faster? Did you measure it?)
I haven't actually measured anything, but stat'ing a file will definitely be faster than reading it completely and hashing it. I suppose if the speed difference between timestamp-based and hash-based pycs turned out to be small, we could feel good about dropping the timestamp format completely. However, that difference might be hard to determine definitively, as I expect the speed hit will vary widely based on system parameters such as disk speed and page cache size.
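For anyone who wants a rough number, a minimal timing sketch along these lines might do. It assumes `importlib.util.source_hash` (available in Python 3.7 and later) as a stand-in for the hash the importer would compute. The function name `time_checks` is hypothetical, and the loop only measures a warm page cache, so it understates the cost on a cold disk.

```python
import importlib.util
import os
import time


def time_checks(path, repeats=1000):
    """Return (stat_seconds, hash_seconds), each averaged over `repeats` runs."""
    # Timestamp-style check: a single stat() yielding mtime and size.
    start = time.perf_counter()
    for _ in range(repeats):
        st = os.stat(path)
        _ = (int(st.st_mtime), st.st_size)
    stat_time = (time.perf_counter() - start) / repeats

    # Hash-style check: read the whole source file and hash its bytes.
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "rb") as f:
            importlib.util.source_hash(f.read())
    hash_time = (time.perf_counter() - start) / repeats

    return stat_time, hash_time
```

Calling `time_checks(os.__file__)` gives one data point for a mid-sized stdlib module; results will differ by filesystem, file size, and cache state, so treat them as indicative only.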
My goal in this PEP was to preserve the current pyc invalidation behavior, which works well today for many use cases, as the default. The hash-based pycs are reserved for distribution and other power use cases.
OK, maybe you can clarify that a bit in the PEP.
I've added a paragraph to the Rationale section.