On 7 September 2017 at 16:58, Gregory P. Smith <greg@krypto.org> wrote:
+1 on this PEP.
The TL;DR summary of this PEP: The pyc date+length metadata check was a convenient hack. It still works well for many people and use cases, and it isn't going away. PEP 552 proposes a new alternate hack that relies on file contents instead of OS and filesystem date metadata. Assumption: The hash function is significantly faster than re-parsing the source (guaranteed to be true).
Questions:
Input from OS package distributors would be interesting. Would they use this? Which way would it impact their startup time (loading the .py file vs. just statting it)? Does that even matter, given that source files are often eventually loaded for linecache use in tracebacks anyway?
Christian and I asked some of our security folks for their personal wishlists recently, and one of the items that came up was "The recompile is based on a timestamp. How do you know the pyc file on disk really is related to the py file that is human readable? Can it be based on a hash or something like that?" This is a restating of the reproducible build use case: for a given version of Python, a given source file should always give the same source hash and marshaled code object, and once it does, it's easier to do an independent compilation from the source file and check you get the same answer. While you can implement that for timestamp-based formats by adjusting input file metadata (and that's exactly what distros do with SOURCE_DATE_EPOCH), it's still pretty annoying, and not particularly build-cache friendly, since the same file in different source artifacts may produce different build outputs.
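To make the build-cache point concrete, here is a minimal sketch comparing a timestamp-style key with a content-based key for the same source file. The helper names (timestamp_key, content_key, demo) are hypothetical, and the truncated SHA-256 is only a stand-in for whatever hash function the PEP ends up specifying.

```python
import hashlib
import os
import tempfile


def timestamp_key(path):
    """Metadata-style key: source mtime (whole seconds) plus source length."""
    st = os.stat(path)
    return int(st.st_mtime), st.st_size


def content_key(path):
    """Content-style key: a short digest of the source bytes.

    Assumption: truncated SHA-256 is used purely for illustration here.
    """
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()[:8]


def demo():
    """Return (timestamp_key, content_key) before and after changing mtime only."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "wb") as f:
            f.write(b"x = 1\n")
        before = timestamp_key(path), content_key(path)
        # Simulate the same file arriving via a different source artifact:
        # identical bytes, different modification time.
        os.utime(path, (1_000_000_000, 1_000_000_000))
        after = timestamp_key(path), content_key(path)
    return before, after
```

In this sketch the metadata key changes when only the mtime changes, while the content key stays the same, which is the property that makes hash-based outputs easier to cache and to re-verify independently.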
Would they benefit from a pyc that can contain _both_ timestamp+length, and the source_hash? If both were present, I assume that only one would be checked at startup. I'm not sure what would make the decision of what to check. One fails, check the other? I personally do not have a use for this case, so I'd omit the complexity without a demonstrated need.
I don't see any way we'd benefit from having both items present. However, I do wonder whether we could encode *all* the mode settings into the magic number, such that we did something like reserving the top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

By default we'd still produce the checked-timestamp format, but managed build systems (including Linux distros) could opt-in to the unchecked-hash format.
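A rough sketch of how that flag layout could be packed and unpacked, assuming a 16-bit magic value. The polarity of each flag is my assumption (a set 0x8000 bit means hash-based, a set 0x4000 bit means unchecked, so that all flags clear gives the default checked-timestamp format), and none of these names exist in CPython.

```python
from typing import NamedTuple

MAGIC_MASK = 0x1FFF  # low 13 bits: the traditional magic number
HASH_BASED = 0x8000  # assumed polarity: set -> source hash, clear -> timestamp
UNCHECKED = 0x4000   # assumed polarity: set -> unchecked, clear -> checked
RESERVED = 0x2000    # reserved for future format changes


class PycMode(NamedTuple):
    magic: int
    hash_based: bool
    checked: bool


def encode_mode(magic, *, hash_based=False, checked=True):
    """Fold the format flags into the top three bits of the magic number."""
    if magic & ~MAGIC_MASK:
        raise ValueError("magic number must fit in the low 13 bits")
    number = magic
    if hash_based:
        number |= HASH_BASED
    if not checked:
        number |= UNCHECKED
    return number


def decode_mode(number):
    """Split a combined value back into the magic number and its flags."""
    if number & RESERVED:
        raise ValueError("reserved format bit is set")
    return PycMode(
        magic=number & MAGIC_MASK,
        hash_based=bool(number & HASH_BASED),
        checked=not (number & UNCHECKED),
    )
```

With this polarity, encode_mode(m) leaves m unchanged for the default checked-timestamp format, while encode_mode(m, hash_based=True, checked=False) yields the unchecked-hash variant that a managed build system might opt in to.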
Something to also state in the PEP:
This is intentionally not a "secure" hash. Security is explicitly a non-goal.
I don't think it's so much that security is a non-goal, as that the (admittedly minor) security improvement comes from making it easier to reproduce the expected machine-readable output from a given human-readable input, rather than from the nature of the hashing function used.
Rationale behind my support:
+1 from me as well, for the reasons Greg gives (while Fedora doesn't currently do any per-file build artifact caching, I hope we will in the future, and output formats based on input artifact hashes will make that much easier than formats based on input timestamps).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia