Add a hash to .pyc to avoid mix-ups between .py and .pyc

I have run into the following problem several times. I delete a module and the .pyc stays around, and by "magic" Python still uses the .pyc. A similar error happens (less often) when, through some file system manipulation, the .pyc ends up newer than the .py but corresponds to an older version of the .py.

It is not a major problem, but it is a real one, and I'm not the first to hit it. A Stack Overflow search returns quite a lot of relevant answers:
http://stackoverflow.com/search?q=old+pyc
and so does a Google search:
https://www.google.fr/search?q=old+pyc
Several of the Google results are bug tracker entries from various projects. (The results also include cases where .pyc files are stored in VCS repositories, but that is a separate, unrelated problem.)

I even found a blog post that uses .pyc as a backdoor:
http://secureallthethings.blogspot.fr/2015/11/backdooring-python-via-pyc-pi-...

My idea to kill both birds with one stone is to store a hash (probably a cryptographic one) of the .py file in the .pyc file, then read the .py file and check the hash on import.

On first startup, the only additional cost is the hash calculation, which I think is cheap compared to other factors (especially I/O). On later startups, the main cost is the additional read of the .py files, plus the cheap hash calculations. I believe removing the bugs would be worth the performance cost.

I know that some use cases rely on .pyc alone without keeping the .py around, for example to avoid distributing the source files. In my view, that use case should be an opt-in decision and not the default. Several opt-in mechanisms could be envisioned: environment variables, command-line switches, or a special compilation of .pyc that explicitly asks not to check the hash.

--
Xavier
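The check proposed in this message can be sketched in a few lines. This is only an illustration of the idea, not how any interpreter implements it: the helper names `source_digest` and `cache_is_valid` are hypothetical, SHA-256 is an arbitrary choice of hash, and the digest is kept in a variable rather than inside a real .pyc header.

```python
import hashlib
import os
import tempfile


def source_digest(source_path):
    """Return a hex digest of the source file's bytes (hypothetical helper)."""
    with open(source_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def cache_is_valid(source_path, stored_digest):
    """Accept cached byte-code only if the source exists and hashes the same."""
    if not os.path.exists(source_path):
        return False  # source deleted: do not fall back to the stale cache
    return source_digest(source_path) == stored_digest


def demo():
    """Walk through the three cases: unchanged, edited, and deleted source."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "eggs.py")
        with open(path, "w") as f:
            f.write("x = 1\n")
        stored = source_digest(path)  # what would be recorded at compile time
        unchanged = cache_is_valid(path, stored)
        with open(path, "w") as f:
            f.write("x = 2\n")
        edited = cache_is_valid(path, stored)
        os.remove(path)
        deleted = cache_is_valid(path, stored)
        return unchanged, edited, deleted
```

The cost being debated in the thread is visible here: every validity check has to open and read the whole source file, not just stat it.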

On Mon, Aug 15, 2016 at 9:05 AM, Xavier Combelle <xavier.combelle@gmail.com> wrote:
Of those, only the last one is truly viable - the application developer isn't necessarily the one choosing to make a sourceless module (it could be any library module anywhere in the tree, including the CPython standard library - sometimes that's distributed without .py files, to reduce interpreter on-disk size).

So what this would mean is that a sourceless distro is not simply "delete the .py files and stuff keeps working", but "run this script and it'll recompile the .py files to stand-alone .pyc files". As such, I think the idea has merit; but it won't close the backdoor that you mentioned (anyone who wants to make that kind of attack would simply make a file that's marked as stand-alone).

That said, though - anyone who can maliciously write to your file system has already won, whether they're writing pyc or py files. The only difference is how easily it's detected. Fully loading and hashing the .py file seems like a paranoia option, and if you want that, just blow away all .pyc files, have your PYTHONPATH point to a read-only file system, and force the interpreter to compile everything fresh every time.

How does this interact with the __pycache__ directory?

ChrisA

You can add a `make clean` build step:

    pyclean:
        find . -name '*.pyc' -delete

You can delete all .pyc files:
- $ find . -name '*.pyc' -delete
- http://manpages.ubuntu.com/manpages/precise/man1/pyclean.1.html  #.pyc, .pyo

You can rebuild all .pyc files (for a given directory):
- $ python -m compileall -h
- https://docs.python.org/2/library/compileall.html
- https://docs.python.org/3/library/compileall.html

You can, instead of building .pyc, build .pyo:
- https://docs.python.org/2/using/cmdline.html#envvar-PYTHONOPTIMIZE
- https://docs.python.org/2/using/cmdline.html#cmdoption-O

You can avoid writing .pyc or .pyo files with PYTHONDONTWRITEBYTECODE / -B:
- https://docs.python.org/2/using/cmdline.html#envvar-PYTHONDONTWRITEBYTECODE
- https://docs.python.org/2/using/cmdline.html#cmdoption-B
- If the files exist though,
  - https://docs.python.org/3/reference/import.html

You can build a PEX (which rebuilds .pyc files) and test/deploy that:
- https://github.com/pantsbuild/pex#integrating-pex-into-your-workflow
- https://pantsbuild.github.io/python-readme.html#more-about-python-tests

How .pyc files currently work:
- http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html
- https://www.python.org/dev/peps/pep-3147/#flow-chart (*.pyc -> ./__pycache__)
- http://raulcd.com/how-python-caches-compiled-bytecode.html

You could add a hash of the .py source file in the header of the .pyc/.pyo object (as proposed):
- The overhead of this hashing would be a significant performance regression
- Instead, today, the build step can just pyclean or build a .zip/.WHL/.PEX which is expected to be a fresh build

On Sun, Aug 14, 2016 at 6:23 PM, Chris Angelico <rosuav@gmail.com> wrote:
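For a build step that has to run where `find` or `make` is not available, the delete-then-rebuild approach might look like the sketch below. The function names `remove_bytecode` and `rebuild_bytecode` are hypothetical; `compileall.compile_dir` is the standard-library call behind `python -m compileall`.

```python
import compileall
import pathlib
import tempfile


def remove_bytecode(root):
    """Delete every .pyc/.pyo file under root and return how many were removed."""
    removed = 0
    for pattern in ("*.pyc", "*.pyo"):
        # Materialize the matches first so deletion does not disturb the walk.
        for path in list(pathlib.Path(root).rglob(pattern)):
            path.unlink()
            removed += 1
    return removed


def rebuild_bytecode(root):
    """Recompile every .py file under root; force=True ignores timestamps."""
    return bool(compileall.compile_dir(str(root), force=True, quiet=1))


def demo():
    """Create one module and one orphaned .pyc, then clean and rebuild."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "spam.py").write_text("x = 1\n")
        (root / "stale.pyc").write_bytes(b"left over from a deleted module")
        removed = remove_bytecode(root)
        ok = rebuild_bytecode(root)
        return removed, ok
```

Running the cleanup before every test or packaging step removes the orphaned byte-code case, at the price of recompiling everything.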

On 15/08/2016 02:45, Wes Turner wrote:
The problem is not a lack of ways to prevent it; the simplest is to delete the .pyc file, which is easy to do once you spot it. The problem is that it happens randomly in a normal workflow.

To get an idea of the overhead of the whole hashing procedure, I ran the following script:

    import sys
    from time import time
    from zlib import adler32 as h

    # Time the import of a large standard-library module.
    t2 = time()
    import decimal
    print(decimal.__file__)
    c1 = time() - t2

    # Time reading the same file and computing a checksum over it.
    t1 = time()
    r = h(open(decimal.__file__, 'rb').read())
    c2 = time() - t1

    # Checksum time, import time, and their ratio.
    print(c2, c1, c2 / c1)

decimal was chosen because it was the biggest file of the standard library. Over 20 runs, the overhead was always between 1% and 1.5%. So yes, the overhead on the import process is measurable but very small; consequently, I would not call it significant. Moreover, the import process is only one part (and not the biggest one) of the whole.

Unlike in my first mail, I now consider only a non-cryptographic hash/checksum, since the only aim is to prevent an accidental mismatch between the .pyc and .py files.

On Sun, Aug 14, 2016 at 9:35 PM, Xavier Combelle <xavier.combelle@gmail.com> wrote:
IIUC, the timestamp in the .pyc header is designed to prevent this occurrence?

Reasons that the modification timestamp comparison could be off:
- Time change
- Daylight saving time
- NTP drift adjustment?
I agree that 1 to 1.5% is not significant.
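To see what the header timestamp actually records, a sketch like the following reads it back from a freshly compiled file. It assumes the 16-byte header layout used by CPython 3.7 and later (magic number, flags, source mtime, source size); older interpreters, including those current when this thread was written, use a shorter header without the flags field. The helper name `read_pyc_header` is hypothetical.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile


def read_pyc_header(pyc_path):
    """Return (flags, mtime, size) from a CPython 3.7+ timestamp-based .pyc."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    if header[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("byte-code was written by a different Python version")
    # Three little-endian unsigned 32-bit fields follow the magic number.
    return struct.unpack("<III", header[4:16])


def demo():
    """Compile a small module and compare the header with the source's stat."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "eggs.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        pyc = py_compile.compile(
            src,
            cfile=os.path.join(tmp, "eggs.pyc"),
            invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
        )
        flags, mtime, size = read_pyc_header(pyc)
        st = os.stat(src)
        same_mtime = mtime == int(st.st_mtime) & 0xFFFFFFFF
        same_size = size == st.st_size & 0xFFFFFFFF
        return flags, same_mtime, same_size
```

The stored value is the source file's own modification time truncated to whole seconds, so the check is only as reliable as the mtime the file system reports for the .py file.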

On Mon, Aug 15, 2016 at 01:05:47AM +0200, Xavier Combelle wrote:
Upgrade to Python 3.2 or better, and the problem will go away. In 3.2 and above, the .pyc files are stored in a separate __pycache__ directory, and are only used if the .py file still exists.

In Python 3.1 and older, you have:

    # directory in sys.path
    spam.py
    spam.pyc
    eggs.py
    eggs.pyc

and if you delete eggs.py, Python will still use eggs.pyc. But in 3.2 and higher the cache keeps implementation- and version-specific byte-code files:

    spam.py
    eggs.py
    __pycache__/
    +-- spam.cpython-32.pyc
    +-- spam.cpython-35.pyc
    +-- spam.pypy-33.pyc
    +-- eggs.cpython-34.pyc
    +-- eggs.cpython-35.pyc

If you delete the eggs.py file, the eggs byte-code files won't be used.

Byte-code only modules are still supported, but you have to explicitly opt in to that by moving the .pyc file out of the __pycache__ directory and renaming it.

See PEP 3147 for more details:

https://www.python.org/dev/peps/pep-3147/

--
Steve
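The mapping between a source file and its cached byte-code can be queried from the standard library, which is a quick way to check where a given interpreter will look. A minimal sketch, where `pkg/eggs.py` is a made-up path that does not need to exist and `cache_paths` is a hypothetical helper:

```python
import importlib.util
import os


def cache_paths(source_path):
    """Return (cache_path, round_tripped_source) for a source path."""
    cached = importlib.util.cache_from_source(source_path)
    return cached, importlib.util.source_from_cache(cached)


cached, source = cache_paths(os.path.join("pkg", "eggs.py"))
print(cached)  # the file name includes the running interpreter's cache tag
print(source)
```

Both functions only manipulate path strings, so they work for files that are not on disk.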

The purpose of .pyc is to optimize Python. With your proposed change, the number of syscalls is doubled (open, read, close) and you add extra work (computing the hash) whenever a .pyc is used.

If your filesystem works correctly, you should not have to bother.

Victor

On 15 Aug 2016 01:06, "Xavier Combelle" <xavier.combelle@gmail.com> wrote:

participants (6)
- Chris Angelico
- David Mertz
- Steven D'Aprano
- Victor Stinner
- Wes Turner
- Xavier Combelle