Add a cryptographic hash (e.g. SHA1) of source to Python compiled objects?

I've been re-examining from the ground up the whole state of affairs in writing a debugger. One of the challenges of a debugger, or any source-code analysis tool, is verifying that the source code the tool is reporting on corresponds to the compiled object under execution. For debuggers, this problem becomes more likely to occur when you are debugging on a computer that isn't the same as the computer where the code is running. For this, it would be useful to have a cryptographic hash like a SHA1 in the compiled object, hopefully accessible via the module object where the file path is stored. I understand that there is an mtime timestamp in the .pyc, but this is not as reliable as a cryptographic hash such as SHA1.

There seems to be some confusion in thinking the only use case for this is remote debugging, where source code may be on a different computer than where the code is running, but I do not believe this is so. Here are two other situations which come up. The first is a code coverage tool like coverage.py, which checks coverage over several runs. Let's say the source code is erased and checked out again, or edited and temporarily changed several times, but in the end the file stays the same. A SHA1 hash will show the file hasn't changed; mtime won't. A second, more contrived example is some sort of secure environment. Let's say I am using compiled Python code (say, for an embedded device) and someone offers me what's purported to be the source code. How can I easily verify that this is correct? In theory, I suppose, if I have enough information about the version of Python and the platform, I can compile the purported source and compare, ignoring some bits of information (like the mtime ;-) in the compiled object. But one would have to be careful about getting compilers and platforms the same, or understand how this changes compilation.

On Tue, Feb 3, 2009 at 01:56, <rocky@gnu.org> wrote:
Well, whatever solution you propose would need to have this signing be optional since it is in no way required in day-to-day executions. The overhead of calculating the hash is not worth the benefit in the general case.
That seems somewhat contrived. Assuming you do not have coverage as part of your continuous build process, is having a couple of files covered again that expensive? And if you were mucking with the files, you might want to make sure that you really did not change something.
I really do not see that situation ever coming up.
The only thing you need to compile bytecode is the same Python version (and thus the same magic number) and whether it is a .pyc or .pyo (and thus if -O/-OO was used). Your platform has nothing to do with bytecode compilation. -Brett

On Tue, Feb 3, 2009 at 11:44, Raymond Hettinger <python@rcn.com> wrote:
Actually, recompilation is so cheap and easy (see py_compile or compileall with the --force flag) that bothering with the hash is probably not worth it. Might as well recompile the thing and just see if the strings are equivalent. Unless you are doing this for a ton of files the speed difference, while possibly relatively huge, in absolute terms is still really small and thus probably not worth the added complexity of the hashes. -Brett

Brett Cannon writes:
Ok. I'm now enlightened as to a viable approach. Thanks. Without a doubt you all are much more familiar with this stuff than I am. (In fact I'm a rank novice.) So I'd be grateful if someone would post code for a function, say compare_src_obj(python_src, python_obj), that takes two strings -- a Python source filename and a Python compiled-object filename -- does what's outlined above, and returns a status which indicates whether they are the same and, if not, whether the difference is because the wrong version of Python was used. (The use case given was a little more involved than this, because one needs to make sure a Python interpreter is found for the same version as the given compiled object; but to start out I'd be happy to skip that complexity, if it's too much trouble.)

rocky@gnu.org wrote:
Interesting question. For equality, I would start with, just guessing a bit: marshal.dumps(compile(open('file.py').read(), 'file.py', 'exec')) == open('file.pyc', 'rb').read(). Specifically for version clash, I believe the first 4 bytes are a magic version number. If that is not part of the marshal string, it would need to be skipped for the equality comparison.

Terry Reedy writes:
There's also the mtime, mentioned in prior posts, that needs to be ignored. And is there a table which converts a magic number back into a string with the Python version number? Thanks.

On Wed, Feb 4, 2009 at 02:18, Arnaud Delobelle <arnodel@googlemail.com> wrote:
The other option to see how all of this works is importlib as found in the py3k branch. That's in pure Python so it's easier to follow. -Brett

Brett Cannon writes:
Sorry for the delayed response; I finally had a chance to check out the py3k code and look. Perhaps I'm missing something. Although there is some really cool, well-written and neat Python code there (and some of the private methods there seem to me like they should be public and somewhere else, perhaps in os or os.path), I don't see a table mapping magic numbers to a string containing a Python version, as you would find when running "python -V", and that's what was kind of asked for. As Arnaud mentioned, Python/import.c has this magic-number mapping in comments near the top of the file. Of course one could take those comments and turn them into a dictionary, but I was hoping Python had such a dictionary/function built in already, since it needs to be maintained along with changes to the magic number. Thanks.

On Thu, Feb 5, 2009 at 19:38, <rocky@gnu.org> wrote:
Still working on exposing the API.
perhaps in os or os.path),
Nothing in that module belongs in os.
Sorry, misread the email. Python/import.c is the right place then.
It actually doesn't need to be maintained. If the magic number doesn't match from a .pyc then it needs to be regenerated, period. We do not try to see if the magic number can be different yet compatible with the running interpreter. And as for changing it, it is simply a specific increment along with committing the file. The magic number history is documented in that file "just in case". -Brett

Brett Cannon writes:
There's probably some confusion as to what I was referring to, or what I took you to mean when you mentioned importlib. I took that to mean the files in the directory "importlib". At any rate, that's what I looked at. One of the files is _bootstrap.py, which has:

def _path_join(*args):
    """Replacement for os.path.join."""
    return path_sep.join(x[:-len(path_sep)] if x.endswith(path_sep) else x
                         for x in args)

def _path_exists(path):
    """Replacement for os.path.exists."""
    try:
        _os.stat(path)
    except OSError:
        return False
    else:
        return True

def _path_is_mode_type(path, mode):
    """Test whether the path is the specified mode type."""
    try:
        stat_info = _os.stat(path)
    except OSError:
        return False
    return (stat_info.st_mode & 0o170000) == mode

For _path_join, posixpath.py has something similar, and perhaps even the same functionality, although it's different code. _path_is_mode_type doesn't exist in posixpath.py. _path_exists seems to be almost a duplicate of lexists, which uses lstat instead of _os.stat.
I meant the mapping between a magic number and the version it represents. For a use case, recall again what the problem is: you are given compiled Python code and a file that purports to be the source, and want to verify it. The current proposal (with its current weaknesses) requires getting the compiler the same. When that's not the same, one could say "sorry, Python compiler version mismatch -- go figure it out", but more helpful would be to indicate that you compiled with version X (as a string rather than a magic number) and the compiled code was built with version Y. This means the source might be the same; we just don't really know.

On Fri, Feb 6, 2009 at 10:58, <rocky@gnu.org> wrote:
No, that's right.
All of that code is duplicated, most of it copy-and-paste, from the os module or its helper modules. The only reason it is there is bootstrapping: that code will be used as the implementation of import, so the module can't rely on non-builtin modules.
I still don't see the benefit of knowing what version of Python a magic number matches to. So I know some bytecode was compiled by Python 2.5 while I am running Python 2.6. What benefit do I derive from knowing that compared to just knowing that it was not compiled by Python 2.6? I mean are you ultimately planning on launching a different interpreter based on what generated the bytecode? -Brett

Yep. It's not uncommon for me to have several versions of Python available. It so happens that the computer this email is being sent from has at least 9 versions, possibly more, because I didn't check if python.new and python.old are among those other 9. (I don't maintain this box, but pay someone else to; clearly this is a pathological case, but it's kind of interesting to me that there are more than 9 versions installed and I did not contrive this case.)
If there's a mismatch in the first place, it means there's confusion on someone's part. Don't you want to foster development of programs that try to minimize confusion? A subsidiary effect of a magic-to-version-string mapping not being readily available, in situations where it would be helpful, is back-and-forth dialog in bug reports: telling folks how to get the version number, because it's not in the error message, because it's not readily available to a programmer. Never underestimate the level of users, especially if you are working on something like a debugger. If we hope that someone's going to know about and read that comment in the C file, turn it into a dictionary, and maintain it anytime the magic number gets updated, it's probably not going to happen often. Again, although I see specific uses in a debugger, this is really an issue regarding code tools or programs that deal with Python code. I know there's a disassembler, but you mean there isn't a dump tool which shows all of the information inside a compiled object, including a disassembly, the Python version the program was compiled with, the mtime in human-readable format, and whatnot?

On Fri, Feb 6, 2009 at 12:10, <rocky@gnu.org> wrote:
Come on, that is such a baiting question. You view adding a dict of the versions as a way to help deal with confusion in a case where someone actually cares about which version of bytecode is used. I view it as another API someone is going to have to maintain for a use case I do not see as justifying that maintenance. Bytecode is purely a performance benefit, nothing more. This is why we so readily reconstruct it. Heck, in Python 3.0 the __file__ attribute *always* points to the .py file even if the .pyc was used for the load.
I have never had a bug report come in where I had to care about the magic number of a .pyc file.
Never underestimate the level of users, especially if you are working on something like a debugger.
I don't, else I would not be a Python developer. But not underestimating users also means that if you need to worry about something like which version of Python generates which magic number, then you can look at Python/compile.c just as easily, without me adding some code that has to be maintained.
Nope, it probably won't, and honestly I am fine with that.
Just so there is no confusion: a .pyc is not a compiled object, but a file containing a magic number, mtime, and a marshaled code object. And no, there is nothing in the standard library that dumps all of this data out for a .pyc. This is somewhat on purpose as we make no hard guarantees we won't change the format of .pyc files at some point (see the python-ideas list for a discussion about changing the format to allowing a variable amount of metadata). -Brett

Brett Cannon writes:
Thanks. Alas, I can't see how in practice this will be generally useful. Again, here is the problem: I have some sort of compiled Python file and something which I think is the source code for it. I want to verify that it is. (In a debugger it means we can warn that what you are seeing is not what's being run. However, I do not believe this is the only situation where getting the answer to this question is helpful/important.) The solution above is very sensitive to knowing the name of the file (files?) used in compilation, because those are stored in the co_filename portion of the code object. For example, if what's stored in that field is 'foo.py' but I compile with the name './foo.py' or some other equivalent name, then I get a false mismatch. Worse, as we've seen before when dealing with zipped eggs, the name stored in co_filename is a somewhat temporary location, and something very few people are going to guess or recognize as the location of where they think the file originated. What seems to me to be a weakness of this approach is that it requires that you get two additional pieces of information correct that really are irrelevant from the standpoint of the problem: the name of the file and the version of Python used in the compilation process. I just care about the source text. As I write this I can't help but be amused, because before, when I asked on python-dev about how I could get more accurate file names in co_filename (for zipped eggs), the answer invariably offered was something along the lines of "why not use the source text?"

Terry Reedy writes:
Alas, I suspect going down this path will lead to plugging more and more leaks. It is not just co_filename that might need ignoring, but possibly artifacts from __file__ as well. When I run this Python program:

import marshal
print marshal.dumps(compile(open(__file__).read(), __file__, 'exec'))

which I store in "/tmp/foo.py", and I look at the output, I see the string "/tmp/foo.py". No doubt this comes from wherever the value of __file__ is stored, which seems to be computed at compile time. So one would probably need to ignore the values of __file__ variables, wherever that is stored. That said, if someone can write such a program I'd appreciate it.

participants (5)
- Arnaud Delobelle
- Brett Cannon
- Raymond Hettinger
- rocky@gnu.org
- Terry Reedy