[Python-ideas] Add a cryptographic hash (e.g. SHA1) of source to Python Compiled objects?

Thu Feb 5 14:40:44 CET 2009

Brett Cannon writes:
 > On Wed, Feb 4, 2009 at 01:57,  <rocky at gnu.org> wrote:
 > > Terry Reedy writes:
 > >  > rocky at gnu.org wrote:
 > >  >
 > >  > > Without a doubt you all are much more familiar with this stuff than I
 > >  > > am. (In fact I'm a rank novice.) So I'd be grateful if someone would
 > >  > > post code for a function say:
 > >  > >
 > >  > >   compare_src_obj(python_src, python_obj)
 > >  > >
 > >  > > that takes two strings -- a Python source filename and a Python object
 > >  > > -- does what's outlined above, and returns a status indicating
 > >  > > whether they match and, if not, whether the mismatch is because the
 > >  > > wrong version of Python was used.
 > >  >
 > >  > Interesting question.  For equality, I would start with, just guessing
 > >  > a bit:
 > >  >
 > >  > marshal(compile(open(file.py).read())) == open(file.pyc).read()
 > >  >
 > >  > Specifically for version clash, I believe the first 4 bytes are a magic
 > >  > version number.  If that is not part of the marshal string, it would
 > >  > need to be skipped for the equality comparison.
 > >
 > > There's also the mtime, mentioned in prior posts, that needs to be
 > > ignored. And is there a table which converts a magic number version
 > > back into a string with the Python version number? Thanks.
 > 
 > marshal.dumps(compile(open('file.py').read(), 'file.py', 'exec')) ==
 > open('file.pyc').read()[8:]

Thanks. 

Alas, I can't see how in practice this will be generally useful.

Again, here is the problem: I have some sort of compiled Python file
and something which I think is the source code for it. I want to
verify that it is.

(In a debugger it means we can warn that what you are seeing is not
what's being run. However I do not believe this is the only situation
where getting the answer to this question is helpful/important.)

The solution above is very sensitive to knowing the name of the file
(files?) used in compilation because those are stored in the
co_filename portion of the code object. 

For example, if what's stored in that field is 'foo.py' but I compile
with the name './foo.py' or some other equivalent name, then I get a
false mismatch. Worse, as we've seen before when dealing with zipped
eggs, the name stored in co_filename is a somewhat temporary location
that very few people are going to guess or recognize as the place
where they think the file originated.
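
To make the false mismatch concrete, here is a minimal sketch. The
source string and the two file names are made up for illustration; the
only thing that differs between the two compilations is the name passed
to compile().

```python
import marshal

# Illustrative source text; any module body would do.
source = "x = 1\nprint(x)\n"

# Compile the same text twice under two equivalent file names.
code_a = compile(source, "foo.py", "exec")
code_b = compile(source, "./foo.py", "exec")

# The bytecode itself is identical...
same_bytecode = code_a.co_code == code_b.co_code

# ...but the marshalled objects differ, because co_filename is
# serialized along with everything else.
same_marshal = marshal.dumps(code_a) == marshal.dumps(code_b)

print(code_a.co_filename, code_b.co_filename)
print("same bytecode:", same_bytecode)
print("same marshal: ", same_marshal)
```

So a byte-for-byte comparison of marshalled code only works if I
already know the exact name that was used at compile time.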

What seems to me to be a weakness of this approach is that it requires
that you get two additional pieces of information correct that really
are irrelevant from the standpoint of the problem: the name of the
file and the version of Python used in the compilation process. I just
care about the source text.
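
As a rough sketch of what depending only on the source text could look
like: assume a SHA-1 digest of the source had been recorded somewhere at
compile time. The compiled-file layout discussed above does not record
one, and source_digest() and same_source() are hypothetical names used
only for illustration.

```python
import hashlib


def source_digest(source_bytes):
    """Return the SHA-1 hex digest of the raw source bytes."""
    return hashlib.sha1(source_bytes).hexdigest()


def same_source(source_bytes, recorded_digest):
    """Check candidate source text against a previously recorded digest.

    Assumption: recorded_digest was produced by source_digest() when the
    file was compiled and stored alongside the compiled object.
    """
    return source_digest(source_bytes) == recorded_digest


# Digest recorded "at compile time" (simulated here).
recorded = source_digest(b"x = 1\n")

print(same_source(b"x = 1\n", recorded))   # candidate matches
print(same_source(b"x = 2\n", recorded))   # candidate was edited
```

Neither the file name nor the Python version enters into this check.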

As I write this I can't help but be amused, because when I asked
earlier on python-dev how I could get more accurate file names in
co_filename (for zipped eggs), the answer invariably offered was
something along the lines of "why not use the source text?"