[Python-ideas] AST Hash

Thu Sep 12 07:12:23 CEST 2013

From: anatoly techtonik <techtonik at gmail.com>

Sent: Wednesday, September 11, 2013 9:05 AM

> We need a checksum for code pieces. The goal of the checksum is to
> reliably detect pieces of code with absolutely identical behaviour.
> Borders of such checksum can be functions, classes, modules,.

Cool idea.

But why not also fragments? A single expression, statement, or suite could be useful in many contexts without having to artificially wrap it in a function, couldn't it?

> Practical application for such checksums are:
> 
> - detecting usage of recipes and examples across PyPI packages
> - detecting usage of standard stdlib calls
> - creating execution safe serialization formats for data
>    - choosing class to deserialize data fields of the object based on its hash
> - enable consistent validation and testing of results across various AST tools
> 
> There can be two approaches to build such checksum:
> 1. Code Section Hash
> 2. AST Hash

I'm not sure either of these is right.

You want to treat two functions as equal if they have renamed locals:

    def f1():
        i = 0
        return i
    def f2():
        j = 0
        return j

But I'm guessing you _don't_ want to treat them as equal if they reference different globals:

    i, j = 0, 1
    def f1():
        return i
    def f2():
        return j

If it's not obvious why you want those two to be different, consider this identical case:

    def f1():
        return os.open('foo')
    def f2():
        return zipfile.open('foo')

But the difference in the ASTs in this case looks identical to the difference in the local-renaming case (every Name node with id i/os has it changed to j/zipfile). Unless you implement the exact same logic as the compiler to distinguish local and global names, there's no way to allow local renaming but not global renaming.

And it gets even worse if you consider closures; two functions could have identical ASTs but different meanings, and even applying the compiler's name-distinguishing logic doesn't help, unless you also look at the context around the definition.

But there's an obvious answer here that takes care of all of this: Just has the compiled code objects. I'm not sure _exactly_ which attributes you want, but something like co_code, co_flags, co_consts, co_names, co_freevars, and maybe the ones related to the parameters.

I don't think you need anything from the function, class, or module that owns the code. (Yes, two functions with identical __code__ (including co_names and co_freevars) but different __globals__ or __closure__ will act differently… but I don't think there's any reasonable rule you could apply except either (a) ignore them, or (b) raise an exception if f.__globals__ != globals() or f.__closure__. Besides, the functions will also act differently if you just change the values of global variables. So I think just ignore them.)

Anyway, all of these attributes are easy to hash: one's a bytes, one's a fixed-size int, and the rest are tuples of strings.

There are some obvious downsides to this, but I don't think any of them are too serious. For example, you can't hash anything that doesn't both parse and compile, while you can build an AST from code that just parses. But practically, there aren't too many good examples of things that parse but won't compile; if you really want to be able to hash invalid code, you really need to stick with source.

But I'm sure someone will come up with something big and obvious that I'm missing.