[Python-ideas] AST Hash

Wed Sep 11 19:05:52 CEST 2013

2013/9/11 anatoly techtonik <techtonik at gmail.com>

> Hi,
>
> We need a checksum for code pieces. The goal of the checksum is to
> reliably detect pieces of code with absolutely identical behaviour.
> Borders of such checksum can be functions, classes, modules,.
>

This looks like a nice project; I think this should first take the form of
an external package.
I'm sure there are many details to iron before this kind of technique can
be widely adopted.

For example:
- Is there only one kind of hash? you suggested to erase the differences in
variable names, are there other possible customizations?
- To detect common patterns, is it interesting to hash and index all the
nodes of an AST tree?
- Is there a central repository to store hashes of recipes? Is Google
Search enough?

I don't need answers, only a reference implementation that people can
discuss!

Good luck,

> Practical application for such checksums are:
>
>  - detecting usage of recipes and examples across PyPI packages
>  - detecting usage of standard stdlib calls
>  - creating execution safe serialization formats for data
>    - choosing class to deserialize data fields of the object based on its
> hash
>  - enable consistent validation and testing of results across various AST
> tools
>
> There can be two approaches to build such checksum:
> 1. Code Section Hash
> 2. AST Hash
>
> Code Section Hash is built from a substring of a source code, cut on
> function or class boundaries. This hash is flaky - whitespace and
> comment differences ruin it, even when behaviour (and bytecode) stays
> the same. It is possible to reduce the effect of whitespace and
> comment changes by normalizing the substring - dedenting, reindenting
> with 4 spaces, stripping empty lines, comments and trailing
> whitespace. And it still will be unreliable and affected by whitespace
> changes in the middle of the string. Therefore a 2nd way of hashing is
> more preferable.
>
> AST Hash is build on AST. This excludes any comments, whitespace etc.
> and makes the hash strict and reliable. This is a canonical Default
> AST Hash.
>
> There are cases when Default AST Hash may not be enough for
> comparison. For example, if local variables are renamed, or docstrings
> changed, the behaviour of a function may not change, but its AST hash
> will. In these cases additional normalization rules apply. Such as
> changing all local variable names to var1, var2, ... in order of
> appearance, stripping docstrings etc. Every set of such normalization
> rules should have a name. This will also be the name of resulting
> custom AST Hash.
>
> Explicit naming of AST Hashes and hardlinking of names to rules that
> are used to build them will settle common ground (base) for AST tools
> interoperability and research papers. As such, it most likely require
> a separate PEP.
> --
> anatoly t.
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
>

-- 
Amaury Forgeot d'Arc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20130911/d7bb45e4/attachment.html>