[Python-ideas] AST Hash

anatoly techtonik techtonik at gmail.com
Wed Sep 11 18:05:22 CEST 2013


We need a checksum for code pieces. The goal of the checksum is to
reliably detect pieces of code with absolutely identical behaviour.
Borders of such checksum can be functions, classes, modules,.
Practical application for such checksums are:

 - detecting usage of recipes and examples across PyPI packages
 - detecting usage of standard stdlib calls
 - creating execution safe serialization formats for data
   - choosing class to deserialize data fields of the object based on its hash
 - enable consistent validation and testing of results across various AST tools

There can be two approaches to build such checksum:
1. Code Section Hash
2. AST Hash

Code Section Hash is built from a substring of a source code, cut on
function or class boundaries. This hash is flaky - whitespace and
comment differences ruin it, even when behaviour (and bytecode) stays
the same. It is possible to reduce the effect of whitespace and
comment changes by normalizing the substring - dedenting, reindenting
with 4 spaces, stripping empty lines, comments and trailing
whitespace. And it still will be unreliable and affected by whitespace
changes in the middle of the string. Therefore a 2nd way of hashing is
more preferable.

AST Hash is build on AST. This excludes any comments, whitespace etc.
and makes the hash strict and reliable. This is a canonical Default
AST Hash.

There are cases when Default AST Hash may not be enough for
comparison. For example, if local variables are renamed, or docstrings
changed, the behaviour of a function may not change, but its AST hash
will. In these cases additional normalization rules apply. Such as
changing all local variable names to var1, var2, ... in order of
appearance, stripping docstrings etc. Every set of such normalization
rules should have a name. This will also be the name of resulting
custom AST Hash.

Explicit naming of AST Hashes and hardlinking of names to rules that
are used to build them will settle common ground (base) for AST tools
interoperability and research papers. As such, it most likely require
a separate PEP.
anatoly t.

More information about the Python-ideas mailing list