[Python-ideas] AST Hash
mal at egenix.com
Wed Sep 11 22:53:39 CEST 2013
On 11.09.2013 18:05, anatoly techtonik wrote:
> We need a checksum for code pieces. The goal of the checksum is to
> reliably detect pieces of code with absolutely identical behaviour.
> Boundaries for such a checksum can be functions, classes, or modules.
> Practical applications of such checksums are:
> - detecting usage of recipes and examples across PyPI packages
> - detecting usage of standard stdlib calls
> - creating execution safe serialization formats for data
> - choosing class to deserialize data fields of the object based on its hash
> - enable consistent validation and testing of results across various AST tools
> There can be two approaches to build such checksum:
> 1. Code Section Hash
> 2. AST Hash
> Code Section Hash is built from a substring of a source code, cut on
> function or class boundaries. This hash is flaky - whitespace and
> comment differences ruin it, even when behaviour (and bytecode) stays
> the same. It is possible to reduce the effect of whitespace and
> comment changes by normalizing the substring - dedenting, reindenting
> with 4 spaces, stripping empty lines, comments and trailing
> whitespace. And it still will be unreliable and affected by whitespace
> changes in the middle of the string. Therefore the second way of
> hashing is preferable.
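
To make the fragility concrete, here is a minimal sketch of such a normalized
Code Section Hash. The function name code_section_hash and the choice of
SHA-256 are my assumptions, not part of the proposal, and the comment
stripping is deliberately naive. Reindenting with 4 spaces is left out.

```python
import hashlib
import textwrap


def code_section_hash(source: str) -> str:
    """Hash a source snippet after simple whitespace/comment normalization."""
    lines = []
    for line in textwrap.dedent(source).splitlines():
        # Naive comment stripping: this would also cut a '#' that sits
        # inside a string literal, so it is only good enough for a demo.
        line = line.split("#", 1)[0].rstrip()
        if line:  # drop empty (or comment-only) lines
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


a = "def f(x):\n    return x + 1  # increment\n"
b = "    def f(x):\n\n        return x + 1\n"   # indented, extra blank line
c = "def f(x):\n    return x+1\n"               # spacing changed inside a line

print(code_section_hash(a) == code_section_hash(b))  # normalization absorbs this
print(code_section_hash(a) == code_section_hash(c))  # but not this
```

Snippets a and b hash the same after normalization, while c does not, even
though all three compile to the same behaviour.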
> AST Hash is built on the AST. This excludes any comments, whitespace etc.
> and makes the hash strict and reliable. This is a canonical Default
> AST Hash.
> There are cases when Default AST Hash may not be enough for
> comparison. For example, if local variables are renamed, or docstrings
> changed, the behaviour of a function may not change, but its AST hash
> will. In these cases additional normalization rules apply, such as
> changing all local variable names to var1, var2, ... in order of
> appearance, stripping docstrings etc. Every set of such normalization
> rules should have a name. This will also be the name of resulting
> custom AST Hash.
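
A rough sketch of how named rule sets could look, using only the stdlib ast
module. The rule names "default" and "nodoc", the RULES table and the
function named_ast_hash are hypothetical, chosen for illustration. Only
docstring stripping is shown; renaming local variables would need scope
analysis and is not attempted here.

```python
import ast
import hashlib


class StripDocstrings(ast.NodeTransformer):
    """Hypothetical rule: drop docstrings from modules, classes, functions."""

    def _strip(self, node):
        self.generic_visit(node)  # handle nested definitions first
        if ast.get_docstring(node, clean=False) is not None:
            # Keep the body non-empty so the tree stays a valid program.
            node.body = node.body[1:] or [ast.Pass()]
        return node

    visit_Module = visit_ClassDef = _strip
    visit_FunctionDef = visit_AsyncFunctionDef = _strip


# Assumed registry: each hash name maps to the transformers it applies.
RULES = {"default": [], "nodoc": [StripDocstrings]}


def named_ast_hash(source: str, rule: str = "default") -> str:
    tree = ast.parse(source)
    for transformer in RULES[rule]:
        tree = transformer().visit(tree)
    # ast.dump() leaves out line/column attributes by default.
    digest = hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
    return f"{rule}:{digest}"  # prefix the result with the rule set's name


with_doc = 'def f(x):\n    """Add one."""\n    return x + 1\n'
without_doc = "def f(x):\n    return x + 1\n"

print(named_ast_hash(with_doc) == named_ast_hash(without_doc))
print(named_ast_hash(with_doc, "nodoc") == named_ast_hash(without_doc, "nodoc"))
```

Prefixing the digest with the rule name keeps hashes produced under different
rule sets from being compared by accident. Note that the output of ast.dump()
is not guaranteed to be identical across Python versions, so a real
specification would have to pin down the serialization as well.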
> Explicit naming of AST Hashes and hardlinking of names to rules that
> are used to build them will settle common ground (base) for AST tools
> interoperability and research papers. As such, it most likely requires
> a separate PEP.
You might want to have a look at this paper which discusses
AST compression (for Java, but the ideas apply to Python just
as well).
If you compress the AST into a string and take its hash,
you should pretty much have what you want.
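
As a minimal sketch of that idea in Python: the example below does not
implement the compression scheme from the paper. It simply uses ast.dump()
as the string form of the AST, and the function name ast_hash and the use of
SHA-256 are my own assumptions.

```python
import ast
import hashlib


def ast_hash(source: str) -> str:
    """Serialize the AST of `source` to a string and hash that string."""
    tree = ast.parse(source)
    # By default ast.dump() omits lineno/col_offset, so layout and comments
    # do not influence the result. annotate_fields=False gives a more
    # compact string.
    serialized = ast.dump(tree, annotate_fields=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


original = "def area(w, h):\n    return w * h\n"
reformatted = "def area( w, h ):   # width, height\n\n        return w*h\n"
renamed = "def area(a, b):\n    return a * b\n"

print(ast_hash(original) == ast_hash(reformatted))  # layout/comments ignored
print(ast_hash(original) == ast_hash(renamed))      # names still matter
```

Whitespace and comment changes leave the hash untouched, while renaming the
arguments changes it, which matches the behaviour described for the default
AST hash in the quoted text.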
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611