[Python-ideas] AST Hash

anatoly techtonik techtonik at gmail.com
Sat Sep 28 12:30:35 CEST 2013


On Wed, Sep 11, 2013 at 11:53 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> On 11.09.2013 18:05, anatoly techtonik wrote:
>> Hi,
>>
>> We need a checksum for code pieces. The goal of the checksum is to
>> reliably detect pieces of code with absolutely identical behaviour.
>> Borders of such checksums can be functions, classes, or modules.
>> Practical applications for such checksums are:
>>
>>  - detecting usage of recipes and examples across PyPI packages
>>  - detecting usage of stdlib calls
>>  - creating execution-safe serialization formats for data
>>    - choosing the class to deserialize data fields of an object
>>      based on its hash
>>  - enabling consistent validation and testing of results across
>>    various AST tools
>>
>> There can be two approaches to build such checksum:
>> 1. Code Section Hash
>> 2. AST Hash
>>
>> A Code Section Hash is built from a substring of the source code, cut
>> on function or class boundaries. This hash is flaky - whitespace and
>> comment differences ruin it, even when the behaviour (and bytecode)
>> stays the same. The effect of whitespace and comment changes can be
>> reduced by normalizing the substring - dedenting, reindenting with 4
>> spaces, stripping empty lines, comments and trailing whitespace. Even
>> then it remains unreliable, still affected by whitespace changes in
>> the middle of a line. Therefore the second way of hashing is
>> preferable.
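A minimal sketch of the normalization just described, using only the
standard library (the exact rule set is not pinned down above, so the
choices here are illustrative):

```python
import hashlib
import textwrap

def code_section_hash(source):
    """Hash a source snippet after whitespace/comment normalization.

    Sketch only: dedent, then drop blank lines, full-line comments and
    trailing whitespace. As noted above, it is still fooled by
    whitespace changes inside a line.
    """
    lines = textwrap.dedent(source).splitlines()
    kept = []
    for line in lines:
        line = line.rstrip()
        # Skip empty lines and full-line comments.
        if not line or line.lstrip().startswith('#'):
            continue
        kept.append(line)
    normalized = '\n'.join(kept)
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

# Same behaviour, different indentation and comments:
a = """
    def f(x):
        # a comment
        return x + 1
"""
b = "def f(x):\n    return x + 1\n"
```

Here `a` and `b` hash identically, while an extra space inside a line
would still change the result.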
>>
>> An AST Hash is built on the AST. This excludes any comments,
>> whitespace, etc. and makes the hash strict and reliable. This is the
>> canonical Default AST Hash.
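On a modern Python 3, a Default AST Hash can be sketched in a few
lines with the stdlib `ast` and `hashlib` modules (hashing the
canonical `ast.dump()` string is one obvious serialization choice, not
the only one):

```python
import ast
import hashlib

def default_ast_hash(source):
    # Comments and formatting never reach the AST, so they cannot
    # affect the hash; only the parsed structure is fed to SHA-256.
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode('utf-8')).hexdigest()
```

With this, `default_ast_hash("x=1+2  # note")` equals
`default_ast_hash("x = 1 + 2")`, since both parse to the same tree.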
>>
>> There are cases when the Default AST Hash may not be enough for
>> comparison. For example, if local variables are renamed or
>> docstrings are changed, the behaviour of a function may not change,
>> but its AST hash will. In these cases additional normalization rules
>> apply, such as renaming all local variables to var1, var2, ... in
>> order of appearance, stripping docstrings, etc. Every set of such
>> normalization rules should have a name, which will also be the name
>> of the resulting custom AST Hash.
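A sketch of one such normalization rule (variable renaming) as an
`ast.NodeTransformer` pass before hashing. A real rule set would track
scopes and leave globals and builtins alone, and would also strip
docstrings; this sketch renames all names uniformly, for illustration:

```python
import ast
import hashlib

class _RenameNames(ast.NodeTransformer):
    """Rename every Name to var1, var2, ... in order of first
    appearance. Illustrative only: no scope analysis is done."""
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        new = self.names.setdefault(node.id, 'var%d' % (len(self.names) + 1))
        return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

def normalized_ast_hash(source):
    # Parse, apply the normalization pass, then hash the dump.
    tree = _RenameNames().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode('utf-8')).hexdigest()
```

Under this rule, `a = 1; b = a + 1` and `x = 1; y = x + 1` produce the
same hash.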
>>
>> Explicit naming of AST Hashes, and binding those names to the rules
>> used to build them, will settle common ground (a base) for AST tool
>> interoperability and research papers. As such, it will most likely
>> require a separate PEP.
>
> You might want to have a look at this paper, which discusses
> AST compression (for Java, but the ideas apply to Python just
> as well):
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5917&rep=rep1&type=pdf
>
> If you compress the AST into a string and take its hash,
> you should pretty much have what you want.

Thanks for the link. The idea of transforming the AST to a string is
obvious - I don't know of any way to build a hash other than feeding
some kind of binary string to a hash function. But those guys
addressed a different problem - bytecode is harder to compress than an
AST, and I agree, because it is easier to analyse common patterns in
an AST and tune the compression algorithm accordingly. Structure and
dependencies in binary data matter, and binary compression algorithms
are usually dumb. Google improved bsdiff compression a lot for
executables by making it aware of the binary structure.

"..compressed AST provide ..., platform independence" - I thought that
Java byte code platform independent.

If I could write a paper for every idea that I have (or at least draw
a diagram), I would be president of an academy of sciences already. =)
Anyway, it is interesting reading. Unfortunately, there is not much
time for that. I am not sure their implementation can be adopted as a
prototype for a hash implementation; it seems a simple tree walker
should do the job.
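Such a tree walker can be sketched with stdlib `ast.walk`, feeding
node types and scalar fields into the hash incrementally instead of
dumping the whole tree to one string first (the choice of which fields
to feed in is illustrative, and `ast.walk`'s breadth-first order
flattens some structure, so this is a simplification):

```python
import ast
import hashlib

def walker_hash(source):
    # Walk the tree, updating the hash with each node's type name and
    # its scalar fields (identifiers, constants); child nodes are
    # covered by later walk steps.
    h = hashlib.sha256()
    for node in ast.walk(ast.parse(source)):
        h.update(type(node).__name__.encode('utf-8'))
        for field, value in ast.iter_fields(node):
            if isinstance(value, (str, int, float)):
                h.update(('%s=%r' % (field, value)).encode('utf-8'))
    return h.hexdigest()
```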
--
anatoly t.
