[Python-ideas] AST Hash

Jason Bursey kn0m0n3 at gmail.com
Wed Sep 11 20:49:48 CEST 2013

Can you teach a child to write algorithms like Peter Shor?

On Wed, Sep 11, 2013 at 11:05 AM, anatoly techtonik <techtonik at gmail.com> wrote:

> Hi,
> We need a checksum for pieces of code. The goal of the checksum is to
> reliably detect pieces of code with absolutely identical behaviour.
> The boundaries of such a checksum can be functions, classes, or modules.
> Practical applications for such checksums are:
>  - detecting usage of recipes and examples across PyPI packages
>  - detecting usage of stdlib calls
>  - creating execution-safe serialization formats for data
>    - choosing the class used to deserialize an object's data fields
>      based on its hash
>  - enabling consistent validation and testing of results across
>    various AST tools
> There are two approaches to building such a checksum:
> 1. Code Section Hash
> 2. AST Hash
> Code Section Hash is built from a substring of the source code, cut on
> function or class boundaries. This hash is flaky: whitespace and
> comment differences ruin it, even when behaviour (and bytecode) stays
> the same. It is possible to reduce the effect of whitespace and
> comment changes by normalizing the substring: dedenting, reindenting
> with 4 spaces, and stripping empty lines, comments and trailing
> whitespace. Even then it remains unreliable, because whitespace
> changes in the middle of the string still affect it. Therefore the
> second approach is preferable.
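
To make the weakness concrete, here is a minimal Python sketch of a Code
Section Hash with the normalization described above. The function name
code_section_hash and the choice of SHA-256 are my assumptions, not part
of the proposal, and the comment stripping is deliberately naive.

```python
import hashlib
import textwrap


def code_section_hash(source: str) -> str:
    """Hash a source snippet after crude whitespace/comment normalization."""
    lines = []
    for line in textwrap.dedent(source).expandtabs(4).splitlines():
        # Naive comment stripping: this would also cut a '#' that sits
        # inside a string literal, one more reason the approach is fragile.
        line = line.split("#", 1)[0].rstrip()
        if line:  # drop empty lines
            lines.append(line)
    normalized = "\n".join(lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


a = "def add(x, y):\n    return x + y  # sum\n"
b = "def add(x, y):\n\n    return x + y\n"
c = "def add(x, y):\n    return x+y\n"

# Comment and blank-line differences are normalized away.
print(code_section_hash(a) == code_section_hash(b))
# Whitespace inside a line still changes the hash.
print(code_section_hash(a) == code_section_hash(c))
```

Snippets a and b hash the same, while c does not, even though all three
compile to the same behaviour.
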
> AST Hash is built on the AST. This excludes comments, whitespace, etc.
> and makes the hash strict and reliable. This is the canonical Default
> AST Hash.
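
A Default AST Hash could be sketched with the stdlib ast and hashlib
modules as below. The name default_ast_hash and the use of SHA-256 over
ast.dump() output are assumptions for illustration. Note that the text
produced by ast.dump() differs between Python versions, so hashes built
this way are only comparable within one version.

```python
import ast
import hashlib


def default_ast_hash(source: str) -> str:
    """Hash the parsed AST of a snippet, ignoring layout and comments."""
    tree = ast.parse(source)
    # include_attributes=False (the default) leaves out line and column
    # numbers, so source layout does not leak into the dump.
    dumped = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


a = "def add(x, y):\n    return x + y  # sum\n"
c = "def add(x,y):\n\n    return x+y\n"
d = "def add(a, b):\n    return a + b\n"

# Whitespace and comments do not matter any more.
print(default_ast_hash(a) == default_ast_hash(c))
# Renamed variables still produce a different hash.
print(default_ast_hash(a) == default_ast_hash(d))
```
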
> There are cases when the Default AST Hash may not be enough for
> comparison. For example, if local variables are renamed or docstrings
> are changed, the behaviour of a function may not change, but its AST
> hash will. In these cases additional normalization rules apply, such
> as renaming all local variables to var1, var2, ... in order of
> appearance, stripping docstrings, etc. Every set of such normalization
> rules should have a name. This will also be the name of the resulting
> custom AST Hash.
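
The two normalization rules mentioned above might look roughly like this.
It is a simplified sketch under several assumptions: only top-level
functions are renamed, nested scopes and global/nonlocal declarations
are ignored, parameters are treated as locals (which changes behaviour
for callers that pass keyword arguments), and all helper names are
hypothetical.

```python
import ast
import hashlib


class _LocalNames(ast.NodeVisitor):
    """Collect parameter and assigned names in traversal order."""

    def __init__(self):
        self.mapping = {}

    def _add(self, name):
        self.mapping.setdefault(name, f"var{len(self.mapping) + 1}")

    def visit_arg(self, node):
        self._add(node.arg)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self._add(node.id)


class _Renamer(ast.NodeTransformer):
    """Rename every collected name; globals and builtins stay untouched."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def _strip_docstring(node):
    body = node.body
    if (body and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)):
        node.body = body[1:] or [ast.Pass()]


def normalized_ast_hash(source: str) -> str:
    tree = ast.parse(source)
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes):
            _strip_docstring(node)
    for node in tree.body:  # top-level functions only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = _LocalNames()
            names.visit(node)
            _Renamer(names.mapping).visit(node)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


a = ('def area(width, height):\n    """Return the area."""\n'
     '    result = width * height\n    return result\n')
b = 'def area(w, h):\n    r = w * h\n    return r\n'
c = 'def area(w, h):\n    r = w + h\n    return r\n'

print(normalized_ast_hash(a) == normalized_ast_hash(b))  # same behaviour
print(normalized_ast_hash(a) == normalized_ast_hash(c))  # different operator
```
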
> Explicit naming of AST Hashes, and hard-linking each name to the rules
> used to build it, will establish common ground for AST tool
> interoperability and research papers. As such, it will most likely
> require a separate PEP.
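
One possible way to tie names to rule sets is a small registry, with the
scheme name carried in the hash string so that tools can tell which
rules produced it. The scheme names "default" and "nodoc", the SCHEMES
dict and the "name:digest" format are all made up for this sketch and
are not defined anywhere in the proposal.

```python
import ast
import hashlib
from typing import Callable, Dict, List

Rule = Callable[[ast.AST], ast.AST]


def strip_docstrings(tree: ast.AST) -> ast.AST:
    """Remove leading docstrings from modules, classes and functions."""
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, scopes):
            continue
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:] or [ast.Pass()]
    return tree


# Hypothetical registry: scheme name -> ordered list of normalization rules.
SCHEMES: Dict[str, List[Rule]] = {
    "default": [],
    "nodoc": [strip_docstrings],
}


def named_ast_hash(source: str, scheme: str = "default") -> str:
    tree = ast.parse(source)
    for rule in SCHEMES[scheme]:
        tree = rule(tree)
    digest = hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
    return f"{scheme}:{digest}"


with_doc = 'def f():\n    """Doc."""\n    return 1\n'
without_doc = 'def f():\n    return 1\n'

print(named_ast_hash(with_doc, "nodoc") == named_ast_hash(without_doc, "nodoc"))
print(named_ast_hash(with_doc) == named_ast_hash(without_doc))
```
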
> --
> anatoly t.