<div dir="ltr">2013/9/11 anatoly techtonik <span dir="ltr"><<a href="mailto:techtonik@gmail.com" target="_blank">techtonik@gmail.com</a>></span><br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Hi,<br>

<br>

We need a checksum for code pieces. The goal of the checksum is to<br>

reliably detect pieces of code with absolutely identical behaviour.<br>

The boundaries of such a checksum can be functions, classes, or modules.<br></blockquote><div><br></div><div>This looks like a nice project; I think it should first take the form of an external package.</div><div>I'm sure there are many details to iron out before this kind of technique can be widely adopted.</div>

<div><br></div><div>For example:</div><div>- Is there only one kind of hash? You suggested erasing the differences in variable names; are there other possible customizations?<br></div><div>- To detect common patterns, would it be useful to hash and index all the nodes of an AST?</div>

<div>- Is there a central repository to store hashes of recipes? Is Google Search enough?</div><div><br></div><div>I don't need answers, only a reference implementation that people can discuss!<br></div><div><br></div>

<div>Good luck,</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Practical applications of such checksums are:<br>

<br>

 - detecting usage of recipes and examples across PyPI packages<br>

 - detecting usage of stdlib calls<br>

 - creating execution-safe serialization formats for data<br>

   - choosing the class used to deserialize an object's data fields based on its hash<br>

 - enabling consistent validation and testing of results across various AST tools<br>

<br>

There can be two approaches to build such checksum:<br>

1. Code Section Hash<br>

2. AST Hash<br>

<br>

Code Section Hash is built from a substring of the source code, cut on<br>

function or class boundaries. This hash is flaky: whitespace and<br>

comment differences ruin it, even when behaviour (and bytecode) stays<br>

the same. It is possible to reduce the effect of whitespace and<br>

comment changes by normalizing the substring: dedenting, reindenting<br>

with 4 spaces, stripping empty lines, comments and trailing<br>

whitespace. Even so, it will still be unreliable and affected by whitespace<br>

changes in the middle of the string. Therefore the second way of hashing is<br>

preferable.<br>
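<br>
A minimal sketch of the normalization described above, assuming SHA-256 as the digest. The names normalize_source and code_section_hash are hypothetical, and the comment handling is deliberately naive (whole-line comments only), so this is an illustration rather than a reference implementation:<br>
<pre>
```python
import hashlib
import textwrap


def normalize_source(source):
    """Dedent, reindent with 4 spaces, drop blank and comment-only lines."""
    lines = []
    indent_stack = [0]  # indentation widths of the currently open blocks
    for raw in textwrap.dedent(source).expandtabs(8).splitlines():
        line = raw.rstrip()              # strip trailing whitespace
        stripped = line.lstrip()
        # Naive: trailing comments and '#' inside strings are not handled.
        if not stripped or stripped.startswith("#"):
            continue
        width = len(line) - len(stripped)
        while indent_stack[-1] > width:  # leaving one or more blocks
            indent_stack.pop()
        if width > indent_stack[-1]:     # entering a deeper block
            indent_stack.append(width)
        lines.append("    " * (len(indent_stack) - 1) + stripped)
    return "\n".join(lines)


def code_section_hash(source):
    """SHA-256 over the normalized text of one function or class."""
    normalized = normalize_source(source)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


tidy = "def add(a, b):\n    # sum\n    return a + b\n"
messy = "def add(a, b):   \n\n  return a + b  \n"
dense = "def add(a, b):\n    return a+b\n"

# Layout-only differences: indentation width, blank line, comment.
print(code_section_hash(tidy) == code_section_hash(messy))
# Spacing inside a line: same behaviour, different text.
print(code_section_hash(tidy) == code_section_hash(dense))
```
</pre>
The first comparison prints True and the second prints False: normalization absorbs indentation, blank lines and comment-only lines, but a spacing change inside an expression still alters the hash.<br>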

<br>

The AST Hash is built on the AST. This excludes comments, whitespace, etc.<br>

and makes the hash strict and reliable. This is the canonical Default<br>

AST Hash.<br>
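<br>
As a rough sketch of what a Default AST Hash could look like, assuming the digest is SHA-256 over the output of ast.dump (the function name default_ast_hash is hypothetical). Note that the dump format can differ between Python versions, so hashes built this way are only comparable within one version:<br>
<pre>
```python
import ast
import hashlib


def default_ast_hash(source):
    """SHA-256 over a textual dump of the parsed AST."""
    tree = ast.parse(source)
    # Line and column attributes are left out of the dump, so layout,
    # comments and redundant parentheses do not influence the digest.
    dumped = ast.dump(tree, annotate_fields=True, include_attributes=False)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


plain = "def f(a):\n    return a + 1\n"
noisy = "def f(a):  # increment\n\n    return (a+1)\n"
renamed = "def f(b):\n    return b + 1\n"

print(default_ast_hash(plain) == default_ast_hash(noisy))
print(default_ast_hash(plain) == default_ast_hash(renamed))
```
</pre>
The first comparison prints True, since comments, spacing and parentheses never reach the AST. The second prints False, because identifier names are part of the tree.<br>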

<br>

There are cases when the Default AST Hash may not be enough for<br>

comparison. For example, if local variables are renamed or docstrings<br>

are changed, the behaviour of a function may not change, but its AST hash<br>

will. In these cases additional normalization rules apply, such as<br>

renaming all local variables to var1, var2, ... in order of<br>

appearance, stripping docstrings, etc. Every set of such normalization<br>

rules should have a name. This will also be the name of the resulting<br>

custom AST Hash.<br>
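<br>
The two rules mentioned above could be sketched as AST transformers grouped into named rule sets. Everything here is an assumption made for illustration: the rule-set names ("default", "nodoc", "nodoc-renamed"), the class and function names, the use of SHA-256, and the simplified definition of a "local" name (an argument or an assignment target):<br>
<pre>
```python
import ast
import hashlib


class StripDocstrings(ast.NodeTransformer):
    """Rule: drop the docstring of every module, class and function."""

    def _strip(self, node):
        self.generic_visit(node)
        if ast.get_docstring(node, clean=False) is not None:
            # Keep the body non-empty so the tree stays valid.
            node.body = node.body[1:] or [ast.Pass()]
        return node

    visit_Module = visit_ClassDef = _strip
    visit_FunctionDef = visit_AsyncFunctionDef = _strip


class _Renamer(ast.NodeTransformer):
    """Rename the given names to var1, var2, ... in visiting order."""

    def __init__(self, local_names):
        self.local_names = local_names
        self.mapping = {}

    def _rename(self, old):
        if old not in self.local_names:
            return old  # globals and builtins stay untouched
        if old not in self.mapping:
            self.mapping[old] = "var%d" % (len(self.mapping) + 1)
        return self.mapping[old]

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node


class RenameLocals(ast.NodeTransformer):
    """Rule: rename arguments and assigned names inside each function."""

    def visit_FunctionDef(self, node):
        # Simplification: a name counts as local if it is an argument or an
        # assignment target anywhere in the function. Nested scopes and
        # global/nonlocal declarations are ignored.
        local_names = set()
        for sub in ast.walk(node):
            if isinstance(sub, ast.arg):
                local_names.add(sub.arg)
            elif isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store):
                local_names.add(sub.id)
        _Renamer(local_names).generic_visit(node)
        return node


# Each named rule set fixes which transformations run, and in what order.
RULE_SETS = {
    "default": [],
    "nodoc": [StripDocstrings],
    "nodoc-renamed": [StripDocstrings, RenameLocals],
}


def named_ast_hash(source, rules="default"):
    """Return 'rule-set-name:digest' so different hashes are never mixed up."""
    tree = ast.parse(source)
    for rule in RULE_SETS[rules]:
        tree = rule().visit(tree)
    digest = hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
    return rules + ":" + digest


documented = (
    "def area(width, height):\n"
    '    """Rectangle area."""\n'
    "    result = width * height\n"
    "    return result\n"
)
terse = "def area(w, h):\n    total = w * h\n    return total\n"

print(named_ast_hash(documented) == named_ast_hash(terse))
print(named_ast_hash(documented, "nodoc-renamed")
      == named_ast_hash(terse, "nodoc-renamed"))
```
</pre>
The first comparison prints False and the second prints True. Prefixing the digest with the rule-set name is one possible way to keep hashes produced under different rules from being compared by accident.<br>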

<br>

Explicit naming of AST Hashes and hardlinking of names to the rules that<br>

are used to build them will establish common ground for AST tool<br>

interoperability and research papers. As such, it will most likely require<br>

a separate PEP.<br>

--<br>

anatoly t.<br>

_______________________________________________<br>

Python-ideas mailing list<br>

<a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/python-ideas" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>Amaury Forgeot d'Arc

</div></div>