[Python-ideas] AST Hash

anatoly techtonik techtonik at gmail.com
Sat Sep 28 11:59:03 CEST 2013

On Wed, Sep 11, 2013 at 8:05 PM, Amaury Forgeot d'Arc
<amauryfa at gmail.com> wrote:
> 2013/9/11 anatoly techtonik <techtonik at gmail.com>
>> Hi,
>> We need a checksum for code pieces. The goal of the checksum is to
>> reliably detect pieces of code with absolutely identical behaviour.
>> Borders of such checksum can be functions, classes, modules.
> This looks like a nice project; I think this should first take the form of
> an external package.
> I'm sure there are many details to iron before this kind of technique can be
> widely adopted.

Yes, it is just an idea.

> For example:
> - Is there only one kind of hash? you suggested to erase the differences in
> variable names, are there other possible customizations?

Yes. There are different kinds of hashes depending on purpose, that's
why I explicitly mentioned that AST hashes are named. Every name
corresponds to a single purpose and to a single set of filtering rules. I
can see at least these possible customizations:

-- 1 comments, docstrings and whitespace handling --
1. preserve all whitespace including comments
2. preserve comments
3. standard: erase comments, preserve docstrings
4. erase docstrings in addition to comments

-- 2 variable names handling --
1. preserve all
2. preserve external
3. preserve stdlib names (stdlib needs to be described to detect
whether a namespace is from stdlib)
4. preserve thirdparty module names
5. preserve classes, rename variables
6. rename everything (abstract pattern matching)

Are stdlib detection ideas welcome?
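
To make the idea of named hashes more concrete, here is a minimal
sketch, not a reference implementation. It assumes the standard `ast`
and `hashlib` modules; the ruleset names, `ast_hash` and `named_hash`
are made up for illustration. Comments and whitespace never reach the
AST, so only docstring stripping and renaming need explicit rules. The
renaming covers only names bound inside the snippet (roughly the
"preserve external" rule), and the output of `ast.dump` can differ
between Python versions, so such hashes are only comparable within one
version.

```python
import ast
import hashlib

# Hypothetical ruleset names: each name maps to one set of filtering rules.
RULESETS = {
    "ast-strict": dict(strip_docstrings=False, rename_locals=False),
    "ast-nodoc": dict(strip_docstrings=True, rename_locals=False),
    "ast-abstract": dict(strip_docstrings=True, rename_locals=True),
}

_DOC_OWNERS = (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def ast_hash(source, strip_docstrings=True, rename_locals=True):
    """Hash the AST of `source` after applying the chosen filtering rules."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))

    if strip_docstrings:
        for node in nodes:
            if isinstance(node, _DOC_OWNERS):
                if ast.get_docstring(node, clean=False) is not None:
                    # Drop the docstring; keep the body non-empty.
                    node.body = node.body[1:] or [ast.Pass()]

    if rename_locals:
        # Names bound inside the snippet: parameters and assignment targets.
        bound = set()
        for node in nodes:
            if isinstance(node, ast.arg):
                bound.add(node.arg)
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                bound.add(node.id)
        # Rename them to v0, v1, ... in traversal order; external names
        # (builtins, imported modules) are left untouched.
        mapping = {}
        for node in ast.walk(tree):
            if isinstance(node, ast.arg) and node.arg in bound:
                node.arg = mapping.setdefault(node.arg, "v%d" % len(mapping))
            elif isinstance(node, ast.Name) and node.id in bound:
                node.id = mapping.setdefault(node.id, "v%d" % len(mapping))

    # ast.dump() omits line and column numbers by default.
    dump = ast.dump(tree)
    return hashlib.sha256(dump.encode("utf-8")).hexdigest()


def named_hash(name, source):
    """Prefix the digest with the ruleset name so hashes are not mixed up."""
    return name + ":" + ast_hash(source, **RULESETS[name])
```

With this sketch, two functions that differ only in comments,
docstring and local variable names get the same "ast-abstract" hash but
different "ast-nodoc" hashes. Function names and keyword arguments at
call sites are not renamed here, which a real implementation would have
to decide on.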

> - To detect common patterns, is it interesting to hash and index all the
> nodes of an AST tree?

I am not sure. I need these hashes for sharing and detecting updates
to code snippets contained in various .py files across various Python
projects. I like to think that snippets are constrained to function or
class boundaries, or else the management becomes rather tiresome.
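
A rough sketch of that use case, assuming snippets are top-level
functions and classes and using a plain hash of `ast.dump` with no
extra filtering; `snippet_hashes` and `changed_snippets` are
hypothetical helper names:

```python
import ast
import hashlib


def snippet_hashes(source):
    """Map each top-level function or class name to a hash of its AST."""
    tree = ast.parse(source)
    hashes = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            dump = ast.dump(node)  # no line numbers, so layout is ignored
            hashes[node.name] = hashlib.sha256(dump.encode("utf-8")).hexdigest()
    return hashes


def changed_snippets(old_source, new_source):
    """Names present in both sources whose AST hash differs."""
    old = snippet_hashes(old_source)
    new = snippet_hashes(new_source)
    return sorted(name for name in old.keys() & new.keys() if old[name] != new[name])
```

Comparing two copies of a file this way reports only the snippets
whose code actually changed, not those that were merely reformatted or
re-commented.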

> - Is there a central repository to store hashes of recipes? Is Google Search
> enough?

Google Search already indexes the hashes of each revision for Mercurial
repositories, so surely it can index these too. Maintaining and downloading
files and snippets by hash from PyPI would be interesting. It seems
that most cloud storage solutions use hashes for storage, so
implementing this should be even easier than installing a PyPI mirror.

> I don't need answers, only a reference implementation that people can
> discuss!

A reference implementation will take some time for sure. It may never
even be done, because things like
https://bitbucket.org/techtonik/python-stdlib/ have higher priority
and don't have sponsors.
anatoly t.
