[Compiler-sig] AST observations

Eric C. Newton ecn@metaslash.com
Wed, 17 Apr 2002 23:02:24 -0400


I've noticed a few things with the current implementation of ASTs in
Python 2.2 from my work on PyChecker2.  These are just some notes.

compiler.visitor:

   ASTVisitor
        As the comment says, it's not a visitor, it's a walker.
        Someone mentioned earlier that this is rather confusing.

   There is this comment:

        "If the visitor method returns a true value, the
        ASTVisitor will not traverse the child nodes."

   I see no code which checks the return value.

   Performance:

        For me, the _cache mechanism actually slows down
        visitation. I found that pre-caching the method names in
        preorder() _is_ faster.

        Most of my dispatching uses the default dispatcher; the
        getChildNodes() method, along with compiler.ast.flatten
        and compiler.ast.flatten_nodes are significant overheads.

        All that said, I re-wrote it using a number of techniques:

           removed the ability to add variable arguments to the walk
                (no improvement)

           fewer transformations from lists to tuples (minor improv.)

           smarter construction of lists (minor improv.)

           custom getChildNodes() for each class to eliminate
           calls to flatten() (minor improv.)

           pass a visit() function to a visitChildren() method (slower!)

           write a default recursive dispatcher in C (slower!)

   Convention:

        Setting the "visit" method on the Visitor is, ahem, a novel
        approach.  It's a convention that PyChecker (the use of, not
        the development of) doesn't like ("unknown method visit()").
        I don't know if I don't like it, but it was unexpected.
        Passing the walker to the dispatch function is the sort of
        thing I would expect.

In general, pychecker2 calls "walk(node, Visitor())" a LOT.  The first
version of pychecker did a lot of things in a single pass.  That is
pretty efficient, but it's harder to add more checks without creating
really dense code.  I'm trying to structure pychecker2 around lots of
independent checks, so it will be easier to contribute new code.  The
consequence is: I would really like the visitor stuff to run
efficiently.

I gave up on trying to use recursion to figure out a Node's parents in
an AST tree.  Very often I need to know the parents of the Node I'm
looking at, and using recursion to hold this information was becoming
cumbersome.  For the time being, I'm adding a parent link to all AST
nodes just after a file is parsed.  An alternative data-structure
(heap?) would be just as swell so long as I can compute this
efficiently.

Line numbers appear to be added to AST nodes in arbitrary ways.

Some interesting projects which re-write existing code might like
other token information, like comments.

People have requested features for pychecker, like detecting
unnecessary parens and semicolons, which is not possible, since these
are not part of the AST.

-Eric