[Python-ideas] Could the ast module's ASTs preserve source_length in addition to lineno and col_offset?

Haoyi Li haoyi.sg at gmail.com
Thu May 30 01:06:05 CEST 2013


I just finished writing a rather knotty piece of code
<https://github.com/lihaoyi/macropy/blob/master/macropy/core/macros.py#L178-L210>
to try to extract the original source code that went into creating an AST.
The whole thing is a nasty hack, and I was wondering if there was a better
way of doing things. In particular, some obvious techniques aren't
sufficient:

- lineno/col_offset only tell you where the AST starts, not where it ends.
The next AST's lineno/col_offset tells you where the next AST starts, which
is not where the previous one ends: it could include a whole bunch of trash
(whitespace/closing parentheses/comments/etc.) that I don't want.
- unparsing the AST (via unparser.py
<http://svn.python.org/projects/python/trunk/Demo/parser/unparse.py> or
similar) is not sufficient, because the original parsing has already thrown
away a bunch of information from the original source, e.g. redundant
parentheses, comments. You can get something which runs identically, but
you can't get the exact original source.
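
To make both limitations concrete, here is a minimal sketch. The example
strings and the helper name `structure` are made up for illustration; only
ast.parse and ast.dump are real calls.

```python
import ast

def structure(src):
    # Dump the tree without position attributes (ast.dump omits them by default).
    return ast.dump(ast.parse(src))

original = "assert ((a + b)) == c  # sanity check\n"
stripped = "assert a + b == c\n"

# Redundant parentheses and the comment leave no trace in the tree, so
# nothing generated from the AST alone can reproduce `original` exactly.
same_tree = structure(original) == structure(stripped)

# The condition node records where it starts, but carries no end position.
cond = ast.parse(original).body[0].test
start = (cond.lineno, cond.col_offset)
print(same_tree, start)
```

Both sources produce the same tree, and the only positional data on the
condition is its starting line and column.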

I want the original source code for debugging/tracing purposes: I want my
debugging asserts/tracing macros to show me the original source code of the
condition which failed, and not source code + extra junk or source code +
reshuffled parentheses (as would be the case with the two techniques used
above). However, other possible uses come to mind:

- It would make tools like 2to3.py much simpler, since you could work
purely at an AST level and just say "give me original source here!" for the
parts which don't need to be changed. Currently it has its own
lexer/parser, which is necessary (under the status quo) for reasons given
above, but seems like a great waste when there's already a perfectly good
lexer/parser in the ast module.
- Automatically extracting the source code from unit tests to insert into
documentation, which would be much easier if I could work purely at an AST
level.

So what's there to do? I've described why the two techniques above are each
insufficient, but together, you can:

- Bound the extent of an AST in the source code using the AST's subtree's
minimal and maximal lineno/col_offset, along with its successor's minimal
lineno/col_offset
- scrub that extent with ast.parse, trying to parse each and every possible
string and (if it parses) unparsing it to check semantic equality with the
original AST
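
A stripped-down sketch of that combined approach follows. It is not the
macropy code linked above: the helpers `offset_of` and `recover_source` are
hypothetical, it handles expressions only, it assumes ASCII source, and it
compares ast.dump output instead of unparsing.

```python
import ast

def offset_of(source, lineno, col_offset):
    # Absolute index of a (1-based line, 0-based column) position.
    # Assumes ASCII source, so byte offsets and character offsets agree.
    lines = source.splitlines(True)
    return sum(len(line) for line in lines[:lineno - 1]) + col_offset

def recover_source(source, node, upper_bound):
    # Hypothetical helper: brute-force the source text of an expression node.
    start = offset_of(source, node.lineno, node.col_offset)
    wanted = ast.dump(node)  # positions are omitted, so only structure is compared
    for end in range(start + 1, upper_bound + 1):
        candidate = source[start:end]
        try:
            parsed = ast.parse(candidate, mode="eval").body
        except SyntaxError:
            continue  # this prefix is not a complete expression
        if ast.dump(parsed) == wanted:
            return candidate  # shortest prefix that parses to the same tree
    return None

source = "assert (a + b) == c  # sanity check\nprint(done)\n"
tree = ast.parse(source)
cond = tree.body[0].test
successor = tree.body[1]
# The successor's start position bounds how far the search may run.
bound = offset_of(source, successor.lineno, successor.col_offset)
recovered = recover_source(source, cond, bound)
print(recovered)
```

Every candidate prefix gets parsed from scratch, so the cost grows roughly
quadratically with the size of the extent being searched.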

This is terribly hacky, the asymptotic performance is not good, and you
could say many other nasty words about it. And all because I need to
retrieve some information (source_length of the AST) that the parser
probably already had, but conveniently threw away before giving me the AST.

Would it make sense to have the parser preserve the source_length in the
ast.AST objects, along with the lineno and col_offset? This would take a
minuscule amount of additional storage, is information that I'm sure it
already has, and would greatly benefit the use cases I described above.
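
For comparison, here is roughly what recovery could look like with such an
attribute. The `source_length` attribute is the hypothetical one proposed
above; the ast module does not provide it, so the sketch sets it by hand. The
helper `source_of` is also made up, and ASCII source is assumed.

```python
import ast

def source_of(source, node):
    # Hypothetical: slice the original text using a parser-provided length.
    lines = source.splitlines(True)
    start = sum(len(line) for line in lines[:node.lineno - 1]) + node.col_offset
    return source[start:start + node.source_length]

source = "assert (a + b) == c  # sanity check\n"
cond = ast.parse(source).body[0].test

# Stand-in for what the parser would record; set manually for this sketch.
cond.source_length = len("(a + b) == c")

print(source_of(source, cond))
```

With the length stored on the node, the search over candidate substrings
disappears and extraction becomes a single slice.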

-Haoyi