<div dir="ltr">I just finished writing a rather <a href="https://github.com/lihaoyi/macropy/blob/master/macropy/core/macros.py#L178-L210">knotty piece of code</a> to try to extract the original source code that went into creating an AST. The whole thing is a nasty hack, and I was wondering whether there is a better way of doing things. In particular, some obvious techniques aren't sufficient:<div>
<br></div><div style>- lineno/col_offset just tells you where the AST starts, not where it ends. The next AST's lineno/col_offset tells you where the next AST starts, which is not where the previous one ends: it could include a whole bunch of trash (whitespace/closing parentheses/comments/etc.) that I don't want.</div>
<div style>- unparsing the AST (via <a href="http://svn.python.org/projects/python/trunk/Demo/parser/unparse.py">unparse.py</a> or similar) is not sufficient, because the original parsing has already thrown away a bunch of information from the original source, e.g. redundant parentheses and comments. You can get something which runs identically, but you can't get the exact original source.</div>
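<div style><br></div><div style>A minimal sketch of both shortcomings, assuming Python 3.9+ so that the standard library's ast.unparse can stand in for unparse.py. The one-line source string and the name check are made up for illustration:</div>
<pre>
```python
import ast

# Hypothetical one-line example: redundant parentheses, odd spacing, a comment.
source = "check((a + b) , c)  # trailing comment\n"
call = ast.parse(source).body[0].value
first, second = call.args

# Technique 1: slice from the start of one node to the start of the next.
# Both nodes sit on line 1 and the text is ASCII, so col_offset can index
# the line directly.
line = source.splitlines()[0]
naive_slice = line[first.col_offset:second.col_offset]

# Technique 2: regenerate source from the AST (ast.unparse needs Python 3.9+).
regenerated = ast.unparse(call)

print(repr(naive_slice))   # the expression plus trailing junk before the next node
print(repr(regenerated))   # normalized text: redundant parentheses and comment are gone
```
</pre>
<div style>Neither result is the exact text of the first argument as it was written: the slice runs on past the end of the expression, and the regenerated text has been normalized.</div>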
<div style><br></div><div style>I want the original source code for debugging/tracing purposes: I want my debugging asserts/tracing macros to show me the original source code of the condition which failed, and not source code + extra junk or source code + reshuffled parentheses (as would be the case with the two techniques used above). However, other possible uses come to mind:</div>
<div style><br></div><div style>- It would make tools like 2to3.py much simpler, since you could work purely at an AST level and just say "give me original source here!" for the parts which don't need to be changed. Currently it has its own lexer/parser, which is necessary (under the status quo) for reasons given above, but seems like a great waste when there's already a perfectly good lexer/parser in the ast module.</div>
<div style>- Automatically extracting the source code from unit tests to insert into documentation, which would be much easier if I could work purely at an AST level.</div><div style><br></div><div style>So what's there to do? I've described why the two techniques above are each insufficient, but together, you can:</div>
<div style><br></div><div style>- Bound the extent of an AST in the source code using the minimal and maximal lineno/col_offset within the AST's subtree, along with its successor's minimal lineno/col_offset</div>
<div style>- Scrub that extent with ast.parse, trying to parse every candidate substring and (if it parses) unparsing it to check semantic equality with the original AST</div><div style><br></div><div style>This is terribly hacky, the asymptotic performance is not good, and you could say many other nasty words about it. And all because I need to retrieve some information (source_length of the AST) that the parser probably already had, but conveniently threw away before giving me the AST.</div>
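<div style><br></div><div style>The bound-and-scrub steps above can be sketched roughly as follows. This is not the linked macropy code, just a simplified illustration under several assumptions: the source is ASCII (so col_offset counts characters), the node is an expression in Load context, and structural equality is checked by comparing ast.dump output instead of unparsing. The helper names absolute_offset and recover_source are hypothetical.</div>
<pre>
```python
import ast

def absolute_offset(source, node):
    """Character offset of the node's start (assumes ASCII source)."""
    lines = source.splitlines(keepends=True)
    return sum(len(line) for line in lines[:node.lineno - 1]) + node.col_offset

def recover_source(source, node, stop):
    """Shortest slice starting at `node` that parses back to the same tree.

    `stop` is the absolute offset of the node's successor and serves as the
    upper bound of the search. Returns None if no candidate matches.
    """
    target = ast.dump(node)  # dump() leaves out line/column info by default
    start = absolute_offset(source, node)
    for end in range(start + 1, stop + 1):
        candidate = source[start:end]
        try:
            parsed = ast.parse(candidate, mode="eval").body
        except SyntaxError:
            continue  # most prefixes are not valid expressions
        if ast.dump(parsed) == target:
            return candidate
    return None

# Hypothetical example input.
source = "check((a + b) , c)  # trailing comment\n"
first, second = ast.parse(source).body[0].value.args
recovered = recover_source(source, first, absolute_offset(source, second))
print(repr(recovered))
```
</pre>
<div style>Every candidate prefix costs a full parse, so the work grows roughly quadratically with the size of the extent, which is the performance problem mentioned above.</div>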
<div style><br></div><div style>Would it make sense to have the parser preserve the source_length in the ast.AST objects, along with the lineno and col_offset? This would take a minuscule amount of additional storage, is information that I'm sure it already has, and would greatly benefit the use cases I described above.</div>
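<div style><br></div><div style>To illustrate how little code the consumer side would need, here is a sketch that assumes a source_length attribute on nodes. That attribute is hypothetical (it is the thing being proposed), so the example sets it by hand to stand in for the parser. The helper source_of and the sample source are also made up, and ASCII source is assumed so that col_offset counts characters.</div>
<pre>
```python
import ast

def source_of(node, source):
    """Exact original text of `node`, given a HYPOTHETICAL node.source_length."""
    lines = source.splitlines(keepends=True)
    start = sum(len(line) for line in lines[:node.lineno - 1]) + node.col_offset
    return source[start:start + node.source_length]

# Hypothetical example input with odd spacing and a trailing comment.
source = "total = 0\nresult = compute(x,   y)  # odd spacing kept\n"
call = ast.parse(source).body[1].value

# Stand-in for what the parser would record under the proposal.
call.source_length = len("compute(x,   y)")

print(repr(source_of(call, source)))
```
</pre>
<div style>With the length available, recovering the original text is a single slice, with no searching or re-parsing.</div>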
<div style><br></div><div style>-Haoyi</div></div>