Extend ast with types for <type>* instead of using raw lists

When walking an ast it impossible to know the type of an empty list without writing down some giant lookup from node types and field names to field types. More concretely it would nice be to able to programatically visit all blocks (stmt*) without having to something like: ``` class BlockVisitor(NodeVisitor): def visit_If(self, node: If): self.visit(node.test) self.visit_block(node.body) self.visit_block(node.orelse) def visit_FunctionDef(self, node: FunctionDef): for field, value in iter_fields(node): if field == 'body': self.visit_block(value) else: # the implementation of generic_visit ``` Now it turns out that all fields that are lists and are named "body", "orelse", or "finalbody" are stmt* and only such fields are stmt*. A rule could also be synthesized to identify expr* and so forth but this seems incredibly hacky to me. It would be much cleaner if <type>* were actual nodes in the ast. E.g. something like: ``` class ast_list(AST, MutableSequence[T_co]): ... class StmtList(ast_list[stmt]): ... class ExprList(ast_list[expr]): ... ... class FunctionDef(stmt): name: identifier args: arguments body: StmtList decorator_list: ExprList returns: Optional[expr] ``` This would not change the behavior or structure in any way other than tagging <type>* and allowing <type>* to be visited. It would potentially break old code which relies on stuff like `if isinstance(node.field, list)` e.g. the implementation of generic_visit. Caleb Donovick

On one hand I can see how this may cause little inconvenience, but on other hand this would be a breaking change, so I don't think it is realistic. Also I think this is often alleviated by using super(). Maybe it is possible to preserve backwards compatibility by making ast_list a subclass of list? Or is it not possible for some reason? -- Ivan On Thu, 15 Aug 2019 at 22:32, Caleb Donovick <donovick@cs.stanford.edu> wrote:

On Aug 15, 2019, at 13:02, Caleb Donovick <donovick@cs.stanford.edu> wrote:
When walking an ast it impossible to know the type of an empty list without writing down some giant lookup from node types and field names to field types.
This part could be solved without making the lists AST nodes at all. Just use new bare subclasses of list for each of the kinds of lists, so a field in a node never has an empty list, it has an empty ExprList or StmtList or whatever. If that’s sufficient for your needs, that seems pretty easy, and I don’t think it would have any compatibility problems. If that’s not sufficient, and you really do need StmtList, etc. to be nodes, it’s a bit trickier, but I think still doable. Before I get to that, it seems like either the simple version above or the more involved version below is something you could write pretty easily (replacing ast.py with a fork but still using the same _ast.so under the covers), and share as a third-party module. I’m not sure how useful it would be on PyPI, but it would at least be useful to demonstrate to people here. Of course the real implementation will probably require substantial changes to the C code, but if you can build a playable prototype without any such changes, that’s always handy. Also, while a I can see a use for StmtList (because every StmtList is a block or a module body, so they all have quite a bit in common), where would you use ExprList? Is there anything you want to do with both decorator chains and del targets but not with non-ExprList nodes? I think to actually be useful, you’d need a bunch of separate subclasses of ExprList. And meanwhile, some of the lists that the grammar requires, like WithItemList, don’t seem like they’d ever be useful either. So, you are exposing a lot of new public types that nobody will ever care about, making it harder to find the ones they do care about in help, etc. I could be wrong; maybe you do need to distinguish among all these different kinds of lists by type, but don’t need to distinguish the different kinds of ExprList by type. But this implies that you probably need to give us a more concrete use case before the idea can be judged fairly. Anyway, on to the fun bit:
The first problem with treating lists as nodes is that the grammar doesn’t treat lists as nodes, so you’re introducing a disconnect. But it’s obviously a useful disconnect in some cases. So maybe the right answer is to add a new flag param lists_as_nodes=False to iter_child_nodes (and any functions that pass through to it like walk), NodeVisitor, and NodeTransformer? Making them nodes also means visiting an AST in 3.9 would see a very different tree than visiting the same AST in 3.8. And, while ast explicitly allows for breaking changes like that between versions, it still seems like a good idea to avoid gratuitous massive changes when possible. But the flag solves that too. The bigger problem is that if you have any code that walks ASTs manually— including generic code similar to walk or NodeVisitor—that code isn’t going to know how to walk the elements of an ast_list, because they aren’t fields. That’s a huge backward compatibility problem. But actually, I think this can be solved very easily: just add a field to ast_list named elts that contains a reference to self (or even a copy of the same list as a plain old list). So that just leaves the type issue you started with. You want ast_list to be an AST node, ideally without breaking the fact that it’s a list. But you can’t subclass both list and AST because AST is a structseq, and you can only have one base class with struct members. But I don’t think there’s any actual behavior in AST that you need (except maybe for looking like a structseq, with _fields, but you can fake that with a plain old class a la namedtuple or dataclass, as long as there’s no C code that relies on being able to iterate the structseq fields). It I’m right, there are two possible solutions you could explore. Option 1 is to rename AST to _AST, then make a new empty base class AST that both _AST and ast_list subclass. (The lists_as_nodes flag then just selects between testing AST and _AST.) Option 2 is to only virtually subclass AST. The new ast_list inherits only from list, but AST has a subclass hook that accepts ast_list as a subclass. (The lists_as_nodes flag becomes slightly more complicated, but still pretty simple—and besides, that’s all under-the-covers code.)

You can dump the AST back to a string with Parso. In fact it's built in and way better because it keeps the formatting and comments!
Further, it is highly desirable for me to be able to turn the AST back into a string (as astor allows) so that I can generate reasonable error messages and debug.
Then you really should look at parso!

On one hand I can see how this may cause little inconvenience, but on other hand this would be a breaking change, so I don't think it is realistic. Also I think this is often alleviated by using super(). Maybe it is possible to preserve backwards compatibility by making ast_list a subclass of list? Or is it not possible for some reason? -- Ivan On Thu, 15 Aug 2019 at 22:32, Caleb Donovick <donovick@cs.stanford.edu> wrote:

On Aug 15, 2019, at 13:02, Caleb Donovick <donovick@cs.stanford.edu> wrote:
When walking an ast it impossible to know the type of an empty list without writing down some giant lookup from node types and field names to field types.
This part could be solved without making the lists AST nodes at all. Just use new bare subclasses of list for each of the kinds of lists, so a field in a node never has an empty list, it has an empty ExprList or StmtList or whatever. If that’s sufficient for your needs, that seems pretty easy, and I don’t think it would have any compatibility problems. If that’s not sufficient, and you really do need StmtList, etc. to be nodes, it’s a bit trickier, but I think still doable. Before I get to that, it seems like either the simple version above or the more involved version below is something you could write pretty easily (replacing ast.py with a fork but still using the same _ast.so under the covers), and share as a third-party module. I’m not sure how useful it would be on PyPI, but it would at least be useful to demonstrate to people here. Of course the real implementation will probably require substantial changes to the C code, but if you can build a playable prototype without any such changes, that’s always handy. Also, while a I can see a use for StmtList (because every StmtList is a block or a module body, so they all have quite a bit in common), where would you use ExprList? Is there anything you want to do with both decorator chains and del targets but not with non-ExprList nodes? I think to actually be useful, you’d need a bunch of separate subclasses of ExprList. And meanwhile, some of the lists that the grammar requires, like WithItemList, don’t seem like they’d ever be useful either. So, you are exposing a lot of new public types that nobody will ever care about, making it harder to find the ones they do care about in help, etc. I could be wrong; maybe you do need to distinguish among all these different kinds of lists by type, but don’t need to distinguish the different kinds of ExprList by type. But this implies that you probably need to give us a more concrete use case before the idea can be judged fairly. Anyway, on to the fun bit:
The first problem with treating lists as nodes is that the grammar doesn’t treat lists as nodes, so you’re introducing a disconnect. But it’s obviously a useful disconnect in some cases. So maybe the right answer is to add a new flag param lists_as_nodes=False to iter_child_nodes (and any functions that pass through to it like walk), NodeVisitor, and NodeTransformer? Making them nodes also means visiting an AST in 3.9 would see a very different tree than visiting the same AST in 3.8. And, while ast explicitly allows for breaking changes like that between versions, it still seems like a good idea to avoid gratuitous massive changes when possible. But the flag solves that too. The bigger problem is that if you have any code that walks ASTs manually— including generic code similar to walk or NodeVisitor—that code isn’t going to know how to walk the elements of an ast_list, because they aren’t fields. That’s a huge backward compatibility problem. But actually, I think this can be solved very easily: just add a field to ast_list named elts that contains a reference to self (or even a copy of the same list as a plain old list). So that just leaves the type issue you started with. You want ast_list to be an AST node, ideally without breaking the fact that it’s a list. But you can’t subclass both list and AST because AST is a structseq, and you can only have one base class with struct members. But I don’t think there’s any actual behavior in AST that you need (except maybe for looking like a structseq, with _fields, but you can fake that with a plain old class a la namedtuple or dataclass, as long as there’s no C code that relies on being able to iterate the structseq fields). It I’m right, there are two possible solutions you could explore. Option 1 is to rename AST to _AST, then make a new empty base class AST that both _AST and ast_list subclass. (The lists_as_nodes flag then just selects between testing AST and _AST.) Option 2 is to only virtually subclass AST. The new ast_list inherits only from list, but AST has a subclass hook that accepts ast_list as a subclass. (The lists_as_nodes flag becomes slightly more complicated, but still pretty simple—and besides, that’s all under-the-covers code.)

You can dump the AST back to a string with Parso. In fact it's built in and way better because it keeps the formatting and comments!
Further, it is highly desirable for me to be able to turn the AST back into a string (as astor allows) so that I can generate reasonable error messages and debug.
Then you really should look at parso!
participants (4)
-
Anders Hovmöller
-
Andrew Barnert
-
Caleb Donovick
-
Ivan Levkivskyi