Recent experience with the _ast module
I've been working this past week at converting the stdlib's compiler package to use 2.5's new _ast module instead of the package's own AST. Overall, _ast was a joy to work with, save for the following nitpicks: 1) There are times when the _fields attribute on some AST nodes is None; if this was done to indicate that a given node has no AST-related attributes, it would be much easier if _fields was [] in this case. As it is, I had to special-case "node._fields is None" in the visitor so that I don't accidentally iterate over it. 2) {BinOp,AugAssign,BoolOp,etc}.op is an instance of, eg, Add, Sub, Mult, Mod, meaning you have to dispatch based on tests like "isinstance(node.op, x)" or "type(node.op) is x". I would much, much prefer to spell this "node.op is x", ie, use "node.op = Add" rather than the current "node.op = Add()" when constructing the nodes. 3) I'd like there to be an Else() node for If.orelse, While.orelse, etc. My motivation is that I need the lineno and col_offset values of the "else" statement for a code-coverage utility; as it is, I have to find the last lineno of the if/while/etc suite and the first lineno of the "else" suite and then search between those two for the "else" statement. Thoughts? Thanks, Collin Winter
On 2/12/07, Collin Winter
I've been working this past week at converting the stdlib's compiler package to use 2.5's new _ast module instead of the package's own AST. Overall, _ast was a joy to work with, save for the following nitpicks:
1) There are times when the _fields attribute on some AST nodes is None; if this was done to indicate that a given node has no AST-related attributes, it would be much easier if _fields was [] in this case. As it is, I had to special-case "node._fields is None" in the visitor so that I don't accidentally iterate over it.
That makes total sense to me.
2) {BinOp,AugAssign,BoolOp,etc}.op is an instance of, eg, Add, Sub, Mult, Mod, meaning you have to dispatch based on tests like "isinstance(node.op, x)" or "type(node.op) is x". I would much, much prefer to spell this "node.op is x", ie, use "node.op = Add" rather than the current "node.op = Add()" when constructing the nodes.
I can't think of a reason, off the top of my head, why they can't be singletons. It actually makes sense since they are basically an enumeration value in the AST grammar.
3) I'd like there to be an Else() node for If.orelse, While.orelse, etc. My motivation is that I need the lineno and col_offset values of the "else" statement for a code-coverage utility; as it is, I have to find the last lineno of the if/while/etc suite and the first lineno of the "else" suite and then search between those two for the "else" statement.
Ouch. I don't know how much work it would be to make this change, but it seems reasonable to want.
Thoughts?
They all sound reasonable. And it's nice to have a wanted feature be found from actual use. =) -Brett
On 2/13/07, Greg Ewing
Brett Cannon wrote:
On 2/12/07, Collin Winter
wrote: 2) {BinOp,AugAssign,BoolOp,etc}.op
I can't think of a reason, off the top of my head, why they can't be singletons.
Why not just use interned strings?
Because the AST code at the C level represent the 'op' value as basically an enumeration value, not a string. So creating an object is somewhat easier then trying to do the conversion to an interned string. Or at least I suspect; Martin can tell me I'm wrong if I am. -Brett
Brett Cannon wrote:
Because the AST code at the C level represent the 'op' value as basically an enumeration value, not a string. So creating an object is somewhat easier then trying to do the conversion to an interned string.
It would seem even easier (and a lot faster) to use an array to go from C enum --> some object, which could as well be an interned string as anything else. -- Greg
Greg Ewing schrieb:
It would seem even easier (and a lot faster) to use an array to go from C enum --> some object, which could as well be an interned string as anything else.
Have you actually looked at the code? How could it be faster than PyObject* ast2obj_boolop(boolop_ty o) { switch(o) { case And: Py_INCREF(And_singleton); return And_singleton; case Or: Py_INCREF(Or_singleton); return Or_singleton; } return NULL; /* cannot happen */ } Not sure what interned strings have to do with it. Regards, Martin
Martin v. Löwis wrote:
switch(o) { case And: Py_INCREF(And_singleton); return And_singleton; case Or: Py_INCREF(Or_singleton); return Or_singleton;
Well, And_singleton and Or_singleton could be anything, including integers or interned strings. Creating a class and then a singleton instance for each one seems like overkill and unnecessary bloat to me. -- Greg
Greg Ewing schrieb:
Brett Cannon wrote:
On 2/12/07, Collin Winter
wrote: 2) {BinOp,AugAssign,BoolOp,etc}.op I can't think of a reason, off the top of my head, why they can't be singletons.
Why not just use interned strings?
Because it's generated code. It has to follow rules, and it would be good if there were few exceptions. Regards, Martin
On 2/12/07, Brett Cannon
They all sound reasonable. And it's nice to have a wanted feature be found from actual use. =)
I've just uploaded patch #1659410 to SF, implementing the changes I outlined: 1) I changed some of the code emitted by asdl_c.py so _fields is always a tuple. Where _fields used to be None, it's now a 0-element tuple. 2) It turned out that {BinOp, BoolOp,AugAssign,etc}.op were already singleton instances of their respective classes. I've changed asdl_c.py to no longer emit the *_singleton names and to use the corresponding *_type types in their place, which will enable me to write "node.op is _ast.Add", etc. 3) Adding an Else node was a little more involved, but it proved do-able. TryFinally.orelse, While.orelse, For.orelse and If.orelse can now be either None or an instance of Else. Else instances have a 'body' attribute, which is a sequence of statements. Thanks, Collin Winter
Collin Winter schrieb:
2) It turned out that {BinOp, BoolOp,AugAssign,etc}.op were already singleton instances of their respective classes. I've changed asdl_c.py to no longer emit the *_singleton names and to use the corresponding *_type types in their place, which will enable me to write "node.op is _ast.Add", etc.
I don't really like this - it's inconsistent. I'd rather prefer if the singletons where exposed under a name, e.g. _ast.Add.singleton, or _ast.add (if that doesn't cause conflicts).
3) Adding an Else node was a little more involved, but it proved do-able. TryFinally.orelse, While.orelse, For.orelse and If.orelse can now be either None or an instance of Else. Else instances have a 'body' attribute, which is a sequence of statements.
I still can't see the need (we don't have line numbers for every token in the AST), but I can't see anything wrong with it, either (except for the change in the structure, of course, but that will happen, anyway). Regards, Martin
On 2/14/07, "Martin v. Löwis"
Collin Winter schrieb:
2) It turned out that {BinOp, BoolOp,AugAssign,etc}.op were already singleton instances of their respective classes. I've changed asdl_c.py to no longer emit the *_singleton names and to use the corresponding *_type types in their place, which will enable me to write "node.op is _ast.Add", etc.
I don't really like this - it's inconsistent. I'd rather prefer if the singletons where exposed under a name, e.g. _ast.Add.singleton, or _ast.add (if that doesn't cause conflicts).
What's inconsistent about it? That classes are being used for the _ast.{Add,Sub,Mult,etc} names? I don't see the need for both _ast.Add and _ast.Add.singleton or _ast.add or however else it might be spelled. I'd be perfectly happy doing something like "_ast.Add = object()" (when initializing the _ast module), so long as I can write "node.op is _ast.Add", "node.op == _ast.Add", or something equally brief and to-the-point. Collin Winter
Collin Winter schrieb:
What's inconsistent about it? That classes are being used for the _ast.{Add,Sub,Mult,etc} names?
Exactly. These aren't names - they are nodes in the tree. All nodes are instances of _ast.AST.
I don't see the need for both _ast.Add and _ast.Add.singleton or _ast.add or however else it might be spelled. I'd be perfectly happy doing something like "_ast.Add = object()" (when initializing the _ast module), so long as I can write "node.op is _ast.Add", "node.op == _ast.Add", or something equally brief and to-the-point.
Would you like to do the same for Pass, Break, Continue, and Ellipsis? They are also "just names". If you make _ast.Add the entry in the tree itself, why not _ast.Break? Or, if you have a way to deal with _ast.Break, why can't the same way work for _ast.Add? Regards, Martin
On 2/14/07, "Martin v. Löwis"
Collin Winter schrieb:
What's inconsistent about it? That classes are being used for the _ast.{Add,Sub,Mult,etc} names?
Exactly. These aren't names - they are nodes in the tree. All nodes are instances of _ast.AST.
I don't see the need for both _ast.Add and _ast.Add.singleton or _ast.add or however else it might be spelled. I'd be perfectly happy doing something like "_ast.Add = object()" (when initializing the _ast module), so long as I can write "node.op is _ast.Add", "node.op == _ast.Add", or something equally brief and to-the-point.
Would you like to do the same for Pass, Break, Continue, and Ellipsis?
They are also "just names". If you make _ast.Add the entry in the tree itself, why not _ast.Break? Or, if you have a way to deal with _ast.Break, why can't the same way work for _ast.Add?
But Pass, Break, Continue and Ellipsis aren't in the same category as Add, Mult, Div, etc.. The former stand alone, while the only time you'll ever encounter Add, Mult, Mod and co. is in the context of a BoolOp, BinOp or whatever. I could of course add separate visitor methods for each op type to my code generator, but that's a lot of boilerplate code for very little gain. I could use a dispatch table in visitBoolOp, and dispatch like "dispatch[type(node.op)]", but why? In this scenario, node.op doesn't convey any information other than its type. Add, Mult, etc are just values in an enumeration, so why not treat them that way and be able to write "dispatch[node.op]". So: what if _ast.Add and co. were singleton instances of _ast.AST or even of a subclass of AST, like _ast.Op? That way, we keep the consistency that all AST nodes are instances of _ast.AST and I get the nice property I'm looking for, ie, "node.op is _ast.Mult". Collin Winter
On 2/16/07, Collin Winter
On 2/14/07, "Martin v. Löwis"
wrote: Collin Winter schrieb:
What's inconsistent about it? That classes are being used for the _ast.{Add,Sub,Mult,etc} names?
Exactly. These aren't names - they are nodes in the tree. All nodes are instances of _ast.AST.
I don't see the need for both _ast.Add and _ast.Add.singleton or _ast.add or however else it might be spelled. I'd be perfectly happy doing something like "_ast.Add = object()" (when initializing the _ast module), so long as I can write "node.op is _ast.Add", "node.op == _ast.Add", or something equally brief and to-the-point.
Would you like to do the same for Pass, Break, Continue, and Ellipsis?
They are also "just names". If you make _ast.Add the entry in the tree itself, why not _ast.Break? Or, if you have a way to deal with _ast.Break, why can't the same way work for _ast.Add?
But Pass, Break, Continue and Ellipsis aren't in the same category as Add, Mult, Div, etc.. The former stand alone, while the only time you'll ever encounter Add, Mult, Mod and co. is in the context of a BoolOp, BinOp or whatever. I could of course add separate visitor methods for each op type to my code generator, but that's a lot of boilerplate code for very little gain. I could use a dispatch table in visitBoolOp, and dispatch like "dispatch[type(node.op)]", but why? In this scenario, node.op doesn't convey any information other than its type. Add, Mult, etc are just values in an enumeration, so why not treat them that way and be able to write "dispatch[node.op]".
So: what if _ast.Add and co. were singleton instances of _ast.AST or even of a subclass of AST, like _ast.Op? That way, we keep the consistency that all AST nodes are instances of _ast.AST and I get the nice property I'm looking for, ie, "node.op is _ast.Mult".
If you look back in the history of this thread I think Martin suggested this: "I'd rather prefer if the singletons where exposed under a name, e.g. _ast.Add.singleton, or _ast.add (if that doesn't cause conflicts)." Since the both of you have proposed this I think this is the best way to move forward. -Brett
On 2/16/07, Brett Cannon
On 2/16/07, Collin Winter
wrote: On 2/14/07, "Martin v. Löwis"
wrote: Collin Winter schrieb:
What's inconsistent about it? That classes are being used for the _ast.{Add,Sub,Mult,etc} names?
Exactly. These aren't names - they are nodes in the tree. All nodes are instances of _ast.AST.
I don't see the need for both _ast.Add and _ast.Add.singleton or _ast.add or however else it might be spelled. I'd be perfectly happy doing something like "_ast.Add = object()" (when initializing the _ast module), so long as I can write "node.op is _ast.Add", "node.op == _ast.Add", or something equally brief and to-the-point.
Would you like to do the same for Pass, Break, Continue, and Ellipsis?
They are also "just names". If you make _ast.Add the entry in the tree itself, why not _ast.Break? Or, if you have a way to deal with _ast.Break, why can't the same way work for _ast.Add?
But Pass, Break, Continue and Ellipsis aren't in the same category as Add, Mult, Div, etc.. The former stand alone, while the only time you'll ever encounter Add, Mult, Mod and co. is in the context of a BoolOp, BinOp or whatever. I could of course add separate visitor methods for each op type to my code generator, but that's a lot of boilerplate code for very little gain. I could use a dispatch table in visitBoolOp, and dispatch like "dispatch[type(node.op)]", but why? In this scenario, node.op doesn't convey any information other than its type. Add, Mult, etc are just values in an enumeration, so why not treat them that way and be able to write "dispatch[node.op]".
So: what if _ast.Add and co. were singleton instances of _ast.AST or even of a subclass of AST, like _ast.Op? That way, we keep the consistency that all AST nodes are instances of _ast.AST and I get the nice property I'm looking for, ie, "node.op is _ast.Mult".
If you look back in the history of this thread I think Martin suggested this: "I'd rather prefer if the singletons where exposed under a name, e.g. _ast.Add.singleton, or _ast.add (if that doesn't cause conflicts)."
Since the both of you have proposed this I think this is the best way to move forward.
That's what I get for letting this thread drop for a few days. I read Martin's original proposal as wanting to have _ast.add (singleton instance) *in addition to* _ast.Add (presumably the class of _ast.add). To be clear, I want to replace the current _ast.Add (a class, subclassing AST) with an instance of _ast.Op or something comparable; making _ast.{Load,Store,Del,AugLoad,AugStore,Param} into instances of an _ast.Context class would be a part of this. Collin Winter
Collin Winter schrieb:
But Pass, Break, Continue and Ellipsis aren't in the same category as Add, Mult, Div, etc.. The former stand alone
That's not true. Pass, Break, Continue don't stand alone; they are members of the body sequence of some other statement (in particular for Break and Continue), you need some kind of loop around it. In any case, they are in the same category as they have no child nodes, which is a prerequisite for not creating objects. I can't see why the property "stand alone" should impact whether objects are created or not.
So: what if _ast.Add and co. were singleton instances of _ast.AST or even of a subclass of AST, like _ast.Op? That way, we keep the consistency that all AST nodes are instances of _ast.AST and I get the nice property I'm looking for, ie, "node.op is _ast.Mult".
I like this better. The base class you are looking for is _ast.operator; it should be already there (please do take a look at Python.asdl to see how I came to this conclusion, without studying the _ast module again). Regards, Martin
On 2/17/07, "Martin v. Löwis"
Collin Winter schrieb:
But Pass, Break, Continue and Ellipsis aren't in the same category as Add, Mult, Div, etc.. The former stand alone
That's not true. Pass, Break, Continue don't stand alone; they are members of the body sequence of some other statement (in particular for Break and Continue), you need some kind of loop around it.
In any case, they are in the same category as they have no child nodes, which is a prerequisite for not creating objects. I can't see why the property "stand alone" should impact whether objects are created or not.
I phrased that poorly. If Pass, Break and Continue were made into singleton instances, they wouldn't be able to be handled by the same visitor dispatch routine as the other statement types. Taking Demo/parser/unparse.py as an example, Unparser.dispatch would have to change to handle Break, Pass and Continue specially from the others. Changing the operator and context nodes to be singleton instances involves just modifying the dispatch tables used by _BoolOp, _BinOp, etc. Collin Winter
Collin Winter schrieb:
1) There are times when the _fields attribute on some AST nodes is None; if this was done to indicate that a given node has no AST-related attributes, it would be much easier if _fields was [] in this case. As it is, I had to special-case "node._fields is None" in the visitor so that I don't accidentally iterate over it.
That can be done, I think.
2) {BinOp,AugAssign,BoolOp,etc}.op is an instance of, eg, Add, Sub, Mult, Mod, meaning you have to dispatch based on tests like "isinstance(node.op, x)" or "type(node.op) is x". I would much, much prefer to spell this "node.op is x", ie, use "node.op = Add" rather than the current "node.op = Add()" when constructing the nodes.
Please look at Python.asdl. This things are really belong to sum nodes, and Add, Sub etc. are really operators. I *think* they are singletons, and I don't mind making it a guarantee that they are: i.e. all operators that have no fields and where the sum has no attributes. If you want to write "is", you can also write node.op.__class__ is Add
3) I'd like there to be an Else() node for If.orelse, While.orelse, etc. My motivation is that I need the lineno and col_offset values of the "else" statement for a code-coverage utility; as it is, I have to find the last lineno of the if/while/etc suite and the first lineno of the "else" suite and then search between those two for the "else" statement.
Notice that you cannot restore the original source code, anyway. You cannot tell whether something was else:if or elif. I don't understand why you need the line number of the else keyword (which, as I said, may not exist in the source) for code coverage: The 'else' is never executed (only the statements in it are). Regards, Martin
participants (4)
-
"Martin v. Löwis"
-
Brett Cannon
-
Collin Winter
-
Greg Ewing