Re: Non-standard evaluation for Python

(Re-sending, because this was originally a reply to an off-list message by Nima Hamidi) On Jul 13, 2019, at 14:12, Nima Hamidi <hamidi@stanford.edu> wrote:
Sometimes it's necessary not to evaluate the expression. Two such applications of NSE in R are as follows:
1. Data-tables have cleaner syntax. For example, letting dt be a data-table with a column called price, one can retrieve items cheaper than $1 with dt[price < 1]. Pandas syntax requires something like dt[dt.price < 1]. This is currently inevitable, as the expression is evaluated *before* __getitem__ is invoked. Using NSE, dt.__getitem__ can first add its columns to the locals() dictionary and then evaluate the expression in the new context.
This one looks good. I can also imagine it being useful for SQLAlchemy, appscript, etc. just as it is for Pandas. But in your proposal, wouldn’t this have to be written as dt[`price < 1`]? I think the cost of putting the expression in ticks is at least as bad as the cost of naming the dt. Also: dt.price < 1 is a perfectly valid expression, with a useful value. You can store it in a temporary variable to avoid repeating it, or stash it for later, or print it out to see what’s happening. But price < 1 on its own is a NameError, and I’m not sure what `price < 1` is worth on its own. Would this invite code that’s hard to refactor and even harder to debug?
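To make the status-quo constraint concrete, here is a stdlib-only toy (Table is a hypothetical illustration class, not the Pandas API): the subscript expression is evaluated in the caller's scope before __getitem__ ever runs, so a bare column name simply raises NameError.

```python
# Toy table: __getitem__ takes an already-computed boolean mask.
# Table is a hypothetical illustration class, not part of any library.
class Table:
    def __init__(self, **columns):
        self.columns = columns

    def __getitem__(self, mask):
        # Keep the rows whose mask entry is True.
        return {name: [v for v, keep in zip(col, mask) if keep]
                for name, col in self.columns.items()}

dt = Table(price=[0.5, 1.5, 0.75])

try:
    dt[price < 1]  # `price < 1` is evaluated before __getitem__: NameError
except NameError:
    pass

# Today you must name the table to build the mask yourself:
mask = [p < 1 for p in dt.columns["price"]]
assert dt[mask] == {"price": [0.5, 0.75]}
```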
2. Pipelining in R is also much cleaner. Dplyr provides the %>% operator (originally from the magrittr package), which passes the return value of its LHS as the first argument of its RHS. In other words, f() %>% g() is equivalent to g(f()). This is pretty useful for long pipelines. It works by rewriting the AST and then evaluating the modified expression. In this example, evaluating g() on its own is undesirable.
This doesn’t seem necessary in a language with first-class functions. Why can’t you just write the pipeline as something like f %>% g, much as you would in, say, Haskell? That would just define a function (presumably equivalent to either lambda: g(f()) or lambda *a, **kw: g(f(*a, **kw))) that represents the pipeline, which you then call normally. I don’t see the benefit in being able to write g() instead of g here, and in fact it seems actively misleading, because it implies calling g on no arguments instead of one. Also, given that Python doesn’t have such an operator, or a way to define custom operators, and that proposals for even simpler operators on functions, like @ for compose, have been rejected every time they’ve been suggested, I wouldn’t expect much traction from this example. Is there something similar that could plausibly be done in Python, and feel Pythonic?

---

A couple more things I thought of since the initial reply:

I’m pretty sure Python’s AST objects don’t contain the original source text. So what is your plot function actually going to do with its arguments to get the axes? What if it’s called with plot(`x[..., 3:]`)? Will plot—and every other function that wants to do something similar—need to come up with a way to generate the nicest source text that could produce the given AST? Or do we need to add a decompiler to the stdlib for them? I suppose you could solve this by just adding more fields to BoundExpression, but I’m not sure that wouldn’t make the backtick feature a lot harder to implement.

Backticks are supposed to be banned for the life of Python, since 3.0 eliminated them as shorthand for repr. That could be revisited, but it might be a tough sell. Maybe the original “grit on Tim’s screen” reason is no longer as compelling, because of higher-res screens and more uniform console fonts, but the rise of markdown to ubiquity seems like an even better reason not to use them.
Today, you can paste Python code between backticks to mark it as code in markdown; if Python code can contain backticks, that’s no longer true. People who use languages that rely on backticks have been complaining about this for years; do we really want to join them?

Finally, I think you need a fully worked-through example, not just a description of one. Show what the implementation of plot would look like if it could be handed BoundExpression objects. (Although pd.DataFrame.__getitem__ seems like the killer use case here, so maybe show that one instead, even though it’s probably more complicated.)

14.07.19 07:06, Andrew Barnert via Python-ideas wrote:
The more interesting problem is that in the general case you don’t have a simple `price < 1`, but `price < x`, where x is a variable or a more complex expression. price should be evaluated in the callee context while x should be evaluated in the caller context. And how can Python know which is which?

Thank you for your question! It would depend on the implementation of DataFrame.__getitem__. Note that BoundExpression is endowed with the locals and globals of the caller, so it does have access to x in your example. I think the way data.table in R handles this is that, before evaluating the expression, __getitem__ simply adds the columns to locals and then evaluates the expression. In your example, x already exists in locals but price doesn’t, so __getitem__ adds it, and everything needed to evaluate the expression correctly is in place. I think this feature is called "non-standard evaluation" because it lets programmers evaluate expressions in a context other than the standard one.

14.07.19 23:20, Nima Hamidi wrote:
Thank you for your question! It would depend on the implementation of DataFrame.__getitem__. Note that BoundExpression is endowed with the locals and globals of the caller, so it does have access to x in your example. I think the way data.table in R handles this is that, before evaluating the expression, __getitem__ simply adds the columns to locals and then evaluates the expression. In your example, x already exists in locals but price doesn’t, so __getitem__ adds it, and everything needed to evaluate the expression correctly is in place. I think this feature is called "non-standard evaluation" because it lets programmers evaluate expressions in a context other than the standard one.
The problem with this is that you have to know all the column names to avoid conflicts, even the ones you don't use. If new columns are added that conflict with your locals, you could silently get an unexpected result. This is as bad as a star import that overrides your globals or locals. It would be better to mark either free or bound variables explicitly. For example, dt[\price < x].
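A stdlib-only sketch of this hazard (the two dicts are hypothetical stand-ins for a caller's locals and a table's columns): with columns chained in front of the caller's namespace, a newly added column silently shadows a caller variable of the same name.

```python
from collections import ChainMap

caller_locals = {"x": 10, "total": 999}   # the caller's own variable `total`
columns = {"price": 5, "total": 42}       # the table later grew a `total` column

# Columns take priority, as in the data.table-style scheme described above;
# ChainMap looks maps up left to right.
env = dict(ChainMap(columns, caller_locals))

assert eval("price < x", {}, env)       # 5 < 10: works as intended
assert eval("total", {}, env) == 42     # silently the column, not the caller's 999
```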

On Jul 15, 2019, at 01:27, Serhiy Storchaka <storchaka@gmail.com> wrote:
The feature as described allows the library to do whatever it wants with the namespaces, and letting locals take priority over columns, or raising an exception if there’s an ambiguity, are just as easy as letting columns take priority over locals. If one of those options is clearly better, then libraries like Pandas or SQLAlchemy or whatever are going to implement the better one, not the worse one.
It would be better to mark either free or bound variables explicitly. For example, dt[\price < x].
At that point I think you’re better off with the existing syntax, dt[dt.price < x]. When you want to explicitly specify a namespace, that’s what dot syntax already means. Consider the case where dt is a join of two tables d1 and d2. Today you can write dt[d1.price * d2.taxrate < x]. With the proposed new feature, you could presumably write dt[price * taxrate < x], and get an exception if, say, both tables have price columns, but otherwise get exactly what you expected. I assume you think that’s too unclear or magical or whatever? But then I’m not sure how dt[\price * \taxrate < x] is much better.

On Mon, Jul 15, 2019 at 7:17 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Consider the case where dt is a join of two tables d1 and d2. Today you can write dt[d1.price * d2.taxrate < x]. With the proposed new feature, you could presumably write dt[price * taxrate < x], and get an exception if, say, both tables have price columns, but otherwise get exactly what you expected. I assume you think that’s too unclear or magical or whatever? But then I’m not sure how dt[\price * \taxrate < x] is much better.
I'm not 100% sure how joins work in Pandas, but wouldn't it just be dt[dt.price * dt.taxrate < x] ? Once they're joined, you'd just reference columns from the combined table, surely? ChrisA

On Jul 14, 2019, at 13:13, Serhiy Storchaka <storchaka@gmail.com> wrote:
The more interesting problem is that in the general case you don’t have a simple `price < 1`, but `price < x`, where x is a variable or a more complex expression.
I don’t think this one is a problem. (I mean, it does demonstrate the problem I was talking about, that price<1 is a useful, refactorable, etc. value but `price<1` is not, but you don’t need x for that.) I think this is exactly the kind of case the OP was referring to with “eval in a modified environment”, and it works fine.
price should be evaluated in the callee context
It’s actually not even the callee context, but a custom one, maybe something like self.columns. But that isn’t a problem.
while x should be evaluated in the caller context. And how can Python know which is which?
I think the benefit of this proposal is that Python doesn’t _need_ to know which is which; that’s up to the person implementing pandas.DataFrame.__getitem__, and it should be not just possible but easy to implement it as desired. As far as Python is concerned, `price < x` just gets compiled to an AST, and the caller context gets bound to that AST. The callee code then gets to decide how to eval it. The OP didn’t explain exactly how this would be done, but it seems like it should be easy:

    from collections import ChainMap

    def __getitem__(self, key):
        if isinstance(key, BoundExpression):
            localcontext = ChainMap(self.columns, key.locals)
            key = eval(key, key.globals, localcontext)
        # ... existing __getitem__ code from here on

(Note that eval's signature is eval(expr, globals, locals), so localcontext goes in the locals slot, where it takes priority in name lookup.) Since price is in columns and x is in key.locals, they’re both in localcontext, and everything works. And if you (the author of pandas) want locals to take precedence over columns, or want to flag it as an error to use something that’s ambiguous like that, or… any more complicated thing I can come up with, they’re all just as easy to write.
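The sketch can be exercised today with stand-ins. BoundExpression here is a hypothetical class carrying an expression string plus captured namespaces (the proposal would carry a compiled AST), and Col and Frame are toy illustration types; none of these exist in any library.

```python
from collections import ChainMap

class BoundExpression:
    # Hypothetical stand-in: source text plus the caller's namespaces.
    def __init__(self, source, globals_, locals_):
        self.source, self.globals, self.locals = source, globals_, dict(locals_)

class Col(list):
    # Toy column supporting element-wise comparison, vaguely like a Series.
    def __lt__(self, other):
        return [v < other for v in self]

class Frame:
    def __init__(self, **columns):
        self.columns = {k: Col(v) for k, v in columns.items()}

    def __getitem__(self, key):
        if isinstance(key, BoundExpression):
            # Columns win over the caller's locals, as in the sketch above.
            env = dict(ChainMap(self.columns, key.locals))
            key = eval(key.source, key.globals, env)
        # "existing" __getitem__ code: key is now a boolean mask
        return {k: [v for v, keep in zip(col, key) if keep]
                for k, col in self.columns.items()}

x = 1.0
dt = Frame(price=[0.5, 1.5, 0.75])
expr = BoundExpression("price < x", globals(), {"x": x})
assert dt[expr] == {"price": [0.5, 0.75]}  # price from columns, x from the caller
```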

On Jul 14, 2019, at 13:46, Nima Hamidi <hamidi@stanford.edu> wrote:
That’s not a fair example, because you’re ignoring the dot syntax that Pandas already provides, and also leaving out the backticks. So it’s really:

    tips[(tips.size >= 5) | (tips.total_bill > 45)]
    tips[`(size >= 5) | (total_bill > 45)`]

So, while there is still some advantage, it’s not nearly as big. And again, the tradeoff is that you don’t have useful intermediate values anymore. For example, if I want to use tips.size >= 5 repeatedly, or print it out for debugging before using it, etc., I can just do hightips = tips.size >= 5. There’s no way to do the same thing with your version.
I know the difference between function pipelines and function composition, but if nobody was interested in adding an operator for the even simpler compose, I think it’s unlikely that anyone will be interested in adding an operator for pipeline. And certainly not something that looks like %>%. So, this example doesn’t really help sell your proposal. Also, you didn’t answer any of the other issues that have nothing to do with that comparison with @ for compose. Why do you want to make people spell “feed to g” as “feed to g()”? Why shouldn’t it create a function that can be called (or otherwise used) normally? And so on.
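The first-class-function alternative argued for here (a plain left-to-right pipeline, no NSE and no new operator) can be sketched in a few lines; `pipeline` is a hypothetical helper, not an existing stdlib function.

```python
def pipeline(*funcs):
    """Compose left to right: pipeline(f, g)(x) == g(f(x))."""
    def piped(*args, **kwargs):
        result = funcs[0](*args, **kwargs)
        for f in funcs[1:]:
            result = f(result)
        return result
    return piped

double = lambda n: n * 2
increment = lambda n: n + 1

# "feed to increment" is spelled `increment`, not `increment()`:
process = pipeline(double, increment)
assert process(10) == 21  # increment(double(10))
```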

Andrew Barnert wrote:
Also, I disagree that there is no way to get intermediate values. A data-frame can simply have a method like "get" that evaluates its argument in the data-frame's context and _returns_ the value, instead of _subsetting_ the data-frame with it.
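A sketch of that "get" idea under the proposal's assumptions: Frame and its get method are hypothetical, and a plain expression string stands in for a BoundExpression.

```python
from collections import ChainMap

class Frame:
    # Hypothetical data-frame whose `get` evaluates an expression in the
    # frame's context and returns the value, rather than subsetting.
    def __init__(self, **columns):
        self.columns = columns

    def get(self, expr, caller_locals=None):
        env = dict(ChainMap(self.columns, caller_locals or {}))
        return eval(expr, {}, env)

tips = Frame(size=[3, 6, 5], total_bill=[20.0, 50.0, 44.0])

# The intermediate value is now a normal object: reusable, printable.
hightips = tips.get("[s >= 5 for s in size]")
assert hightips == [False, True, True]
```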

Participants (4):
- Andrew Barnert
- Chris Angelico
- Nima Hamidi
- Serhiy Storchaka