On Mar 31, 2015, at 00:22, Joonas Liik <liik.joonas@gmail.com> wrote:

isn't this:
    lambda df: mean(df.arr_delay)
the same as..
    mean(df.arr_delay)

No, because in the first one df is a parameter (which gets the value of whatever DataFrame this is run on) while in the second it's a free variable (which just raises NameError, if you're lucky).
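To make the distinction concrete, here's a minimal sketch (using statistics.mean and a SimpleNamespace as a stand-in for a DataFrame):

```python
from statistics import mean
from types import SimpleNamespace

# Parameter version: df is bound only when the function is called.
f = lambda df: mean(df.arr_delay)

frame = SimpleNamespace(arr_delay=[10, 20, 30])
print(f(frame))  # 20

# Free-variable version: df is looked up immediately, and since no
# global df exists here, it raises NameError.
try:
    mean(df.arr_delay)
except NameError as err:
    print(err)  # name 'df' is not defined
```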

You could do this with operator.attrgetter and composition, of course:

    compose(mean, attrgetter('arr_delay'))

But I don't think that's more readable. Even if composition were an infix operator:

    (mean . attrgetter('arr_delay'))

I kind of like the idea of a pipe operator (though %>% looks just terrible and | is already taken). But consider: if we could get functools.compose, that too would be alleviated.
    (something like https://mathieularose.com/function-composition-in-python/ )

Why do you need to "get functools.compose"? If you just want the trivial version from that blog post, it's two lines that any novice can write himself. If you want one of the more complex versions that he dismisses, then there might be a better argument, but the post you linked argues against that, not for it. And if you don't trust yourself to write the two lines yourself, you can always pip install funcy or toolz or functional3 or more-functools.
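For reference, the trivial version amounts to this (a sketch along the lines of the linked post, not a proposed stdlib API):

```python
from functools import reduce
from operator import attrgetter
from statistics import mean
from types import SimpleNamespace

def compose(*funcs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

mean_arr_delay = compose(mean, attrgetter('arr_delay'))

df = SimpleNamespace(arr_delay=[10, 20, 30])
print(mean_arr_delay(df))  # 20
```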

As far as using compose again for piping... Well, it's backward from what you want, and it also requires you to stack up enough parens to choke a Lisp guru, not to mention all the repetitions of compose itself (that's why Haskell has infix compose and apply operators with the precedence and associativity they have, so you can avoid all the parens).
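A forward "pipe" helper sidesteps both problems, reading left to right with no nesting (again a sketch, not a proposal for a particular stdlib spelling):

```python
def pipe(value, *funcs):
    """Thread a value through funcs left to right: pipe(x, f, g) == g(f(x))."""
    for func in funcs:
        value = func(value)
    return value

print(pipe("  hello  ", str.strip, str.upper))  # HELLO
```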

On 31 March 2015 at 09:21, Stephan Hoyer <shoyer@gmail.com> wrote:
Macros would be an extremely useful feature for pandas, the main data analysis library for Python (for which I'm a core developer).

Why? Well, right now, R has better syntax than Python for writing data analysis code. The difference comes down to two macros that R developers have written within the past few years.

Here's an example borrowed from the documentation for the dplyr R package [1]:

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay),
    dep = mean(dep_delay)
  ) %>%
  filter(arr > 30 | dep > 30)

Here "flights" is a dataframe, similar to a table in a spreadsheet. It is also the only global variable in the analysis -- variables like "year" and "arr_delay" are actually columns in the dataframe. R evaluates variables lazily, in the context of the provided frame. In Python, functions like group_by would need to be macros.

The other macro is the "pipe" or chaining operator %>%. This operator is used to avoid the need for many temporary variables or highly nested expressions. The result is quite readable, but again, it needs to be a macro, because group_by and filter are simply functions that take a dataframe as their first argument. The fact that chaining works with plain functions means that it works even on libraries that weren't designed for it. We could do function chaining in Python by abusing an existing binary operator like >> or |, but all the objects on which it works would need to be custom types.
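For illustration, here is what abusing | looks like with a small wrapper (a hypothetical Pipe class, not an API from any existing library):

```python
class Pipe:
    """Wraps a value so that `wrapped | func` applies func and re-wraps."""
    def __init__(self, value):
        self.value = value
    def __or__(self, func):
        return Pipe(func(self.value))

# Only works because Pipe is a custom type that overloads |:
result = (Pipe([3, 1, 2])
          | sorted
          | (lambda xs: [x * 10 for x in xs])
          | sum).value
print(result)  # 60
```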

What does this example look like using pandas? Well, it's not as nice, and there's not much we can do about it because of the limitations of Python syntax:

    (flights
     .group_by('year', 'month', 'day')
     .select('arr_delay', 'dep_delay')
     .summarize(
        arr=lambda df: mean(df.arr_delay),
        dep=lambda df: mean(df.dep_delay))
     .filter(lambda df: (df.arr > 30) | (df.dep > 30)))

(Astute readers will note that I've taken a few liberties with pandas syntax to make it more similar to dplyr.)

Instead of evaluating expressions in the delayed context of a dataframe, we use strings or functions. With all the lambdas there's a lot more noise than in the R example, and it's harder to keep track of what's going on. In principle we could simplify the lambda expressions to not use any arguments (Matthew linked to the GitHub comment where I showed what that would look like [2]), but the code remains awkwardly verbose.

For chaining, instead of using functions and the pipe operator, we use methods. This works fine as long as users are only using pandas, but it means that unlike R, the Python dataframe is a closed ecosystem. Python developers (rightly) frown upon monkey-patching, so there's no way for external libraries to add their own functions (e.g., for custom plotting or file formats) on an equal footing with the methods built into pandas.
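One partial escape hatch is a single generic hook on the dataframe that lets external functions join a method chain. A toy sketch (the Frame class and its pipe method are hypothetical, not the pandas API being discussed):

```python
class Frame:
    """Toy dataframe-like object with one generic extension hook."""
    def __init__(self, data):
        self.data = data
    def pipe(self, func, *args, **kwargs):
        # Third-party functions plug into the method chain here,
        # no monkey-patching required.
        return func(self, *args, **kwargs)

# Imagine this function lives in an external plotting/IO library:
def double(frame):
    return Frame([x * 2 for x in frame.data])

result = Frame([1, 2, 3]).pipe(double).pipe(double)
print(result.data)  # [4, 8, 12]
```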

I hope these use cases are illustrative. I don't have strong opinions on the technical merits of particular proposals. The "light lambda" syntax described by Andrew Barnert would at least solve the delayed evaluation use-case nicely, though the colon character is not ideal because it would rule out using light lambdas inside indexing brackets.


Python-ideas mailing list
Code of Conduct: http://python.org/psf/codeofconduct/