[Python-ideas] History on proposals for Macros?

Joonas Liik liik.joonas at gmail.com
Tue Mar 31 09:22:11 CEST 2015

Isn't this:
    lambda df: mean(df.arr_delay)
the same as this?
    functools.partial(mean, df.arr_delay)
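
A quick sketch suggests the two differ in when df.arr_delay is looked up.
Here `df` is a hypothetical stand-in built from types.SimpleNamespace (not a
real pandas dataframe), and `mean` is assumed to be statistics.mean:

```python
from functools import partial
from statistics import mean
from types import SimpleNamespace

# Hypothetical stand-in for a dataframe with a single column.
df = SimpleNamespace(arr_delay=[10, 20, 30])

# The lambda looks up arr_delay only when it is called with a frame.
lazy = lambda d: mean(d.arr_delay)

# partial evaluates df.arr_delay immediately, so `df` must already exist,
# and the resulting callable takes no dataframe argument.
eager = partial(mean, df.arr_delay)

# Rebind the column after both callables have been created.
df.arr_delay = [100, 200, 300]

lazy_result = lazy(df)   # computed from the new column
eager_result = eager()   # computed from the list captured earlier
```

So the partial version only works when a global `df` is already in scope,
which is not the case inside a method chain.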

I kind of like the idea of a pipe operator (though %>% looks just terrible and
| is already taken).
But consider: if we could get functools.compose, that would help too
(something like https://mathieularose.com/function-composition-in-python/ ).
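
A minimal sketch of what such a helper might look like. The name `compose`
and the left-to-right application order are assumptions made here for
illustration; functools has no such function today:

```python
from functools import reduce

def compose(*funcs):
    """Hypothetical helper: apply funcs left to right to a single value."""
    def composed(value):
        # Feed the running result through each function in turn.
        return reduce(lambda acc, f: f(acc), funcs, value)
    return composed

add_one = lambda x: x + 1
double = lambda x: x * 2

pipeline = compose(add_one, double)
result = pipeline(3)  # add_one first, then double
```

Left-to-right order reads like a pipeline; the mathematical convention
(right to left) would just reverse `funcs`.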

On 31 March 2015 at 09:21, Stephan Hoyer <shoyer at gmail.com> wrote:

> Macros would be an extremely useful feature for pandas, the main data
> analysis library for Python (for which I'm a core developer).
> Why? Well, right now, R has better syntax than Python for writing data
> analysis code. The difference comes down to two macros that R developers
> have written within the past few years.
> Here's an example borrowed from the documentation for the dplyr R package
> [1]:
> flights %>%
>   group_by(year, month, day) %>%
>   select(arr_delay, dep_delay) %>%
>   summarise(
>     arr = mean(arr_delay),
>     dep = mean(dep_delay)
>   ) %>%
>   filter(arr > 30 | dep > 30)
> Here "flights" is a dataframe, similar to a table in a spreadsheet. It is
> also the only global variable in the analysis -- variables like "year" and
> "arr_delay" are actually columns in the dataframe. R evaluates variables
> lazily, in the context of the provided frame. In Python, functions like
> group_by would need to be macros.
> The other macro is the "pipe" or chaining operator %>%. This operator is
> used to avoid the need for many temporary variables or highly nested
> expressions. The result is quite readable, but again, it needs to be a
> macro, because
> group_by and filter are simply functions that take a dataframe as their
> first argument. The fact that chaining works with plain functions means
> that it works even on libraries that weren't designed for it. We could do
> function chaining in Python by abusing an existing binary operator like >> or
> |, but all the objects on which it works would need to be custom types.
> What does this example look like using pandas? Well, it's not as nice, and
> there's not much we can do about it because of the limitations of Python
> syntax:
> (flights
>  .group_by('year', 'month', 'day')
>  .select('arr_delay', 'dep_delay')
>  .summarize(
>     arr = lambda df: mean(df.arr_delay),
>     dep = lambda df: mean(df.dep_delay))
>  .filter(lambda df: (df.arr > 30) | (df.dep > 30)))
> (Astute readers will note that I've taken a few liberties with pandas
> syntax to make it more similar to dplyr.)
> Instead of evaluating expressions in the delayed context of a dataframe,
> we use strings or functions. With all the lambdas there's a lot more noise
> than the R example, and it's harder to keep track of what's going on. In
> principle we could simplify the lambda expressions to not use any arguments
> (Matthew linked to the GitHub comment where I showed what that would look
> like [2]), but the code remains awkwardly verbose.
> For chaining, instead of using functions and the pipe operator, we use
> methods. This works fine as long as users are only using pandas, but it
> means that unlike R, the Python dataframe is a closed ecosystem. Python
> developers (rightly) frown upon monkey-patching, so there's no way for
> external libraries to add their own functions (e.g., for custom plotting or
> file formats) on an equal footing with the methods built into pandas.
> I hope these use cases are illustrative. I don't have strong opinions on
> the technical merits of particular proposals. The "light lambda" syntax
> described by Andrew Barnert would at least solve the delayed evaluation
> use-case nicely, though the colon character is not ideal because it would
> rule out using light lambdas inside indexing brackets.
> Best,
> Stephan
> [1]
> http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#chaining
> [2] https://github.com/pydata/pandas/issues/9229#issuecomment-69691738
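
On chaining with an existing operator such as >>: a minimal sketch, assuming
a hypothetical wrapper class named `Pipe` (not part of pandas or the standard
library), shows why the left-hand object has to be a custom type while the
functions themselves can stay plain:

```python
class Pipe:
    """Hypothetical wrapper that lets plain functions be chained with >>."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # Apply the plain function and re-wrap so the chain can continue.
        return Pipe(func(self.value))

def keep_large(values):
    # An ordinary function that knows nothing about Pipe.
    return [v for v in values if v > 30]

result = (Pipe([10, 40, 50]) >> keep_large >> sum).value
```

A bare list on the left would not work, since list does not define >>.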
