Isn't this:

    lambda df: mean(df.arr_delay)

the same as:

    functools.partial(mean, df.arr_delay)
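Spelled out as a runnable comparison (a minimal sketch; mean and the dataframe-like Frame below are toy stand-ins, not pandas):

    from functools import partial

    def mean(xs):
        return sum(xs) / len(xs)

    class Frame:                      # toy stand-in for a dataframe
        arr_delay = [10, 50, 30]

    df = Frame()

    # The lambda looks up df.arr_delay only when it is called;
    # partial binds df.arr_delay at the moment the partial is constructed.
    by_lambda = (lambda d: mean(d.arr_delay))(df)
    by_partial = partial(mean, df.arr_delay)()
    assert by_lambda == by_partial == 30.0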
I kind of like the idea of a pipe operator (though %>% looks just terrible, and | is already taken). But consider: if we could get functools.compose, that too would be alleviated (something like https://mathieularose.com/function-composition-in-python/ ).
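To be clear, functools.compose doesn't exist today; a minimal sketch of what it could look like, built on functools.reduce (right-to-left, like mathematical composition):

    from functools import reduce

    def compose(*funcs):
        # compose(f, g, h)(x) == f(g(h(x)))
        return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)

    add_one = lambda x: x + 1
    double = lambda x: x * 2
    assert compose(add_one, double)(3) == 7   # add_one(double(3))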
On 31 March 2015 at 09:21, Stephan Hoyer <shoyer@gmail.com> wrote:

Macros would be an extremely useful feature for pandas, the main data analysis library for Python (for which I'm a core developer).

Why? Well, right now, R has better syntax than Python for writing data analysis code. The difference comes down to two macros that R developers have written within the past few years.

Here's an example borrowed from the documentation for the dplyr R package [1]:

    flights %>%
      group_by(year, month, day) %>%
      select(arr_delay, dep_delay) %>%
      summarise(arr = mean(arr_delay),
                dep = mean(dep_delay)) %>%
      filter(arr > 30 | dep > 30)

Here "flights" is a dataframe, similar to a table in a spreadsheet. It is also the only global variable in the analysis -- variables like "year" and "arr_delay" are actually columns in the dataframe. R evaluates variables lazily, in the context of the provided frame. In Python, functions like group_by would need to be macros.

The other macro is the "pipe" or chaining operator %>%. This operator is used to avoid the need for many temporary variables or highly nested expressions. The result is quite readable, but again, it needs to be a macro, because group_by and filter are simply functions that take a dataframe as their first argument. The fact that chaining works with plain functions means that it works even on libraries that weren't designed for it. We could do function chaining in Python by abusing an existing binary operator like >> or |, but all the objects on which it works would need to be custom types.

What does this example look like using pandas? Well, it's not as nice, and there's not much we can do about it because of the limitations of Python syntax:

    (flights
     .group_by('year', 'month', 'day')
     .select('arr_delay', 'dep_delay')
     .summarize(arr = lambda df: mean(df.arr_delay),
                dep = lambda df: mean(df.dep_delay))
     .filter(lambda df: (df.arr > 30) | (df.dep > 30)))

(Astute readers will note that I've taken a few liberties with pandas syntax to make it more similar to dplyr.)

Instead of evaluating expressions in the delayed context of a dataframe, we use strings or functions. With all the lambdas there's a lot more noise than in the R example, and it's harder to keep track of what's going on. In principle we could simplify the lambda expressions to not use any arguments (Matthew linked to the GitHub comment where I showed what that would look like [2]), but the code remains awkwardly verbose.

For chaining, instead of using functions and the pipe operator, we use methods. This works fine as long as users are only using pandas, but it means that, unlike R, the Python dataframe is a closed ecosystem. Python developers (rightly) frown upon monkey-patching, so there's no way for external libraries to add their own functions (e.g., for custom plotting or file formats) on an equal footing with the methods built into pandas.

I hope these use cases are illustrative. I don't have strong opinions on the technical merits of particular proposals. The "light lambda" syntax described by Andrew Barnert would at least solve the delayed-evaluation use case nicely, though the colon character is not ideal because it would rule out using light lambdas inside indexing brackets.

Best,
Stephan
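For what it's worth, a toy sketch (the Pipe wrapper is hypothetical, not anything in pandas) of the "abuse an existing binary operator" approach Stephan mentions above:

    class Pipe:
        """Wrap a value so that >> feeds it through plain functions."""
        def __init__(self, value):
            self.value = value

        def __rshift__(self, func):
            return Pipe(func(self.value))

    result = Pipe("hello") >> str.upper >> (lambda s: s + "!")
    assert result.value == "HELLO!"

As the quoted message notes, this only works because Pipe is a custom type; plain functions and existing objects don't get the chaining for free the way %>% provides it in R.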
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/