On Mar 30, 2015, at 23:21, Stephan Hoyer <shoyer@gmail.com> wrote:
Macros would be an extremely useful feature for pandas, the main data analysis library for Python (for which I'm a core developer).
Why? Well, right now, R has better syntax than Python for writing data analysis code. The difference comes down to two macros that R developers have written within the past few years.
Here's an example borrowed from the documentation for the dplyr R package [1]:
flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay), dep = mean(dep_delay) ) %>% filter(arr > 30 | dep > 30)
Here "flights" is a dataframe, similar to a table in spreadsheet. It is also the only global variables in the analysis -- variables like "year" and "arr_delay" are actually columns in the dataframe. R evaluates variables lazily, in the context of the provided frame. In Python, functions like groupby_by would need to be macros.
The other macro is the "pipe" or chaining operator %>%. This operator is used to avoid the need many temporary or highly nested expressions. The result is quite readable, but again, it needs to be a macro, because group_by and filter are simply functions that take a dataframe as their first argument. The fact that chaining works with plain functions means that it works even on libraries that weren't designed for it. We could do function chaining in Python by abusing an exist binary operator like >> or |, but all the objects on which it works would need to be custom types.
What does this example look using pandas? Well, it's not as nice, and there's not much we can do about it because of the limitations of Python syntax:
(flights .group_by('year', 'month', 'day') .select('arr_delay', 'dep_delay') .summarize( arr = lambda df: mean(df.arr_delay)), dep = lambda df: mean(df.dep_delay))) .filter(lambda df: (df.arr > 30) | (df.dep > 30)))
(Astute readers will note that I've taken a few liberties with pandas syntax to make more similar to dplyr.)
Instead of evaluating expressions in the delayed context of a dataframes, we use strings or functions. With all the lambdas there's a lot more noise than the R example, and it's harder to keep track of what's on. In principle we could simplify the lambda expressions to not use any arguments (Matthew linked to the GitHub comment where I showed what that would look like [2]), but the code remains awkwardly verbose.
For chaining, instead of using functions and the pipe operator, we use methods. This works fine as long as users are only using pandas, but it means that unlike R, the Python dataframe is a closed ecosystem. Python developers (rightly) frown upon monkey-patching, so there's no way for external libraries to add their own functions (e.g., for custom plotting or file formats) on an equal footing to the methods built-in to pandas.
One way around this is to provide a documented, clean method for hooking your types--e.g., a register classmethod that then makes the function appear as a method in all instances. Functionally this is the same as monkeypatching, but it looks a lot more inviting to the user. (And it also allows you to rewrite things under the covers in ways that would break direct monkeypatching, if you ever want to.) There are more examples of opening up modules this way than classes, but it's the same idea.
I hope these use cases are illustrative. I don't have strong opinions on the technical merits of particular proposals. The "light lambda" syntax described by Andrew Barnert would at least solve the delayed evaluation use-case nicely, though the colon character is not ideal because it would rule out using light lambdas inside indexing brackets.
The bare colon was just one of multiple suggestions that came up last time the idea was discussed (last February or so), and in previous discussions (going back to the long-abandoned PEP 312). In many cases, it looks very nice, but in others it's either ugly or (at least to a human) ambiguous without parens (which obviously defeat the whole point). I don't think anyone noticed the indexing issue, but someone (Nick Coghlan, I think) pointed out essentially the same issue in dict displays (e.g., for making a dynamic jump table). If you seriously want to revive that discussion--which might be worth doing, if your use cases are sufficiently different from the ones that were discussed (Tkinter and RPC server callbacks were the primary motivating case)--somewhere I have notes on all the syntaxes people suggested that I could dump on you. For the strings, another possibility is a namespace object that can effectively defer attribute lookup and have it done by the real table when you get it, as done by various ORMs, appscript, etc. Instead of this:
(flights .group_by('year', 'month', 'day')
... You write:
(flights .group_by(c.year, c.month, c.day)
That "c" is just an instance of a simple type that wraps its __getattr__ argument up so you can access it later--when get an argument to group_by that's one of those wrappers, you look it up on the table. It's the same number of characters, and looks a little more magical, but arguably it's more readable, at least once people are used to your framework. For example, here's a query expression to find broken tracks (with missing filenames) in iTunes via appscript that was used in a real application (until iTunes Match made this no longer work...): playlist.tracks[its.location == k.missing] Both its and k are special objects you import from appscript; its.__getattr__ returns a wrapper that's used to look up "location" in the members of whatever class playlist.tracks instances turn out to be at runtime, and k.missing returns a wrapper that's used to look up "missing" in the global keywords of whatever app playlist turns out to be part of at runtime. You can take this even further: if mean weren't a function that takes a column and returns its mean, but instead a function that takes a column name and returns a function that takes a table and returns the mean of the column with that name, then you could just write this:
.summarize( arr = mean(c.arr_delay)
But of course that isn't what mean is. But it can be what, say, f.mean is, if f is another one of those attribute-delaying objects: f.mean(c.arr_delay) returns a wrapper that summarize can use to call the function named "mean" on the column named "arr_delay". So, the whole thing reduces to:
(flights .group_by(c.year, c.month, c.day) .select(c.arr_delay, c.dep_delay) .summarize( arr = f.mean(c.arr_delay)), dep = f.mean(c.dep_delay))) .filter(c.arr > 30 | c.dep > 30)