Chained filtering with lazy evaluation ("where")
Dear pandas devs,

Like most (I think) of you, I love how pandas supports method chaining. And like several other users, I get frustrated when I have to break a chained sequence of calls because a given operation cannot be included. See for instance:

https://stackoverflow.com/q/11869910/2858145
https://stackoverflow.com/q/40028500/2858145
https://stackoverflow.com/q/44912692/2858145

I ended up noticing that most of the time the problematic operation is a filtering, since it is typically done as

    df.loc[condition_on(df)]

e.g.

    df.loc[df['a'] > 3]

In R, we would do (something more similar to)

    df.loc[a > 3]

... but we can't in Python syntax. This is not usually a huge deal - one could even claim that "df[df['a'] > 3]" is nicer because it is more explicit. Still, when it is not "df" but rather a five-line chain of calls, one needs to first create the intermediate object and then filter it, which is annoying.

There are a couple of other solutions - df.filter, adding an ad-hoc method to pandas objects... - but I never found any of them general and/or Pythonic enough. So I tried an alternative: lazy evaluation. It took relatively few lines of code, and after some weeks of use I am really satisfied with the result:

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

(do not bother with the rest of the repo; the file works as a standalone module). This allows one to replace

    df.loc[df['a'] > 2]

with

    df.loc[W['a'] > 2]

... and to apply virtually any operation one would apply to df (more precisely, any operation... which is chainable).¹ As a bonus, one can write a condition once and reuse it to filter several pandas objects.
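[For illustration, a minimal sketch of how such a lazy "W" object can work - hypothetical, not the actual where.py, which differs in implementation: every operation returns a new proxy that remembers what to do, and .loc, which accepts callables, finally invokes the proxy with the real frame.]

```python
import pandas as pd

class Where:
    """Lazy proxy: records chained operations, replays them on the frame."""
    def __init__(self, func=lambda df: df):
        self._func = func

    def __call__(self, df):
        # invoked by .loc with the (possibly intermediate) frame
        return self._func(df)

    def __getitem__(self, key):
        return Where(lambda df: self._func(df)[key])

    def __getattr__(self, name):
        # defer arbitrary (chainable) method calls, e.g. W['a'].abs()
        def deferred(*args, **kwargs):
            return Where(lambda df: getattr(self._func(df), name)(*args, **kwargs))
        return deferred

    def __gt__(self, other):
        return Where(lambda df: self._func(df) > other)

W = Where()

df = pd.DataFrame({'a': [1, -4, 5]})
plain = df.loc[df['a'] > 3]         # the usual spelling
lazy = df.loc[W['a'] > 3]           # same rows, but the condition is reusable
reused = df.loc[W['a'].abs() > 3]   # a chained method call also works
```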
I'm writing this email to ask:

- whether you have in mind some alternative solution I did not consider to the problem of "unchainable filterings"
- whether you have suggestions on how to improve my solution
- whether you think this is worth merging into pandas (the amount of monkey patching required is so small that it is not burdensome to keep it separate - it just means one more dependency for users who want it)

For the record: it currently works only in .loc, and I don't expect this to change: I guess pd.{Series,DataFrame}.__getitem__ already supports too many different mechanisms. Supporting .loc as a setter should instead be pretty straightforward - it is just lower priority, as it is not used in chaining.

Pietro

¹ Only exception (I know of) at the moment: W.loc(axis=1)[.] won't work, because I "taught" it that "loc" is not a callable. Shouldn't be hard to fix.
FYI, `df.loc[lambda x: x['a'] > 3]` is valid: .loc takes a callable and evaluates it with the NDFrame as the first (and only) argument.

So the downside is now that `lambda x:` is a bit more to type than `W`, but it's not so bad. And if you have a pre-defined function for filtering, it's `df.loc[condition_on]`, which is the shortest (but maybe not clearest) way of spelling that.

- Tom

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston <ml@pietrobattiston.it> wrote:
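[A small illustration of Tom's point, not from the thread: the callable slots directly into a chain, because it receives the intermediate frame produced by the previous step, which has no name of its own.]

```python
import pandas as pd

# The lambda inside .loc receives the frame that .assign() just produced,
# so no temporary variable is needed to express the filter.
df = pd.DataFrame({'a': [1, 4, 5], 'b': [10, 20, 30]})

result = (
    df
    .assign(c=lambda x: x['a'] + x['b'])
    .loc[lambda x: x['a'] > 3]       # x is the frame returned by assign()
    .reset_index(drop=True)
)
```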
On Thu, 15 Mar 2018 at 12:58 -0500, Tom Augspurger wrote:
FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable, and evaluates it with the NDFrame as the first (only) argument.
Aha! I knew one could pass callables, but I had mistakenly assumed the mechanism was analogous to df.apply(), i.e. that the callable received rows/elements rather than the NDFrame itself. I think I like my solution better... but for sure adding it to pandas would duplicate already-present functionality.

Thanks (to you and Chris) for the pointer,

Pietro
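[To make the distinction concrete - a hypothetical illustration, not from the thread: df.apply passes each column (or row) to the function, while a callable inside .loc is called once with the whole frame.]

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5]})

# .apply (axis=0, the default) calls the function once per *column*:
seen_by_apply = []
df.apply(lambda col: seen_by_apply.append(col.name) or col.sum())

# a callable in .loc is called exactly once, with the whole *frame*:
seen_by_loc = []
def condition(frame):
    seen_by_loc.append(type(frame).__name__)
    return frame['a'] > 1

filtered = df.loc[condition]
```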
If you're not aware, we do have one (IMO ugly) solution for this using lambdas:

    function_making_df().loc[lambda x: x['a'] > 3]

There is also some prior art: pandas_ply [1] and dplython [2]. I had a WIP PR adding a version of pandas_ply to pandas [3], but never finished it out; there was some concern about the API expansion. I was ultimately using a lambda as the delivery mechanism.

I am in favor of the general concept, though I wonder if there is a better long-term solution around expansion of the Python language for some kind of light macro support and/or a fully delayed expression system, à la ibis.

[1] https://github.com/coursera/pandas-ply
[2] https://github.com/dodger487/dplython
[3] https://github.com/pandas-dev/pandas/pull/14209

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston <ml@pietrobattiston.it> wrote:
On Thu, 15 Mar 2018 at 13:03 -0500, Chris Bartak wrote:
[...] There is also some prior art: pandas_ply [1] and dplython [2]. I had a WIP PR adding a version of pandas_ply to pandas [3], but never finished it out; there was some concern about the API expansion. I was ultimately using a lambda as the delivery mechanism.

I am in favor of the general concept, though I wonder if there is a better long-term solution around expansion of the Python language for some kind of light macro support and/or a fully delayed expression system, à la ibis.

[1] https://github.com/coursera/pandas-ply
[2] https://github.com/dodger487/dplython
[3] https://github.com/pandas-dev/pandas/pull/14209
Funny... basically the same API, completely different implementation. I guess I will steal the idea of making it callable (avoiding monkey-patching, and supporting assign())... thanks!

Pietro
I might be missing the point, but can you use .pipe()?

    In [1]: df = pd.util.testing.makeTimeDataFrame()

    In [2]: df
    Out[2]:
                       A         B         C         D
    2000-01-03 -0.870800  0.517496 -1.129341  1.074059
    2000-01-04 -0.102295  1.811238 -2.080829 -1.145249
    2000-01-05 -0.608380 -0.754805  1.196582  1.480967
    2000-01-06  0.358763 -0.929273  0.190293  0.191154
    2000-01-07  1.984208  0.579810 -0.369664  1.583910
    ...              ...       ...       ...       ...
    2000-02-07  0.917228 -0.200213  0.893922 -0.960147
    2000-02-08  0.490313  0.728865 -0.978162  1.028735
    2000-02-09  1.415720 -0.855196  1.868628 -0.247138
    2000-02-10  0.613818  0.488457 -1.042366 -1.831410
    2000-02-11 -1.433825  0.062954 -0.856178 -0.273247

    [30 rows x 4 columns]

    In [3]: df.pipe(lambda x: x[x.A > .1])
    Out[3]:
                       A         B         C         D
    2000-01-06  0.358763 -0.929273  0.190293  0.191154
    2000-01-07  1.984208  0.579810 -0.369664  1.583910
    2000-01-10  0.872874 -1.378924  0.644806  0.988295
    2000-01-11  0.252953 -0.181655  0.049428  0.545417
    2000-01-13  0.602725 -0.221286 -0.208824 -0.913126
    ...              ...       ...       ...       ...
    2000-02-04  0.319361 -0.664777 -0.460101  0.111564
    2000-02-07  0.917228 -0.200213  0.893922 -0.960147
    2000-02-08  0.490313  0.728865 -0.978162  1.028735
    2000-02-09  1.415720 -0.855196  1.868628 -0.247138
    2000-02-10  0.613818  0.488457 -1.042366 -1.831410

    [17 rows x 4 columns]

    In [4]: df.pipe(lambda x: x[x.A > .1]).pipe(lambda x: x[x.B > .1])
    Out[4]:
                       A         B         C         D
    2000-01-07  1.984208  0.579810 -0.369664  1.583910
    2000-01-25  0.724618  2.134328  0.269921  1.633488
    2000-01-26  1.011798  0.989021 -1.472997  0.849001
    2000-02-02  0.300020  0.490800  1.786019  1.389062
    2000-02-03  0.729878  0.341635 -0.972437 -0.670142
    2000-02-08  0.490313  0.728865 -0.978162  1.028735
    2000-02-10  0.613818  0.488457 -1.042366 -1.831410

On Thu, Mar 15, 2018 at 2:39 PM, Pietro Battiston <ml@pietrobattiston.it> wrote:
On Thu, 15 Mar 2018 at 15:10 -0400, Justin Lewis wrote:
I might be missing the point but can you use .pipe()?
Indeed, this is something else I had not considered. However, I don't like it too much. Compare

    .loc[W]

with

    .pipe(lambda df : df[df])

By the way,

    .loc[lambda df : df[df]]

is equivalent but cleaner to me (after all, we are selecting).

This said, the solutions proposed by you and Chris are indeed more robust than mine. For instance,

    .loc[W + 1 > 2]

works, but

    .loc[2 < 1 + W]

doesn't, and I don't even know if a fix is possible.

Pietro
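[A fix does seem possible, at least in a simplified model - a hypothetical sketch, not the actual where.py: when the left operand is an int, Python's `int.__add__`/`int.__lt__` return NotImplemented for an unknown type, so the interpreter falls back to the proxy's reflected dunders (`__radd__`, and `__gt__` as the reflection of `<`).]

```python
import pandas as pd

class Lazy:
    """Tiny lazy-expression proxy with reflected operators."""
    def __init__(self, func=lambda df: df):
        self._func = func

    def __call__(self, df):               # .loc accepts callables
        return self._func(df)

    def __getitem__(self, key):
        return Lazy(lambda df: self._func(df)[key])

    def __add__(self, other):             # handles `W[...] + 1`
        return Lazy(lambda df: self._func(df) + other)

    def __radd__(self, other):            # handles `1 + W[...]`
        return Lazy(lambda df: other + self._func(df))

    def __gt__(self, other):              # also reached for `2 < expr`
        return Lazy(lambda df: self._func(df) > other)

    def __lt__(self, other):              # also reached for `2 > expr`
        return Lazy(lambda df: self._func(df) < other)

W = Lazy()
df = pd.DataFrame({'a': [1, 2, 3]})

# `2 < 1 + W['a']`: int yields NotImplemented twice, so Python falls
# back to Lazy.__radd__ and then Lazy.__gt__, keeping evaluation lazy.
filtered = df.loc[2 < 1 + W['a']]
```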
If you feel like being evil, you can use a so-called "frame hack" + a context manager:

    In [1]: import sys
       ...: import numpy as np
       ...: import pandas as pd
       ...:
       ...: class ctx:
       ...:     def __init__(self, df):
       ...:         self.df = df
       ...:         current_frame = sys._getframe(0)
       ...:         self.locals = current_frame.f_back.f_locals
       ...:         self.existing_values = {
       ...:             k: self.locals[k] for k in df.columns
       ...:             if k in self.locals
       ...:         }
       ...:         self.new_values = {k for k in df.columns if k not in self.locals}
       ...:
       ...:     def __enter__(self):
       ...:         for k in self.df.columns:
       ...:             self.locals[k] = self.df[k]
       ...:
       ...:     def __exit__(self, *exc):
       ...:         self.locals.update(self.existing_values)
       ...:         for k in self.new_values:
       ...:             del self.locals[k]

    In [2]: df = pd.DataFrame({'a': np.array([1, 2], dtype='float32')})

    In [3]: try:
       ...:     a + 1
       ...: except NameError:
       ...:     print("'a' doesn't exist yet!")
       ...:
    'a' doesn't exist yet!

    In [4]: with ctx(df):
       ...:     print(df[a == 1])
       ...:
         a
    0  1.0

    In [5]: try:
       ...:     a + 1
       ...: except NameError:
       ...:     print("'a' doesn't exist yet!")
       ...:
    'a' doesn't exist yet!

On Thu, Mar 22, 2018 at 10:35 AM Pietro Battiston <ml@pietrobattiston.it> wrote:
Sounds like someone's been learning from David Beazley :)

Now just define DataFrame.__enter__ to pass `self` to `ctx`, and write it as

```
with df:
    print(df[a == 1])
```

On Thu, Mar 22, 2018 at 10:24 AM, Phillip Cloud <cpcloud@gmail.com> wrote:
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
participants (5)

- Chris Bartak
- Justin Lewis
- Phillip Cloud
- Pietro Battiston
- Tom Augspurger