[Pandas-dev] GroupBy Overhaul Proposal

Joris Van den Bossche jorisvandenbossche at gmail.com
Thu Jul 19 02:40:05 EDT 2018


Will, thanks for starting this! (After the sprint I was also thinking about
the need to refactor the groupby code :-))

Lots of discussion has happened, and it will take some time to digest, but
I quickly want to react to the 'apply' discussion:

IMO, apply should basically be syntactic sugar for the following:

keys = []
results = []

for name, group in df.groupby(grouper):
    res = func(group)
    results.append(res)
    keys.append(name)

pd.concat(results, keys=keys)

(much simplified of course; e.g. when the result for each group is a Series
and not a DataFrame, the default concat is not what we want)
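To make that concrete, here is a minimal runnable version of the sketch (the DataFrame and func are made up purely for illustration):

```python
import pandas as pd

# made-up example data and UDF, purely for illustration
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 2.0, 4.0]})

def func(group):
    # demean 'val' within each group; returns a DataFrame per group
    return group[["val"]] - group[["val"]].mean()

keys = []
results = []
for name, group in df.groupby("key"):
    results.append(func(group))
    keys.append(name)

# the group names become the outer level of the result's index
manual = pd.concat(results, keys=keys)
```

Whether groupby().apply(func) produces exactly this for such a same-index result depends on the version-specific guesswork discussed in this thread, which is the point of pinning down the semantics.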

And I personally think it is useful to have something like the above as a
general apply method for UDFs in groupby.

It is certainly true that the current apply implementation has
inconsistencies and magical behaviours, but I think we can deprecate those
instead of deprecating the full method. See
https://github.com/pandas-dev/pandas/issues/13056 for some comments about
this (eg on deprecating the magical 'transform' behaviour).

Apart from that, it is still a fact that a user who doesn't know all the
details will quickly turn to apply (rather than to agg), just because of
its name, and then get e.g. bad performance.
I am not directly sure how to solve this. We could maybe warn in certain
obvious cases (like apply(np.sum))? Although warnings can also become
annoying.

Joris

2018-07-18 2:01 GMT-05:00 Pietro Battiston <me at pietrobattiston.it>:

> Il giorno mar, 17/07/2018 alle 16.10 -0700, William Ayd ha scritto:
> > > In fact, my preference for keeping apply is pretty weak as long as
> > > there are alternatives that cover each of its use cases. But again,
> > > I'm
> > > not sure this is true.
> >
> > Just to clarify my position:
> >
> >       1. .apply() + UDF reducing to a scalar should be replaceable
> > with .agg() + same UDF (even though there are differences today…)
> >       2. .apply() + UDF returning Series / DataFrame / collection
> > doesn’t have anything else to cover it
>
> .transform() at least covers the case in which the shape of the chunk
> is unchanged.
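For reference, the shape-preserving case that .transform() covers, with made-up data:

```python
import pandas as pd

# illustrative data; transform returns one value per input row,
# so the result keeps the original shape and index
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 5.0]})
out = df.groupby("key")["val"].transform(lambda s: s - s.mean())
```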
>
> > But with #2 above I think it's dangerous to assume that .apply can
> > always do the “right thing” with those types of inputs. We don’t make
> > any assertions about the indexing / labeling of returned Series and
> > DataFrames.
>
> There is a simple way to stop throwing magic at users, and it is to
> clearly document which cases .apply() covers (and which should be
> covered by .agg() or transform()), reflecting the actual guesswork
> taking place in the code.
> By the way, my understanding (without having looked at the code) is
> that
> UDF returns Series -> concat in a new Series
> UDF returns DataFrame -> concat in a new DataFrame
> and the guesswork mostly concerns understanding whether the new index
> is the same as the old. Am I missing anything relevant?
>
>
> Now, I would be all for suppressing a complicated function by replacing
> it with simpler ways to do the same thing. But for instance I would
> like the following to still work with groupby().something():
>
> def remove_group_outliers(group):
>     outliers = ...  # code to identify them
>     return group[~group.index.isin(outliers)]
>
> ... and I currently don't see any way but .apply().
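A runnable version of that pattern, with a hypothetical outlier rule (absolute deviation from the group median) filled in only to make it self-contained:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a"] * 4 + ["b"] * 4,
                   "val": [1, 2, 3, 100, 4, 5, 6, -50]})

def remove_group_outliers(group):
    # hypothetical rule: drop rows more than 10 away from the group median
    dev = (group["val"] - group["val"].median()).abs()
    outliers = group.index[dev > 10]
    return group[~group.index.isin(outliers)]

# group_keys=False keeps the original row index in the result
cleaned = df.groupby("grp", group_keys=False).apply(remove_group_outliers)
```

Because the UDF returns an arbitrary subset of rows (neither a reduction nor shape-preserving), neither .agg() nor .transform() covers it.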
>
> > As far as collections are concerned I’m not sure if there will be a
> > clear answer on how to handle those assuming we start getting EAs
> > that add first-class support for those.
>
> Do you have any pointer/example? I'm missing the relation between
> collections and .apply().
>
>
> > > Unless I'm wrong, #18366 is orthogonal to what we are discussing:
> > > unnamed lambdas would remain unnamed lambdas.
> > > (And the obvious solution to my eyes is to use named methods instead)
> >
> > I don’t think this is orthogonal. Your concern is valid on lambdas
> > and I don’t know what the solution there is (perhaps some kind of
> > keyword argument) but without getting tripped up on that I don't
> > think it's immediately apparent that the returned object for a
> > DataFrame with columns 'a', 'b', 'c' will have a single column
> > level when called as follows:
> >
> >  - df.groupby('a').agg(sum)
> >  - df.groupby('a').agg({'b': sum, 'c': min})
> >
> > Yet the following will yield a MultiIndex column:
> >
> >  - df.groupby('a').agg([sum])
> >  - df.groupby('a').agg({'b': [sum], 'c': min})
>
> The rule is not very complicated either (if correctly documented), but
> anyway, the inconsistency would disappear by just having the first two
> examples also return a MultiIndex.
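For reference, a quick check of the inconsistency in question (toy data; using the string "sum" rather than the builtin, which pandas treats the same way):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30], "c": [1, 2, 3]})

# a single function -> flat columns
flat = df.groupby("a").agg("sum")

# a list of functions (even of length one) -> an extra column level
nested = df.groupby("a").agg(["sum"])
```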
>
> ... and maybe provide the users a very simple way to flatten
> MultiIndexes (see below).
>
>
> > If you reduce the returned columns to “‘sum’ of ‘b’” and “‘min’ of
> > ‘c’” you can ensure that the returned columns have the same number of
> > levels regardless of call signature,
> > AND have the added bonus of not obfuscating what type of aggregation
> > was performed with the former two examples.
>
> Both can be solved through a MI, or through an Index(dtype=object)
> containing tuples.
>
> > Of course the end user may ultimately decide that they don’t like
> > those labels at all and completely override them, but that effort
> > becomes much easier if they can make guarantees around the number of
> > levels of the returned object
>
> I agree on this
>
> >  (especially if it’s just one!).
>
> ... not on that.
>
> MI (or tuples) -> arbitrary strings
>
> is much simpler/cleaner to do than
>
> arbitrary strings -> MI (or tuples)
>
> >
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of "df[[col for
> > > col in df.columns if col.startswith('Sum of')]]". Compare to
> > > "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
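Concretely, with toy 'a'/'b'/'c' columns as in Will's examples:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30], "c": [1, 2, 3]})
res = df.groupby("a").agg(["sum", "min"])

# select the 'sum' columns across all top-level column labels
sums = res.loc[:, (slice(None), "sum")]
```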
>
>
> > I don’t think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None) # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.
>
>
> > and it starts taking us down the path of advanced indexing for what
> > may start off to the end user as a very simple aggregation exercise.
>
> On this I agree with you. I'm all for providing
>
> - a MultiIndex.flatten() method which allows me to do
> res.columns = res.columns.flatten("{} of {}".format)
>
> - a simple way to do the above in-line (which is already being
> discussed, regardless of groupby)
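To be clear, MultiIndex.flatten() does not exist today; the proposed behaviour can be emulated with a list comprehension over the column tuples, roughly:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})
res = df.groupby("a").agg(["sum", "min"])

# emulate the proposed res.columns.flatten("{} of {}".format),
# here choosing the (aggregation, column) ordering for the labels
res.columns = ["{} of {}".format(agg, col) for col, agg in res.columns]
```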
>
> > [...]
> > > - it would be the only case in pandas in which we decide how to
> > > call a
> > > column on behalf of the user
> >
> > Well we have to do something to reduce ambiguity…I think a consistent
> > naming convention and dimension for the columns across all
> > invocations is strongly preferable to inserting a column level some
> > of the time.
>
> Again, I agree on this.
>
> >
> > > - if one wants to allow the user to name the columns according to her
> > > taste, it's pretty simple to introduce an argument which takes a
> > > string
> > > to be .format()ted with the name of the column (or even of the
> > > method),
> > > e.g. name="Sum of {}"
> >
> > Agreed. In my head I feel like this defaults to something like
> > f"{fname} of {colname}" but gives the user potentially the option to
> > override. By default keep the same number of levels as what is being
> > passed in, though maybe None as an argument reverts to the old style
> > behavior of simply inserting a new column index level.
>
> Agree on everything but the default, again, because it is arbitrary
>
>
> > > By the way, despite some related issues, I still think tuples can
> > > be
> > > first class citizens of flat indexes. So if one doesn't like
> > > MultiIndexes, or they do not fit one's needs, ("sum", "A") can well
> > > be
> > > a label in a regular index.
> >
> > You know better than I do here, but again I don’t think it makes for
> > a good user experience to convert columns with one level into
> > multiple levels after a GroupBy operation regardless of how you could
> > subsequently access those values.
>
> Notice that I'm not talking about a MultiIndex, but about a flat index.
> But it is an inferior solution, given the API we already expose, to the
> MultiIndex.
>
>
> Pietro
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>