Pandas: How does df.apply(lambda work to create a result

Thu May 27 20:45:04 EDT 2021

Disclaimer: I haven't actually used pandas.

On 26May2021 14:45, Veek M <veekm at foo.com> wrote:
>t = pd.DataFrame([[4,9],]*3, columns=['a', 'b'])
>   a  b
>0  4  9
>1  4  9
>2  4  9

I presume you've printed "t" here. So the above table is str(t). Or 
possibly repr(t) if you were at the interactive prompt. It is a human 
readable printout of "t".

>t.apply(lambda x: [x]) gives
>a    [[1, 2, 2]]
>b    [[1, 2, 2]]
>How?? When you 't' within console the entire data frame is dumped but how are
>the individual elements passed into .apply()?

The doco for .apply seems to be here:

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

When you go "t.apply(....)", the class implementing "t" has its .apply() 
method called - this is what does the work. So "t" is a DataFrame, so 
you're calling DataFrame.apply as documented at the above link.

From the docs, the default is axis=0, which means it takes each column 
in the DataFrame and passes it to the lambda function as a Series, and 
produces a single value per column from the result, in the end creating 
a new Series indexed by the column labels. That matches your output, 
which has one entry each for "a" and "b". The docs suggest you can do 
more than that, also.
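
A quick way to check what .apply hands to the function is to record each 
argument as it arrives. This is only a sketch: the "record" helper and 
the "seen" list are names I made up for the illustration, and the 
DataFrame is the one from your post.

```python
import pandas as pd

t = pd.DataFrame([[4, 9]] * 3, columns=['a', 'b'])

seen = []  # hypothetical log of what .apply passed in

def record(arg):
    # Note the type, the label and the values of each argument.
    seen.append((type(arg).__name__, arg.name, arg.tolist()))
    return arg.sum()

result = t.apply(record)  # default axis=0: one call per column

for entry in seen:
    print(entry)   # each entry is a Series labelled 'a' or 'b'
print(result)      # a Series with one sum per column label
```

Each call receives a whole column as a Series whose name is the column 
label, which is why a lambda taking two arguments cannot work here.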

>I can't do lambda x,y: [x,y]
>because only 1 arg is passed so how does [4] generate [[ ]]

Because your lambda:

    lambda x: [x]

is passed the whole column, which is a Series (a list-like object). 
You're returning a single element list containing that Series.

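If you want to work on the values within each row instead, pass axis=1 
so that the lambda gets a row, and pick the values out by column label:

    t.apply(lambda row: [row['a']*2, row['b']*3], axis=1)

to get, for each row, the first column's value multiplied by 2 and the 
second's by 3.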

You might do better to write your lambda like this:

    lambda col: [col]

just so that it is clear that you're getting the whole column rather 
than some single element from it.

>Also - this:
> t.apply(lambda x: [x], axis=1)
>0    [[139835521287488, 139835521287488]]
>1    [[139835521287488, 139835521287488]]
>2    [[139835521287488, 139835521287488]]
>vey weird - what just happened??

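See the docs above for the effect the axis parameter has on _how_ .apply 
does its work. With axis=1 the lambda is passed each row rather than 
each column, which is why you now get one result per row label (0, 1, 
2) instead of one per column label.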
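
To see the effect of axis concretely, here is a small sketch using the 
DataFrame from your post. The names "by_column" and "by_row" are just 
ones I picked for the illustration.

```python
import pandas as pd

t = pd.DataFrame([[4, 9]] * 3, columns=['a', 'b'])

# axis=0 (the default): the function is called once per column.
by_column = t.apply(lambda col: col.sum())

# axis=1: the function is called once per row; the row is a Series
# indexed by the column labels.
by_row = t.apply(lambda row: row['a'] * 2 + row['b'] * 3, axis=1)

print(by_column)  # indexed by 'a' and 'b'
print(by_row)     # indexed by the row labels 0, 1, 2
```

The result is indexed by column labels in the first case and by row 
labels in the second, matching the two shapes of output you saw.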

>In addition, how do I filter data eg:  t[x] = t[x].apply(lambda x: x*72.72) I'd
>like to remove numeric -1 contained in the column output of t[x]. 'filter' only
>works with labels of indices, i can't do t[ t[x] != -1 ] because that will then
>generate all the rows and I have no idea how that will be translate to
>within a .apply(lambda x... (hence my Q on what's going on internally)

It looks like the .filter method accepts an items parameter indicating 
which axis labels to keep. Use axis=0 to filter on the rows instead of 
the columns. It selects on labels, not on cell values, so you would have 
to compute the labels to keep yourself. Maybe something shaped like 
this?

    t.filter(axis=0, items=[
        label for label in t.index if t.loc[label, x] != -1
        ]).apply(.....)

That looks pretty cumbersome. Here "t.index" supplies the row labels of 
"t", "t.loc[label, x]" gets the cell value to test against -1, and "x" 
is your column name.

I expect there's a cleaner way to do this.
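
For what it's worth, the boolean mask you mention, t[ t[x] != -1 ], does 
seem to be the usual way: it keeps only the rows where the test is true, 
not all of them. A minimal sketch, assuming some made up sample data 
containing -1 values and with "col" standing in for your "x":

```python
import pandas as pd

# Hypothetical sample data, not from the original post.
t = pd.DataFrame({'a': [4, -1, 4], 'b': [9, 9, -1]})
col = 'a'  # stands in for the column name "x" in the question

# Keep only the rows whose value in this column is not -1.
kept = t[t[col] != -1].copy()

# Scale the surviving values; arithmetic on a Series is elementwise,
# so this does the same job as .apply(lambda v: v * 72.72).
kept[col] = kept[col] * 72.72

print(kept)
```

The .copy() is there so that the assignment modifies a separate 
DataFrame rather than a view onto "t".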

>(could someone also tell me briefly the best way to use NNTP and filter
>out the SPAM - 'pan' and 'tin' don't work anymore afaik
>[eternal-september]  and I'm using slrn currently - the SLang regex is
>weird within the kill file - couldn't get it to work - wound up killing
>everything when I did
>Subject: [A-Z][A-Z][A-Z]+
>)

I confess I subscribe to the python-list mailing list, not the 
newsgroup. It has much much less spam, and the two are gatewayed so you 
can participate either way. For example, you've posted to the newsgroup 
and I'm seeing your post in the mailing list. Likewise my reply will be 
going to the mailing list and copied to the newsgroup.

Come on over to the mailing list. It is rumoured to be much quieter.

Cheers,
Cameron Simpson <cs at cskk.id.au>