help: pandas and 2d table
jak
nospam at please.ty
Mon Apr 15 02:05:18 EDT 2024
Stefan Ram wrote:
> jak <nospam at please.ty> wrote or quoted:
>> Stefan Ram wrote:
>>> df = df.where( df == 'zz' ).stack().reset_index()
>>> result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
>> Since I don't know Pandas, I will need a month at least to understand
>> these 2 lines of code. Thanks again.
>
> Here's a technique to better understand such code:
>
> Transform it into a program with small statements and small
> expressions with no more than one call per statement if possible.
> (After each little change check that the output stays the same.)
>
> import pandas as pd
>
> # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
> with open( 'file_20240412201813_tmp_DML.csv', 'w' ) as out:
>     print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
> foo1,aa,ab,zz,ad,ae,af
> foo2,ba,bb,bc,bd,zz,bf
> foo3,ca,zz,cc,cd,ce,zz
> foo4,da,db,dc,dd,de,df
> foo5,ea,eb,ec,zz,ee,ef
> foo6,fa,fb,fc,fd,fe,ff''', file=out )
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
>
> selection = df.where( df == 'zz' )
> selection_stack = selection.stack()
> df = selection_stack.reset_index()
> df0 = df.iloc[ :, 0 ]
> df1 = df.iloc[ :, 1 ]
> z = zip( df0, df1 )
> l = list( z )
> result ={ 'zz': l }
> print( result )
>
> Next, I suggest inserting print statements to print each
> intermediate value:
>
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
> print( 'df = \n', type( df ), ':\n"', df, '"\n' )
>
> selection = df.where( df == 'zz' )
> print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
> selection, '"\n' )
>
> selection_stack = selection.stack()
> print( 'result of stack() = \n', type( selection_stack ), ':\n"',
> selection_stack, '"\n' )
>
> df = selection_stack.reset_index()
> print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )
>
> df0 = df.iloc[ :, 0 ]
> print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )
>
> df1 = df.iloc[ :, 1 ]
> print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )
>
> z = zip( df0, df1 )
> print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )
>
> l = list( z )
> print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )
>
> result ={ 'zz': l }
> print( "value of { 'zz': l }= \n", type( result ), ':\n"',
> result, '"\n' )
>
> print( result )
>
> Now you can see what each single step does!
>
> df =
> <class 'pandas.core.frame.DataFrame'> :
> "      foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1   aa   ab   zz   ad   ae   af
> foo2   ba   bb   bc   bd   zz   bf
> foo3   ca   zz   cc   cd   ce   zz
> foo4   da   db   dc   dd   de   df
> foo5   ea   eb   ec   zz   ee   ef
> foo6   fa   fb   fc   fd   fe   ff "
>
> result of where( df == 'zz' ) =
> <class 'pandas.core.frame.DataFrame'> :
> "      foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1  NaN  NaN   zz  NaN  NaN  NaN
> foo2  NaN  NaN  NaN  NaN   zz  NaN
> foo3  NaN   zz  NaN  NaN  NaN   zz
> foo4  NaN  NaN  NaN  NaN  NaN  NaN
> foo5  NaN  NaN  NaN   zz  NaN  NaN
> foo6  NaN  NaN  NaN  NaN  NaN  NaN "
>
> result of stack() =
> <class 'pandas.core.series.Series'> :
> " obj
> foo1  foo3    zz
> foo2  foo5    zz
> foo3  foo2    zz
>       foo6    zz
> foo5  foo4    zz
> dtype: object "
>
> result of reset_index() =
> <class 'pandas.core.frame.DataFrame'> :
> "     obj level_1   0
> 0  foo1    foo3  zz
> 1  foo2    foo5  zz
> 2  foo3    foo2  zz
> 3  foo3    foo6  zz
> 4  foo5    foo4  zz "
>
> value of .iloc[ :, 0 ]=
> <class 'pandas.core.series.Series'> :
> " 0 foo1
> 1 foo2
> 2 foo3
> 3 foo3
> 4 foo5
> Name: obj, dtype: object "
>
> value of .iloc[ :, 1 ] =
> <class 'pandas.core.series.Series'> :
> " 0 foo3
> 1 foo5
> 2 foo2
> 3 foo6
> 4 foo4
> Name: level_1, dtype: object "
>
> result of zip( df0, df1 )=
> <class 'zip'> :
> " <zip object at 0x000000000B3B9548>"
>
> result of list( z )=
> <class 'list'> :
> " [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"
>
> value of { 'zz': l }=
> <class 'dict'> :
> " {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"
>
> {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}
>
> The script reads a CSV file and stores the data in a Pandas
> DataFrame object named "df". The "index_col=0" parameter tells
> Pandas to use the first column as the index for the DataFrame,
> which labels the rows much as the header line labels the columns.
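
To see what "index_col=0" changes, the following minimal sketch reads the same table twice, once without and once with that parameter. It assumes the table is held in an in-memory string (the name CSV_TEXT is made up for this illustration) instead of the temporary file used in the post.

```python
import io
import pandas as pd

# Same table as in the post, kept in memory so that no file is written.
# CSV_TEXT is a hypothetical name used only for this illustration.
CSV_TEXT = """obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff
"""

# Without index_col, 'obj' is an ordinary data column and the rows
# get a default integer index.
plain = pd.read_csv(io.StringIO(CSV_TEXT))

# With index_col=0, the values of the first column become the row labels.
labelled = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

print(plain.shape, list(plain.index))
print(labelled.shape, list(labelled.index))
```

The first frame has seven columns and a numeric index; the second has six columns and the "obj" values as row labels, which is what lets the later steps report cell positions by name.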
>
> The "where" creates a new DataFrame selection that contains
> the same data as df, but with all values replaced by NaN (Not
> a Number) except for the values that are equal to 'zz'.
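
The "where" step can be tried in isolation. This is a small sketch on a made-up 2x2 table (the labels r1, r2, c1, c2 are hypothetical and not from the post); it shows the Boolean mask produced by the comparison and the frame that "where" builds from it.

```python
import pandas as pd

# Hypothetical 2x2 table, only for illustration.
df = pd.DataFrame(
    {"c1": ["aa", "zz"], "c2": ["zz", "bb"]},
    index=["r1", "r2"],
)

# Comparing a DataFrame with a scalar gives a Boolean DataFrame
# of the same shape.
mask = df == "zz"

# where() keeps a value where the mask is True and puts a missing
# value (NaN) everywhere else.
selection = df.where(mask)

print(mask)
print(selection)
```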
>
> "stack" returns a Series with a multi-level index created
> by pivoting the columns. Here it gives a Series with the
> row-col-addresses of all the non-NaN values. "stack" is
> probably the most complex operation in this script; its
> general meaning is explained in the pandas manual (see there).
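
A tiny sketch may make "stack" less mysterious. It uses a hypothetical 2x2 frame that already contains missing values, as the result of "where" does. One assumption to flag: whether "stack" itself discards missing values may depend on the pandas version, so the sketch calls "dropna" explicitly; that call is harmless where "stack" has already dropped them.

```python
import pandas as pd

# Hypothetical frame standing in for the result of where():
# only two cells hold a value, the rest are missing.
selection = pd.DataFrame(
    {"c1": [None, "zz"], "c2": ["zz", None]},
    index=pd.Index(["r1", "r2"], name="obj"),
)

# stack() moves the column labels into a second index level, so each
# remaining value is addressed by a (row label, column label) pair.
# dropna() is an explicit safeguard (assumption: some pandas versions
# keep missing values in the stacked result).
stacked = selection.stack().dropna()

print(stacked)
print(list(stacked.index))
```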
>
> "reset_index" then just transforms this Series back into a
> DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
> first and second columns, respectively, of that DataFrame. These
> are then zipped to get the desired form as a list of pairs.
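
The last steps can also be checked on their own. The sketch below builds a small stacked Series by hand (three of the pairs from the post, typed in directly) and then compares the route taken in the post with a possible shortcut: iterating over a multi-level index already yields tuples. The shortcut is offered as an alternative to try, not as the method used in the post.

```python
import pandas as pd

# Hand-built stand-in for the stacked Series; the second index level
# is left unnamed, as it is after stack().
index = pd.MultiIndex.from_tuples(
    [("foo1", "foo3"), ("foo2", "foo5"), ("foo3", "foo2")],
    names=["obj", None],
)
stacked = pd.Series(["zz", "zz", "zz"], index=index)

# Route from the post: turn the index levels into columns, then zip
# the first two columns.
flat = stacked.reset_index()
pairs_via_zip = list(zip(flat.iloc[:, 0], flat.iloc[:, 1]))

# Alternative sketch: read the pairs straight from the index.
pairs_via_index = list(stacked.index)

print(list(flat.columns))
print(pairs_via_zip)
print(pairs_via_zip == pairs_via_index)
```

The column names printed first explain the "level_1" and "0" headers seen in the output above: they are the default names for an unnamed index level and an unnamed Series.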
>
And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow it easily because I have already written a version of this code
in C, without any kind of external library, that reads the .CSV version
of the table as data ( 234 lines of code :^/ ).
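
The C program is not shown in this thread. For comparison only, here is a rough library-free analogue in Python using the standard csv module; it is not a translation of that C code, and the names CSV_TEXT and find_cells are made up for this sketch.

```python
import csv
import io

# Same table as in the post, kept in memory (hypothetical name).
CSV_TEXT = """obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff
"""


def find_cells(text, target):
    """Return (row label, column label) pairs of cells equal to target."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)      # first line: 'obj' plus the column labels
    columns = header[1:]
    pairs = []
    for row in reader:
        row_label, values = row[0], row[1:]
        for column, value in zip(columns, values):
            if value == target:
                pairs.append((row_label, column))
    return pairs


result = {"zz": find_cells(CSV_TEXT, "zz")}
print(result)
```

The pairs come out row by row, the same order as in the pandas result quoted above.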