help: pandas and 2d table

Mon Apr 15 02:05:18 EDT 2024

Stefan Ram wrote:
> jak <nospam at please.ty> wrote or quoted:
>> Stefan Ram wrote:
>>> df = df.where( df == 'zz' ).stack().reset_index()
>>> result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
>> Since I don't know Pandas, I will need a month at least to understand
>> these 2 lines of code. Thanks again.
> 
>    Here's a technique to better understand such code:
> 
>    Transform it into a program with small statements and small
>    expressions with no more than one call per statement if possible.
>    (After each little change, check that the output stays the same.)
> 
> import pandas as pd
> 
> # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
> with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
>      print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
> foo1,aa,ab,zz,ad,ae,af
> foo2,ba,bb,bc,bd,zz,bf
> foo3,ca,zz,cc,cd,ce,zz
> foo4,da,db,dc,dd,de,df
> foo5,ea,eb,ec,zz,ee,ef
> foo6,fa,fb,fc,fd,fe,ff''', file=out )
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
> 
> selection = df.where( df == 'zz' )
> selection_stack = selection.stack()
> df = selection_stack.reset_index()
> df0 = df.iloc[ :, 0 ]
> df1 = df.iloc[ :, 1 ]
> z = zip( df0, df1 )
> l = list( z )
> result ={ 'zz': l }
> print( result )
> 
>    Next, I suggest inserting print statements that show each
>    intermediate value:
> 
> # Note the "index_col=0" below, which is important here!
> df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
> print( 'df = \n', type( df ), ':\n"', df, '"\n' )
> 
> selection = df.where( df == 'zz' )
> print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
>    selection, '"\n' )
> 
> selection_stack = selection.stack()
> print( 'result of stack() = \n', type( selection_stack ), ':\n"',
>    selection_stack, '"\n' )
> 
> df = selection_stack.reset_index()
> print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )
> 
> df0 = df.iloc[ :, 0 ]
> print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )
> 
> df1 = df.iloc[ :, 1 ]
> print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )
> 
> z = zip( df0, df1 )
> print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )
> 
> l = list( z )
> print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )
> 
> result ={ 'zz': l }
> print( "value of { 'zz': l }= \n", type( result ), ':\n"',
>    result, '"\n' )
> 
> print( result )
> 
>    Now you can see what each single step does!
> 
> df =
>   <class 'pandas.core.frame.DataFrame'> :
> "      foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1   aa   ab   zz   ad   ae   af
> foo2   ba   bb   bc   bd   zz   bf
> foo3   ca   zz   cc   cd   ce   zz
> foo4   da   db   dc   dd   de   df
> foo5   ea   eb   ec   zz   ee   ef
> foo6   fa   fb   fc   fd   fe   ff "
> 
> result of where( df == 'zz' ) =
>   <class 'pandas.core.frame.DataFrame'> :
> "      foo1 foo2 foo3 foo4 foo5 foo6
> obj
> foo1  NaN  NaN   zz  NaN  NaN  NaN
> foo2  NaN  NaN  NaN  NaN   zz  NaN
> foo3  NaN   zz  NaN  NaN  NaN   zz
> foo4  NaN  NaN  NaN  NaN  NaN  NaN
> foo5  NaN  NaN  NaN   zz  NaN  NaN
> foo6  NaN  NaN  NaN  NaN  NaN  NaN "
> 
> result of stack() =
>   <class 'pandas.core.series.Series'> :
> " obj
> foo1  foo3    zz
> foo2  foo5    zz
> foo3  foo2    zz
>        foo6    zz
> foo5  foo4    zz
> dtype: object "
> 
> result of reset_index() =
>   <class 'pandas.core.frame.DataFrame'> :
> "     obj level_1   0
> 0  foo1    foo3  zz
> 1  foo2    foo5  zz
> 2  foo3    foo2  zz
> 3  foo3    foo6  zz
> 4  foo5    foo4  zz "
> 
> value of .iloc[ :, 0 ]=
>   <class 'pandas.core.series.Series'> :
> " 0    foo1
> 1    foo2
> 2    foo3
> 3    foo3
> 4    foo5
> Name: obj, dtype: object "
> 
> value of .iloc[ :, 1 ] =
>   <class 'pandas.core.series.Series'> :
> " 0    foo3
> 1    foo5
> 2    foo2
> 3    foo6
> 4    foo4
> Name: level_1, dtype: object "
> 
> result of zip( df0, df1 )=
>   <class 'zip'> :
> " <zip object at 0x000000000B3B9548>"
> 
> result of list( z )=
>   <class 'list'> :
> " [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"
> 
> value of { 'zz': l }=
>   <class 'dict'> :
> " {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"
> 
> {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}
> 
>    The script reads a CSV file and stores the data in a pandas
>    DataFrame object named "df". The "index_col=0" parameter tells
>    pandas to use the first column as the index of the DataFrame,
>    that is, as the row labels (the row-wise counterpart of the
>    column headers).
> 
>    The "where" call creates a new DataFrame, "selection", that has
>    the same shape and labels as df, but with every value replaced
>    by NaN (Not a Number) except for the values that are equal to 'zz'.
> 
>    "stack" returns a Series with a multi-level index created
>    by pivoting the columns. Here it gives a Series with the
>    row-col-addresses of all the non-NaN values. "stack" is
>    probably the most complex operation in this script; its general
>    behavior is explained in the pandas documentation.
> 
>    "reset_index" then just transforms this Series back into a
>    DataFrame, turning the two index levels into ordinary columns
>    ("obj" and "level_1" in the output above). ".iloc[ :, 0 ]" and
>    ".iloc[ :, 1 ]" are the first and second column, respectively,
>    of that DataFrame. These are then zipped to get the desired
>    form as a list of pairs.
> 
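
A side note on the stack() step quoted above: the result there contains
only the 'zz' cells because the NaN entries produced by where() were
dropped. The following is a minimal sketch, not part of the original
post, on a smaller made-up table (the 3x3 data and the names csv_text,
stacked and pairs are assumptions for illustration). It adds an explicit
dropna() as a precaution, in case the installed pandas version keeps NaN
entries when stacking, and it reads the (row, column) pairs straight from
the stacked index instead of going through reset_index() and zip().

```python
import io

import pandas as pd

# Hypothetical 3x3 table, smaller than the one in the post.
csv_text = """obj,foo1,foo2,foo3
foo1,aa,zz,ac
foo2,ba,bb,bc
foo3,zz,cb,zz
"""

# index_col=0 makes the first column the row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep the cells equal to 'zz'; every other cell becomes NaN.
selection = df.where(df == "zz")

# stack() moves the column labels into a second index level.
# dropna() is a precaution: it removes any NaN entries that the
# installed pandas version may have kept while stacking.
stacked = selection.stack().dropna()

# Each index entry of the stacked Series is a (row, column) tuple.
pairs = list(stacked.index)
result = {"zz": pairs}
print(result)
```

Whether this is clearer than the reset_index()/zip() route is a matter of
taste; both should give the same list of pairs for the same input.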

So this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All of this is really clear, and I was
able to follow it easily because I have already written a version of
this code in C, without any external library, that reads the CSV version
of the table as its data (234 lines of code :^/).
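
For comparison, here is a minimal sketch of the same lookup without
pandas, using only Python's standard csv module. It is not the C program
mentioned above; the function name find_cells and the in-memory csv_text
(the table from the quoted post) are assumptions for illustration.

```python
import csv
import io

# The table from the quoted post, held in memory instead of a file.
csv_text = """obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff
"""


def find_cells(lines, target):
    """Return (row_label, column_label) pairs of cells equal to target."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = header[1:]  # skip the 'obj' corner cell
    hits = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        row_label, cells = row[0], row[1:]
        for col_label, cell in zip(columns, cells):
            if cell == target:
                hits.append((row_label, col_label))
    return hits


result = {"zz": find_cells(io.StringIO(csv_text), "zz")}
print(result)
```

The nested loop visits the cells row by row, so the pairs come out in the
same order as in the pandas output quoted above.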