OT: Is there a name for this transformation?

Wed Jul 17 08:49:31 EDT 2019

On Wed, Jul 10, 2019 at 3:08 PM Peter J. Holzer <hjp-python at hjp.at> wrote:
>
> On 2019-07-10 08:57:29 -0400, kamaraju kusumanchi wrote:
> > Given a csv file with the following contents
> >
> > 20180701, A
> > 20180702, A, B
> > 20180703, A, B, C
> > 20180704, B, C
> > 20180705, C
> >
> > I would like to transform the underlying data into a dataframe such as
> >
> >     date,     A,     B,     C
> > 20180701,  True, False, False
> > 20180702,  True,  True, False
> > 20180703,  True,  True,  True
> > 20180704, False,  True,  True
> > 20180705, False, False,  True
> >
> > the idea is that the first field in each line of the csv is the row
> > index of the dataframe. The subsequent fields will be its column names
> > and the values in the dataframe tell whether that element is present
> > or not in the line.
> >
> > Is there a name for this transformation?
>
> This type of output is usually called a cross table, but I don't know
> whether this specific transformation has a name (if you had only one of
> A, B, and C per line it would be a kind of pivot operation).

Thanks for telling me about cross table. I found out about
cross-tabulation functionality in Pandas using pandas.crosstab() which
is described in
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

As for my original problem, I solved it as follows:

$cat data.csv
20180701, A
20180702, A, B
20180703, A, B, C
20180704, B, C
20180705, C

import pandas as pd
import numpy as np

# expand the data into two numpy arrays such as
# a = np.array(['20180701', '20180702', '20180702', '20180703',
'20180703', '20180703', '20180704', '20180704', '20180705'])
# b = np.array(['A', 'A', 'B', 'A', 'B', 'C', 'B', 'C', 'C'])

rows = []
cols = []

with open('data.csv') as fo:
    for line in fo:
        line = line.strip()
        elem = line.split(',')
        N = len(elem)
        rows += elem[0:1] * (N-1)
        cols += elem[1:]

a = np.array(rows)
b = np.array(cols)

df = pd.crosstab(a, b, rownames=['date']).astype('bool').reset_index()

which gives

print(df)
col_0      date      A      B      C
0      20180701   True  False  False
1      20180702   True   True  False
2      20180703   True   True   True
3      20180704  False   True   True
4      20180705  False  False   True

-- 
Kamaraju S Kusumanchi | http://raju.shoutwiki.com/wiki/Blog