OT: Is there a name for this transformation?
kamaraju kusumanchi
raju.mailinglists at gmail.com
Wed Jul 17 08:49:31 EDT 2019
On Wed, Jul 10, 2019 at 3:08 PM Peter J. Holzer <hjp-python at hjp.at> wrote:
>
> On 2019-07-10 08:57:29 -0400, kamaraju kusumanchi wrote:
> > Given a csv file with the following contents
> >
> > 20180701, A
> > 20180702, A, B
> > 20180703, A, B, C
> > 20180704, B, C
> > 20180705, C
> >
> > I would like to transform the underlying data into a dataframe such as
> >
> > date, A, B, C
> > 20180701, True, False, False
> > 20180702, True, True, False
> > 20180703, True, True, True
> > 20180704, False, True, True
> > 20180705, False, False, True
> >
> > the idea is that the first field in each line of the csv is the row
> > index of the dataframe. The subsequent fields will be its column names
> > and the values in the dataframe tell whether that element is present
> > or not in the line.
> >
> > Is there a name for this transformation?
>
> This type of output is usually called a cross table, but I don't know
> whether this specific transformation has a name (if you had only one of
> A, B, and C per line it would be a kind of pivot operation).
Thanks for telling me about cross table. I found out about
cross-tabulation functionality in Pandas using pandas.crosstab() which
is described in
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
As for my original problem, I solved it as follows:
$cat data.csv
20180701, A
20180702, A, B
20180703, A, B, C
20180704, B, C
20180705, C
import pandas as pd
import numpy as np
# expand the data into two numpy arrays such as
# a = np.array(['20180701', '20180702', '20180702', '20180703',
'20180703', '20180703', '20180704', '20180704', '20180705'])
# b = np.array(['A', 'A', 'B', 'A', 'B', 'C', 'B', 'C', 'C'])
rows = []
cols = []
with open('data.csv') as fo:
for line in fo:
line = line.strip()
elem = line.split(',')
N = len(elem)
rows += elem[0:1] * (N-1)
cols += elem[1:]
a = np.array(rows)
b = np.array(cols)
df = pd.crosstab(a, b, rownames=['date']).astype('bool').reset_index()
which gives
print(df)
col_0 date A B C
0 20180701 True False False
1 20180702 True True False
2 20180703 True True True
3 20180704 False True True
4 20180705 False False True
--
Kamaraju S Kusumanchi | http://raju.shoutwiki.com/wiki/Blog
More information about the Python-list
mailing list