Pandas DataFrame: reshape row values into new columns (matrix-style format)

Problem description

I'm new to Pandas and am looking for advice on how to reshape a DataFrame.

Currently, I have a DataFrame like this:

panelist_id	type	type_count	refer_sm_count	refer_se_count	refer_non_n_count
1	HP	2	2	1	1
1	PB	1	0	1	0
1	TN	3	0	3	0
2	HP	1	1	0	0
2	PB	2	1	1	0

Ideally, I'd like my DataFrame to look like this:

panelist_id	type_HP_count	type_PB_count	type_TN_count	refer_sm_count_HP	refer_se_count_HP	refer_non_n_count_HP	refer_sm_count_PB	refer_se_count_PB	refer_non_n_count_PB	refer_sm_count_TN	refer_se_count_TN	refer_non_n_count_TN
1	2	1	3	2	1	0	0	1	0	0	0	0
2	1	2	0	1	0	0	1	1	0	0	0	0

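Basically, I need to turn the distinct row values in the 'type' column into new columns that show the count for each type. The next three columns of the original df, the ones whose names start with 'refer', need to be split out per distinct 'type', e.g. refer_sm_count_[type X, e.g. HP]. Any help would be much appreciated. Thanks.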

Solutions

Try it with the pivot_table() and rename_axis() methods:

out=(df.pivot_table(index='panelist_id',columns='type',fill_value=0)
      .rename_axis(columns=[None,None],index=None))

Finally, flatten the two column levels by calling map() on the .columns attribute:

out.columns=out.columns.map('_'.join)

Now if you print out, you'll get the output you want.
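
For reference, here is a minimal self-contained sketch of the two steps above. It assumes the sample data from the question; the DataFrame literal is reconstructed from the question's table and is not part of the answer. Note that joining the two column levels yields names of the form value_type, such as type_count_HP, rather than the type_HP_count spelling used in the question.

```python
import pandas as pd

# Assumption: sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "panelist_id": [1, 1, 1, 2, 2],
    "type": ["HP", "PB", "TN", "HP", "PB"],
    "type_count": [2, 1, 3, 1, 2],
    "refer_sm_count": [2, 0, 0, 1, 1],
    "refer_se_count": [1, 1, 3, 0, 1],
    "refer_non_n_count": [1, 0, 0, 0, 0],
})

# Pivot every remaining column by 'type'; missing (panelist, type)
# combinations become 0 instead of NaN.
out = (
    df.pivot_table(index="panelist_id", columns="type", fill_value=0)
      .rename_axis(columns=[None, None], index=None)
)

# Join the two column levels, e.g. ('type_count', 'HP') -> 'type_count_HP'.
out.columns = out.columns.map("_".join)

print(out.T)  # transposed only to keep the printout narrow
```

pivot_table aggregates with the mean by default, which is harmless here because each (panelist_id, type) pair occurs only once.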

Another option, using pivot_wider from pyjanitor:

new_df = df.pivot_wider(index='panelist_id',names_from='type',names_from_position='last',fill_value=0)

new_df:

panelist_id  type_count_HP  type_count_PB  type_count_TN  refer_sm_count_HP  refer_sm_count_PB  refer_sm_count_TN  refer_se_count_HP  refer_se_count_PB  refer_se_count_TN  refer_non_n_count_HP  refer_non_n_count_PB  refer_non_n_count_TN
          1              2              1              3                  2                  0                  0                  1                  1                  3                     1                     0                     0
          2              1              2              0                  1                  1                  0                  0                  1                  0                     0                     0                     0

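Complete working example:

import janitor  # registers pivot_wider as a DataFrame method
import pandas as pd

df = pd.DataFrame({
    'panelist_id': [1, 1, 1, 2, 2],
    'type': ['HP', 'PB', 'TN', 'HP', 'PB'],
    'type_count': [2, 1, 3, 1, 2],
    'refer_sm_count': [2, 0, 0, 1, 1],
    'refer_se_count': [1, 1, 3, 0, 1],
    'refer_non_n_count': [1, 0, 0, 0, 0]
})

new_df = df.pivot_wider(index='panelist_id', names_from='type', names_from_position='last', fill_value=0)

print(new_df.to_string(index=False))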

Adding one more option:

df = df.set_index(['panelist_id','type']).unstack(-1,fill_value=0)
df.columns = df.columns.map('_'.join)
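
A minimal sketch of this option on the question's sample data (the DataFrame literal is reconstructed from the question's table, and the names wide, dup and unstack_failed are illustrative only). It also shows one design consideration: unlike pivot_table, unstack does not aggregate, so it needs each (panelist_id, type) pair to be unique.

```python
import pandas as pd

# Assumption: sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "panelist_id": [1, 1, 1, 2, 2],
    "type": ["HP", "PB", "TN", "HP", "PB"],
    "type_count": [2, 1, 3, 1, 2],
    "refer_sm_count": [2, 0, 0, 1, 1],
    "refer_se_count": [1, 1, 3, 0, 1],
    "refer_non_n_count": [1, 0, 0, 0, 0],
})

# Move 'type' from the row index into the columns, filling gaps with 0.
wide = df.set_index(["panelist_id", "type"]).unstack(-1, fill_value=0)
wide.columns = wide.columns.map("_".join)
print(wide.T)

# Hypothetical case: the same (panelist_id, type) pair appears twice.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
try:
    dup.set_index(["panelist_id", "type"]).unstack(-1, fill_value=0)
    unstack_failed = False
except ValueError:
    # unstack cannot reshape an index that contains duplicate entries
    unstack_failed = True
print("unstack failed on duplicates:", unstack_failed)
```

If duplicates are possible in your data, one of the pivot_table variants with an explicit aggfunc is the safer choice.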

Use pivot_table to create MultiIndex columns:

df_p = df.pivot_table(index='panelist_id', columns='type', aggfunc='sum')

            refer_non_n_count           refer_se_count            \
type                       HP   PB   TN             HP   PB   TN   
panelist_id                                                        
1                         1.0  0.0  0.0            1.0  1.0  3.0   
2                         0.0  0.0  NaN            0.0  1.0  NaN   

            refer_sm_count           type_count            
type                    HP   PB   TN         HP   PB   TN  
panelist_id                                                
1                      2.0  0.0  0.0        2.0  1.0  3.0  
2                      1.0  1.0  NaN        1.0  2.0  NaN

If you do want to flatten the columns, then:

df_p.columns = ['_'.join(col) for col in df_p.columns.values]
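
If the exact column names and order from the question matter, the flattening step can use a naming rule instead of a plain join. The following is only a sketch under assumptions: flat_name is a hypothetical helper, the sample data is reconstructed from the question's table, and the rule (type_X_count for the type counts, value_X for everything else) is inferred from the desired header in the question.

```python
import pandas as pd

# Assumption: sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "panelist_id": [1, 1, 1, 2, 2],
    "type": ["HP", "PB", "TN", "HP", "PB"],
    "type_count": [2, 1, 3, 1, 2],
    "refer_sm_count": [2, 0, 0, 1, 1],
    "refer_se_count": [1, 1, 3, 0, 1],
    "refer_non_n_count": [1, 0, 0, 0, 0],
})

df_p = df.pivot_table(index="panelist_id", columns="type",
                      aggfunc="sum", fill_value=0)


def flat_name(value_col, type_name):
    """Hypothetical naming rule inferred from the question's desired header."""
    if value_col == "type_count":
        return f"type_{type_name}_count"
    return f"{value_col}_{type_name}"


# Each column label is a (value column, type) tuple.
df_p.columns = [flat_name(value, kind) for value, kind in df_p.columns]

# Reorder: all type counts first, then the three refer columns per type.
types = ["HP", "PB", "TN"]
refer_cols = ["refer_sm_count", "refer_se_count", "refer_non_n_count"]
order = [f"type_{t}_count" for t in types]
order += [f"{c}_{t}" for t in types for c in refer_cols]

result = df_p[order].reset_index()
print(result.T)
```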

First, import the libraries:

import numpy as np
import pandas as pd

Then, read your data:

data = pd.read_excel('base.xlsx')

Reshape your data with pivot_table:

data_reshaped = pd.pivot_table(data,values=['type_count','refer_sm_count','refer_se_count','refer_non_n_count'],index=['panelist_id'],columns=['type'],aggfunc=np.sum)

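However, the result has two-level column labels and panelist_id as the index, which is awkward to work with. So flatten the column names and reset the index: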

columns = [data_reshaped.columns[i][0] + '_' + data_reshaped.columns[i][1]
           for i in range(len(data_reshaped.columns))] # to create new columns names

data_reshaped.columns = columns # to assign new columns names to dataframe
data_reshaped.reset_index(inplace=True) # to reset index
data_reshaped.fillna(0,inplace=True) # to substitute nan to 0

After that, your data will be in the shape you want.
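
The steps of this answer can also be wrapped into one function and tried without an Excel file. This is a rough sketch, not the answerer's code: widen is a hypothetical name, the in-memory data stands in for base.xlsx, the string 'sum' replaces np.sum, and the final cast assumes every value column holds integer counts.

```python
import pandas as pd


def widen(data, index="panelist_id", columns="type"):
    """Hypothetical helper: pivot, flatten column names, fill gaps with 0."""
    reshaped = pd.pivot_table(data, index=index, columns=columns, aggfunc="sum")
    # Turn ('type_count', 'HP') into 'type_count_HP'.
    reshaped.columns = [f"{value}_{kind}" for value, kind in reshaped.columns]
    # Assumption: all value columns are integer counts, so casting is safe.
    return reshaped.fillna(0).astype(int).reset_index()


# Assumption: sample data rebuilt from the table in the question,
# used here in place of pd.read_excel('base.xlsx').
data = pd.DataFrame({
    "panelist_id": [1, 1, 1, 2, 2],
    "type": ["HP", "PB", "TN", "HP", "PB"],
    "type_count": [2, 1, 3, 1, 2],
    "refer_sm_count": [2, 0, 0, 1, 1],
    "refer_se_count": [1, 1, 3, 0, 1],
    "refer_non_n_count": [1, 0, 0, 0, 0],
})

data_reshaped = widen(data)
print(data_reshaped.T)
```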
