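How do I unpivot a DataFrame read from an Excel sheet with multiple repeated column levels? Set a MultiIndex?

Question

The df read from an xlsx file with df = pd.read_excel('file.xlsx') looks like this: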

   Age Male Female Male.1 Female.1
0  NaN  Big  Small  Small      Big
1  1.0    2      3      2        3
2  2.0    3      4      3        4
3  3.0    4      5      4        5
df = pd.DataFrame({'Age':[np.nan,1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})

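Note that pandas appended the suffix .1 to the duplicated column names, which is not wanted. I would like to unstack/melt to get this, or something similar: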

    Age Gender  Size    [measure]
1   1   Male    Big     2
2   2   Male    Big     3
3   3   Male    Big     4
4   1   Female  Big     3
5   2   Female  Big     4
6   3   Female  Big     5
7   1   Male    Small   2
8   2   Male    Small   3
9   3   Male    Small   4
10  1   Female  Small   3
11  2   Female  Small   4
12  3   Female  Small   5
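
The suffixing of duplicated header names can be reproduced without an Excel file. The following is a minimal sketch that uses pd.read_csv on in-memory text as a stand-in for pd.read_excel (an assumption made only so the example is self-contained); the text in raw is a hypothetical CSV rendering of the sheet above.

```python
import io

import pandas as pd

# Hypothetical CSV rendering of the sheet: two header-like rows, then the data.
raw = (
    "Age,Male,Female,Male,Female\n"
    ",Big,Small,Small,Big\n"
    "1,2,3,2,3\n"
    "2,3,4,3,4\n"
    "3,4,5,4,5\n"
)

# With a single header row, the repeated names get numeric suffixes.
flat = pd.read_csv(io.StringIO(raw))
print(list(flat.columns))
# ['Age', 'Male', 'Female', 'Male.1', 'Female.1']
```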

Renaming the columns and unstacking gets close, but no cigar:

df= df.rename(columns={'Male.1': 'Male','Female.1':'Female'})
df= df.set_index(['Age']).unstack()

How do I set the first row as the second index level of the columns, as shown here? What am I missing?

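Solution

An alternative to .unstack() is .melt().

You can transpose the DataFrame with .T and take everything after the first row with .iloc[1:]. Then .rename the columns, .replace the .1 suffix using a regular expression, .melt the DataFrame and .sort_values: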

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age':[np.nan,1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
# transpose, drop the Age row, strip the .1 suffix, then melt and sort
df = (df.T.reset_index().iloc[1:]
      .rename({'index' : 'Gender',0 : 'Size'},axis=1)
      .replace(r'\.\d+$','',regex=True)
      .melt(id_vars=['Gender','Size'],value_name='[measure]',var_name='Age')
      .sort_values(['Size','Gender','Age'],ascending=[True,False,True])
      .reset_index(drop=True))
df = df[['Age','Gender','Size','[measure]']]
df
Out[41]: 
   Age  Gender   Size  [measure]
0    1    Male    Big          2
1    2    Male    Big          3
2    3    Male    Big          4
3    1  Female    Big          3
4    2  Female    Big          4
5    3  Female    Big          5
6    1    Male  Small          2
7    2    Male  Small          3
8    3    Male  Small          4
9    1  Female  Small          3
10   2  Female  Small          4
11   3  Female  Small          5
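
In the transpose approach above, the Age values come from the transposed row labels (1, 2, 3), which happen to match the ages in this sample. The following is a hedged variant that takes the labels from the Age column itself; the names ages, wide and tidy are illustrative, and the value column is called measure here rather than [measure].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 1, 2, 3],
    'Male': ['Big', 2, 3, 4],
    'Female': ['Small', 3, 4, 5],
    'Male.1': ['Small', 2, 3, 4],
    'Female.1': ['Big', 3, 4, 5],
})

# Real ages from the Age column, skipping the NaN in the size-label row.
ages = df['Age'].iloc[1:].astype(int).tolist()

# One row per original column: its name, its size label, then one value per age.
wide = df.drop(columns='Age').T.reset_index()
wide.columns = ['Gender', 'Size'] + ages
wide['Gender'] = wide['Gender'].str.replace(r'\.\d+$', '', regex=True)

tidy = (wide.melt(id_vars=['Gender', 'Size'], var_name='Age', value_name='measure')
            .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
            .reset_index(drop=True))
tidy = tidy[['Age', 'Gender', 'Size', 'measure']]
print(tidy)
```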

If possible, build the MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:

df = pd.read_excel('file.xlsx',header=[0,1],index_col=[0])
    
print (df)
Age Male Female  Male Female
     Big  Small Small    Big
1.0    2      3     2      3
2.0    3      4     3      4
3.0    4      5     4      5

print (df.columns)
MultiIndex([(  'Male',   'Big'),
            ('Female', 'Small'),
            (  'Male', 'Small'),
            ('Female',   'Big')],
           names=['Age', None])

print (df.index)
Float64Index([1.0,2.0,3.0],dtype='float64')

So DataFrame.unstack can be used:

df = (df.unstack()
        .rename_axis(['Gender','Size','Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5
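
To try the unstack step without an Excel file, here is a minimal sketch that builds the two-level columns by hand. The frame below is an assumed equivalent of what read_excel with header=[0,1] and index_col=[0] returns for this sheet.

```python
import pandas as pd

# Hand-built stand-in for the frame read with header=[0, 1], index_col=[0].
cols = pd.MultiIndex.from_tuples(
    [('Male', 'Big'), ('Female', 'Small'), ('Male', 'Small'), ('Female', 'Big')]
)
df = pd.DataFrame(
    [[2, 3, 2, 3], [3, 4, 3, 4], [4, 5, 4, 5]],
    index=[1.0, 2.0, 3.0],
    columns=cols,
)

# unstack() on a frame with a flat row index returns a Series whose index
# has three levels: both column levels followed by the row index.
out = (df.unstack()
         .rename_axis(['Gender', 'Size', 'Age'])
         .reset_index(name='measure'))
print(out)
```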

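If that is not possible, use:

You can create the MultiIndex with MultiIndex.from_arrays, removing the trailing . and digits with replace, then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt using the first column as the identifier, and finally set the new column names: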

df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$','',regex=True),df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
    Age  Gender   Size measure
0   1.0    Male    Big       2
1   2.0    Male    Big       3
2   3.0    Male    Big       4
3   1.0  Female  Small       3
4   2.0  Female  Small       4
5   3.0  Female  Small       5
6   1.0    Male  Small       2
7   2.0    Male  Small       3
8   3.0    Male  Small       4
9   1.0  Female    Big       3
10  2.0  Female    Big       4
11  3.0  Female    Big       5
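
The column-cleaning step on its own can be sketched as follows. The lists are typed in by hand to mirror the sample frame, and regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import numpy as np
import pandas as pd

# Column names as read by pandas, and the first data row holding the sizes.
names = pd.Index(['Age', 'Male', 'Female', 'Male.1', 'Female.1'])
sizes = [np.nan, 'Big', 'Small', 'Small', 'Big']

# Strip a trailing ".<digits>" suffix from each name.
clean = names.str.replace(r'\.\d+$', '', regex=True)
print(list(clean))
# ['Age', 'Male', 'Female', 'Male', 'Female']

# Pair each cleaned name with its size label to form two-level columns.
mi = pd.MultiIndex.from_arrays([clean, sizes])
print(mi.nlevels)
# 2
```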

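Alternatively, a solution with DataFrame.unstack is possible: set only the first column as the index with DataFrame.set_index, and name the resulting MultiIndex levels with rename_axis to get the new column names: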

df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$','',regex=True),df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender','Size','Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age measure
0     Male    Big  1.0       2
1     Male    Big  2.0       3
2     Male    Big  3.0       4
3   Female  Small  1.0       3
4   Female  Small  2.0       4
5   Female  Small  3.0       5
6     Male  Small  1.0       2
7     Male  Small  2.0       3
8     Male  Small  3.0       4
9   Female    Big  1.0       3
10  Female    Big  2.0       4
11  Female    Big  3.0       5
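
A hedged variant of the set_index and unstack route: moving Age into the index before building the MultiIndex means the columns never contain an ('Age', NaN) key. This is a sketch on the sample data, not the answer's exact code; sizes, body, genders and tidy are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 1, 2, 3],
    'Male': ['Big', 2, 3, 4],
    'Female': ['Small', 3, 4, 5],
    'Male.1': ['Small', 2, 3, 4],
    'Female.1': ['Big', 3, 4, 5],
})

# Size labels from the first row, excluding the Age cell.
sizes = df.iloc[0, 1:].tolist()

# Data rows only, with Age moved into the row index.
body = df.iloc[1:].set_index('Age')
genders = body.columns.str.replace(r'\.\d+$', '', regex=True)
body.columns = pd.MultiIndex.from_arrays([genders, sizes], names=['Gender', 'Size'])

# The levels are already named, so no rename_axis call is needed.
tidy = body.unstack().reset_index(name='measure')
print(tidy)
```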

Create MultiIndex columns by combining row 0 with the column names:

df.columns = pd.MultiIndex.from_arrays((df.columns,df.iloc[0]))
df.columns.names = ['gender','size']

df.columns

MultiIndex([(     'Age',     nan),
            (    'Male',   'Big'),
            (  'Female', 'Small'),
            (  'Male.1', 'Small'),
            ('Female.1',   'Big')],
           names=['gender', 'size'])

Now you can reshape and rename:

 (df
  .dropna()
  .melt([('Age',np.nan)],value_name='measure')
  .replace(r'\.\d+$','',regex=True)
  .rename(columns={("Age",np.nan) : "Age"}))

   Age  gender  size measure
0   1.0 Male    Big     2
1   2.0 Male    Big     3
2   3.0 Male    Big     4
3   1.0 Female  Small   3
4   2.0 Female  Small   4
5   3.0 Female  Small   5
6   1.0 Male    Small   2
7   2.0 Male    Small   3
8   3.0 Male    Small   4
9   1.0 Female  Big     3
10  2.0 Female  Big     4
11  3.0 Female  Big     5
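
The same result can also be reached without a MultiIndex at all: melt on the flat column names first, then derive gender and size from each original column name. This is a swapped-in technique rather than the method of the answer above, shown as a sketch on the sample data; the helper column named column is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 1, 2, 3],
    'Male': ['Big', 2, 3, 4],
    'Female': ['Small', 3, 4, 5],
    'Male.1': ['Small', 2, 3, 4],
    'Female.1': ['Big', 3, 4, 5],
})

# Lookup from original column name to its size label (first row, minus Age).
sizes = df.iloc[0].drop('Age')

# Melt the data rows while the column names are still flat and unique.
long = df.iloc[1:].melt(id_vars='Age', var_name='column', value_name='measure')

# Derive the two descriptive columns from the original column name.
long['size'] = long['column'].map(sizes)
long['gender'] = long['column'].str.replace(r'\.\d+$', '', regex=True)

tidy = long[['Age', 'gender', 'size', 'measure']]
print(tidy)
```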