Problem description
I have a DataFrame read from an xlsx file:
df = pd.read_excel('file.xlsx')
It looks like this:
Age Male Female Male.1 Female.1
0 NaN Big Small Small Big
1 1.0 2 3 2 3
2 2.0 3 4 3 4
3 3.0 4 5 4 5
import numpy as np
import pandas as pd
df = pd.DataFrame({'Age':[np.nan,1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
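Note that pandas added the suffix .1 to the duplicated column names, which I do not want. I would like to unstack/melt to get this, or something similar: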
Age Gender Size [measure]
1 1 Male Big 2
2 2 Male Big 3
3 3 Male Big 4
4 1 Female Big 3
5 2 Female Big 4
6 3 Female Big 5
7 1 Male Small 2
8 2 Male Small 3
9 3 Male Small 4
10 1 Female Small 3
11 2 Female Small 4
12 3 Female Small 5
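The suffixing can be reproduced without an Excel file. This is a minimal sketch, assuming the same layout saved as CSV text (the inline `csv_text` is made up for illustration). `pd.read_csv` de-duplicates repeated header names the same way the question shows for `pd.read_excel`:

```python
import io
import pandas as pd

# Hypothetical CSV rendering of the sheet: a header row with repeated
# names, then a row of sizes, then the numeric rows.
csv_text = """Age,Male,Female,Male,Female
,Big,Small,Small,Big
1,2,3,2,3
2,3,4,3,4
3,4,5,4,5
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Repeated header names come back with a numeric suffix appended.
print(list(demo.columns))
```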
Renaming the columns and unstacking gets close, but no cigar:
df= df.rename(columns={'Male.1': 'Male','Female.1':'Female'})
df= df.set_index(['Age']).unstack()
How can I set the first row as the second level of the column index, as shown here? What am I missing?
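Solution
An alternative to .unstack() is .melt().
You can transpose the DataFrame with .T and use .iloc[1:] to keep everything after the first row. Then .rename the columns, .replace the .1 suffix with a regular expression, .melt the DataFrame and .sort_values.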
df = pd.DataFrame({'Age':[np.nan,1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
df = (df.T.reset_index().iloc[1:]
.rename({'index' : 'Gender',0 : 'Size'},axis=1)
.replace(r'\.\d+$','',regex=True)
.melt(id_vars=['Gender','Size'],value_name='[measure]',var_name='Age')
.sort_values(['Size','Gender','Age'],ascending=[True,False,True])
.reset_index(drop=True))
df = df[['Age','Gender','Size','[measure]']]
df
Out[41]:
Age Gender Size [measure]
0 1 Male Big 2
1 2 Male Big 3
2 3 Male Big 4
3 1 Female Big 3
4 2 Female Big 4
5 3 Female Big 5
6 1 Male Small 2
7 2 Male Small 3
8 3 Male Small 4
9 1 Female Small 3
10 2 Female Small 4
11 3 Female Small 5
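One thing to be aware of in the chain above: the Age row is dropped by .iloc[1:], so the Age column in the result is really the positional row label (1, 2, 3), which happens to equal the ages in this sample. The following is a sketch of a variant that looks the real Age values up instead. The helper name `melt_keeping_age` and the intermediate column `row` are made up for illustration, and a default 0..n-1 row index is assumed:

```python
import numpy as np
import pandas as pd

def melt_keeping_age(df):
    """Hypothetical helper: reshape to long form, looking Age up by row label."""
    ages = df['Age']  # maps row label -> age value (label 0 is the size row)
    long = (df.drop(columns='Age')
              .T.reset_index()
              .rename({'index': 'Gender', 0: 'Size'}, axis=1)
              .replace(r'\.\d+$', '', regex=True)   # 'Male.1' -> 'Male'
              .melt(id_vars=['Gender', 'Size'], var_name='row',
                    value_name='[measure]'))
    # Translate the positional row label back into the actual age.
    long['Age'] = long['row'].astype(int).map(ages)
    return (long[['Age', 'Gender', 'Size', '[measure]']]
            .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
            .reset_index(drop=True))

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})
out = melt_keeping_age(df)
print(out)
```

With the sample data the rows match the output above, except that Age is taken from the column and is therefore float.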
If possible, create a MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1],index_col=[0])
print (df)
Age Male Female Male Female
Big Small Small Big
1.0 2 3 2 3
2.0 3 4 3 4
3.0 4 5 4 5
print (df.columns)
MultiIndex([( 'Male','Big'),('Female','Small'),( 'Male','Small'),('Female','Big')],names=['Age',None])
print (df.index)
Float64Index([1.0,2.0,3.0],dtype='float64')
So DataFrame.unstack can be used:
df = (df.unstack()
.rename_axis(['Gender','Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
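To try the unstack step without the Excel file, the frame can be built in memory. This is a sketch that assumes read_excel produced exactly the columns and index printed above. The names `wide` and `long` are illustrative:

```python
import pandas as pd

# Two-level columns as printed above: (gender, size) pairs.
cols = pd.MultiIndex.from_tuples(
    [('Male', 'Big'), ('Female', 'Small'), ('Male', 'Small'), ('Female', 'Big')],
    names=['Age', None])
wide = pd.DataFrame([[2, 3, 2, 3],
                     [3, 4, 3, 4],
                     [4, 5, 4, 5]],
                    index=[1.0, 2.0, 3.0], columns=cols)

# unstack() turns the frame into a Series indexed by
# (column level 0, column level 1, row index); name the three levels.
long = (wide.unstack()
            .rename_axis(['Gender', 'Size', 'Age'])
            .reset_index(name='measure'))
print(long)
```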
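If that is not possible, use the following.
You can create the MultiIndex with MultiIndex.from_arrays, using replace to strip the trailing . and digits, then filter out the first row with DataFrame.iloc, apply DataFrame.melt with the first column as the id column, and finally set the new column names: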
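# regex=True is required in recent pandas, where str.replace is literal by default
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$','',regex=True),df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']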
print (df)
Age Gender Size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
Alternatively, a solution with DataFrame.unstack is possible: set only the first column as the index with DataFrame.set_index, and name the levels of the resulting MultiIndex with rename_axis:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$','',regex=True),df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
.unstack()
.rename_axis(['Gender','Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
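In both snippets the Age column ends up labelled ('Age', nan), and looking a column up by a key that contains NaN can be awkward. As a sketch of one way around that, not part of the answer above, Age can be set aside before the two-level header is built. The names `age`, `body` and `long` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

age = df['Age'].iloc[1:]             # ages of the data rows
body = df.drop(columns='Age')        # only the gender columns remain

# Level 0: header without the '.1' suffix; level 1: the first row (sizes).
body.columns = pd.MultiIndex.from_arrays(
    [body.columns.str.replace(r'\.\d+$', '', regex=True),
     body.iloc[0].tolist()],
    names=['Gender', 'Size'])

# Drop the size row, index the data rows by Age, then unstack to long form.
body = body.iloc[1:].set_axis(pd.Index(age, name='Age'), axis=0)
long = body.unstack().reset_index(name='measure')
print(long)
```

Because the level names are set up front, no rename_axis call is needed after unstack.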
Create MultiIndex columns by combining row 0 with the column names:
df.columns = pd.MultiIndex.from_arrays((df.columns,df.iloc[0]))
df.columns.names = ['gender','size']
df.columns
MultiIndex([( 'Age',nan),( 'Male','Big'),( 'Female','Small'),( 'Male.1','Small'),('Female.1','Big')],names=['gender','size'])
Now you can reshape and rename:
(df
.dropna()
.melt([('Age',np.nan)],value_name='measure')
.replace(r'\.\d+$','',regex=True)
.rename(columns={("Age",np.nan) : "Age"}))
Age gender size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
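A possible follow-up, not part of the answers above: since every original column mixed a size string with numbers, measure comes out as object dtype, and Age is float because of the NaN in the first row. The sketch below assumes all values are whole numbers and casts both columns. The small `long` frame is a hand-built stand-in for the first three rows of the result:

```python
import pandas as pd

# Hand-built stand-in for the first rows of the reshaped result.
long = pd.DataFrame({
    'Age': [1.0, 2.0, 3.0],
    'gender': ['Male'] * 3,
    'size': ['Big'] * 3,
    'measure': pd.Series([2, 3, 4], dtype=object),
})

# Cast to integers; this raises if a value is missing or not numeric.
tidy = long.astype({'Age': 'int64', 'measure': 'int64'})
print(tidy.dtypes)
```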