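Problem description
Question
I want to group DataFrame entries into two-year intervals, join the column values with the delimiter "#", and join entries that fall in the same interval with the delimiter ";".
I previously did this by iterating through the years and creating a new DataFrame, but that is messy; I would prefer a vectorized solution.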
Example output:
patient_id 2004–2005_dx \
0 Z324563 None
1 Z273652 None#disorder of bone and cartilage
2006–2007_dx 2008–2009_dx \
0 None 725#polymyalgia rheumatica (CMS/HCC)
1 272.0#Pure hypercholesterolemia None
2010–2011_dx \
0 725#polymyalgia rheumatica (CMS/HCC)
1 446.5#Giant cell arteritis (CMS/HCC); 725#polymyalgia rheumatica (CMS/HCC)
2012–2013_dx 2014_dx \
0 427.31#Atrial fibrillation (CMS/HCC) H53.9#Visual disturbance
1 729.81#Swelling of limb None
unkNown_time_dx
0 None
1 None
What I have tried:
(self.data.groupby(["patient_id", pd.Grouper(freq="2Y", key="date")])
    .sum()
    .unstack(fill_value=""))
This produces:
dx_code dx_name
date 2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31 2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31
patient_id
Z273652 0 272.0 446.5 729.81725 disorder of bone and cartilage Pure hypercholesterolemia Giant cell arteritis (CMS/HCC) Swelling of limbpolymyalgia rheumatica (CMS/HCC)
Z324563 725 427.31725 H53.9 polymyalgia rheumatica (CMS/HCC) Atrial fibrillation (CMS/HCC)polymyalgia rheum... Visual disturbance
However, I can't seem to figure out how to merge the column values within the two groups.
Solution
OK, let's create the starting DataFrame:
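import pandas as pd
from io import StringIO

# Columns are separated by two or more spaces so that names containing single spaces stay intact.
content = """
dx_code  patient_id  dx_name  year
0  427.31  Z324563  Atrial fibrillation (CMS/HCC)  2012
1  H53.9  Z324563  Visual disturbance  2014
2  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4  None  Z273652  Disorder of bone and cartilage  2004
5  272.0  Z273652  Pure hypercholesterolemia  2006
6  729.81  Z273652  Swelling of limb  2012
7  446.5  Z273652  Giant cell arteritis (CMS/HCC)  2010
8  725  Z273652  Polymyalgia rheumatica (CMS/HCC)  2011
"""
# The unnamed first field of each row becomes the index.
# keep_default_na=False keeps "None" as a plain string so the later '#'.join works.
df = pd.read_csv(StringIO(content), sep=r'\s{2,}', engine='python', keep_default_na=False)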
print(df)
   dx_code  patient_id                           dx_name  year
0   427.31     Z324563     Atrial fibrillation (CMS/HCC)  2012
1    H53.9     Z324563                Visual disturbance  2014
2      725     Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3      725     Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4     None     Z273652    Disorder of bone and cartilage  2004
5    272.0     Z273652         Pure hypercholesterolemia  2006
6   729.81     Z273652                  Swelling of limb  2012
7    446.5     Z273652    Giant cell arteritis (CMS/HCC)  2010
8      725     Z273652  Polymyalgia rheumatica (CMS/HCC)  2011
Now define the bins:
import numpy as np
# b = [0, 2004, 2006, 2008, 2010, 2012, np.inf]  # you can write the list out by hand (I suggest starting with 0 and ending with np.inf)
b = list(range(2002, 2020, 2))  # or simply generate a wider range of edges
Then:
df_cut = df.assign(PopGroup=pd.cut(df.year,bins=b))
print(df_cut)
   dx_code  patient_id                           dx_name  year      PopGroup
0   427.31     Z324563     Atrial fibrillation (CMS/HCC)  2012  (2010, 2012]
1    H53.9     Z324563                Visual disturbance  2014  (2012, 2014]
2      725     Z324563  Polymyalgia rheumatica (CMS/HCC)  2009  (2008, 2010]
3      725     Z324563  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
4     None     Z273652    Disorder of bone and cartilage  2004  (2002, 2004]
5    272.0     Z273652         Pure hypercholesterolemia  2006  (2004, 2006]
6   729.81     Z273652                  Swelling of limb  2012  (2010, 2012]
7    446.5     Z273652    Giant cell arteritis (CMS/HCC)  2010  (2008, 2010]
8      725     Z273652  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
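The bins above are closed on the right, for example (2002, 2004], so 2003 and 2004 share a bin, whereas the question asks for windows such as 2004 to 2005. A minimal sketch of one way to get left-closed, labelled windows with pd.cut follows. The sample years, the edge range, and the label format (a plain hyphen plus a "_dx" suffix) are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample years, chosen only to illustrate the binning.
years = pd.Series([2004, 2005, 2006, 2009, 2011, 2014], name="year")

# Left edge of each two-year window; the last edge (2016) only closes the final bin.
edges = list(range(2004, 2018, 2))
labels = [f"{lo}-{lo + 1}_dx" for lo in edges[:-1]]

# right=False makes each bin [lo, lo + 2), so 2004 and 2005 land in the same bin.
interval = pd.cut(years, bins=edges, right=False, labels=labels)
print(interval.tolist())
```

With right=False the label for each bin can be derived directly from its left edge, which keeps the column names readable after pivoting.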
Let's join the dx_code and dx_name columns:
df_cut['DX_code_name'] = df_cut[['dx_code','dx_name']].agg('#'.join,axis=1)
print(df_cut)
   dx_code  patient_id  ...      PopGroup                          DX_code_name
0   427.31     Z324563  ...  (2010, 2012]  427.31#Atrial fibrillation (CMS/HCC)
1    H53.9     Z324563  ...  (2012, 2014]              H53.9#Visual disturbance
2      725     Z324563  ...  (2008, 2010]  725#Polymyalgia rheumatica (CMS/HCC)
3      725     Z324563  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
4     None     Z273652  ...  (2002, 2004]   None#Disorder of bone and cartilage
5    272.0     Z273652  ...  (2004, 2006]       272.0#Pure hypercholesterolemia
6   729.81     Z273652  ...  (2010, 2012]               729.81#Swelling of limb
7    446.5     Z273652  ...  (2008, 2010]  446.5#Giant cell arteritis (CMS/HCC)
8      725     Z273652  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
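The row-wise '#'.join above needs every value to be a string, so it raises an error if dx_code holds a real missing value rather than the text "None". As a hedged alternative, Series.str.cat joins two string columns element-wise and accepts a placeholder for missing values. The two-row frame below is hypothetical and exists only to show the call.

```python
import pandas as pd

# Hypothetical two-row frame with one missing code, for illustration only.
df_demo = pd.DataFrame({
    "dx_code": ["725", None],
    "dx_name": ["Polymyalgia rheumatica (CMS/HCC)", "Disorder of bone and cartilage"],
})

# str.cat concatenates element-wise; na_rep is substituted wherever a value is missing.
df_demo["DX_code_name"] = df_demo["dx_code"].str.cat(
    df_demo["dx_name"], sep="#", na_rep="None"
)
print(df_demo["DX_code_name"].tolist())
```

This also avoids calling a Python function once per row, which tends to matter on larger frames.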
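Finally, we use a pivot table:
# columns=['year'] gives one column per year; passing columns=['PopGroup'] instead
# would give one column per two-year bin.
table = pd.pivot_table(
    df_cut,
    values=['DX_code_name'],
    index=['patient_id'],
    columns=['year'],
    aggfunc=lambda x: '# '.join(x),
    fill_value=np.nan,
)
Let's take a look: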
table
DX_code_name
year 2004 2006 2009 2010 2011 2012 2014
patient_id
Z273652 None#Disorder of bone and cartilage 272.0#Pure hypercholesterolemia NaN 446.5#Giant cell arteritis (CMS/HCC) 725#Polymyalgia rheumatica (CMS/HCC) 729.81#Swelling of limb NaN
Z324563 NaN NaN 725#Polymyalgia rheumatica (CMS/HCC) NaN 725#Polymyalgia rheumatica (CMS/HCC) 427.31#Atrial fibrillation (CMS/HCC) H53.9#Visual disturbance
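The table above has one column per year and joins entries with "# ", while the question asks for one column per two-year window with entries joined by ";". The sketch below shows one way to get that shape. It swaps pd.cut for integer floor division to find each window's even start year, and the sample rows, the "interval" column name, and the label format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format rows mirroring the columns used in the answer.
rows = pd.DataFrame({
    "patient_id": ["Z273652", "Z273652", "Z273652", "Z324563"],
    "dx_code": ["446.5", "725", "272.0", "H53.9"],
    "dx_name": [
        "Giant cell arteritis (CMS/HCC)",
        "Polymyalgia rheumatica (CMS/HCC)",
        "Pure hypercholesterolemia",
        "Visual disturbance",
    ],
    "year": [2010, 2011, 2006, 2014],
})

# Join code and name with "#" using vectorized string concatenation.
rows["DX_code_name"] = rows["dx_code"] + "#" + rows["dx_name"]

# Floor each year to the even start of its two-year window, then build a label.
start = rows["year"] // 2 * 2
rows["interval"] = start.astype(str) + "-" + (start + 1).astype(str) + "_dx"

# One row per patient, one column per window; entries in the same cell are joined by "; ".
wide = rows.pivot_table(
    index="patient_id",
    columns="interval",
    values="DX_code_name",
    aggfunc=lambda s: "; ".join(s),
)
print(wide)
```

Cells with no diagnosis in a window come out as NaN; chaining .fillna("None") would mimic the placeholder shown in the question's example.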