How do I concatenate column values within DataFrame groups?

Problem description

Question

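I want to group DataFrame entries into two-year intervals, join the column values with the delimiter "#", and join the entries that fall in the same interval with the delimiter ";".

I previously did this by iterating through the years and creating a new DataFrame, but that is messy; I would prefer a vectorized solution.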

Desired output:

  patient_id                         2004–2005_dx  \
0    Z324563                                 None   
1    Z273652  None#disorder of bone and cartilage   

                      2006–2007_dx                          2008–2009_dx  \
0                             None  725#polymyalgia rheumatica (CMS/HCC)   
1  272.0#Pure hypercholesterolemia                                  None   

                                                                 2010–2011_dx  \
0                                        725#polymyalgia rheumatica (CMS/HCC)   
1  446.5#Giant cell arteritis (CMS/HCC); 725#polymyalgia rheumatica (CMS/HCC)   

                           2012–2013_dx                   2014_dx  \
0  427.31#Atrial fibrillation (CMS/HCC)  H53.9#Visual disturbance   
1               729.81#Swelling of limb                      None   

  unkNown_time_dx  
0            None  
1            None  

What I have tried

Following this answer, I have the following code:

(self.data.groupby(["patient_id", pd.Grouper(freq="2Y", key="date")])
          .sum()
          .unstack(fill_value=""))

which outputs the following:

              dx_code                                                                     dx_name                                                                                                                                    
date       2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31                      2004-12-31                 2006-12-31                        2010-12-31                                         2012-12-31          2014-12-31
patient_id                                                                                                                                                                                                                           
Z273652             0      272.0      446.5  729.81725             disorder of bone and cartilage  Pure hypercholesterolemia    Giant cell arteritis (CMS/HCC)   Swelling of limbpolymyalgia rheumatica (CMS/HCC)                    
Z324563                                 725  427.31725      H53.9                                                             polymyalgia rheumatica (CMS/HCC)  Atrial fibrillation (CMS/HCC)polymyalgia rheum...  Visual disturbance

However, I can't seem to figure out how to combine the column values within the groups.

Solution

OK, let's create the starting DataFrame:

content = """  dx_code  patient_id  dx_name  year
0  427.31  Z324563  Atrial fibrillation (CMS/HCC)  2012
1  H53.9  Z324563  Visual disturbance  2014
2  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4  None  Z273652  Disorder of bone and cartilage  2004
5  272.0  Z273652  Pure hypercholesterolemia  2006
6  729.81  Z273652  Swelling of limb  2012
7  446.5  Z273652  Giant cell arteritis (CMS/HCC)  2010
8  725  Z273652  Polymyalgia rheumatica (CMS/HCC)  2011
"""
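import pandas as pd
from io import StringIO
# Columns are separated by two or more spaces, which needs the regex-capable Python engine;
# keep_default_na=False keeps the literal "None" in dx_code as a string.
df = pd.read_csv(StringIO(content), sep=r'\s{2,}', engine='python', keep_default_na=False)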
print(df)

  dx_code patient_id                           dx_name  year
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012
1   H53.9    Z324563                Visual disturbance  2014
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4    None    Z273652    Disorder of bone and cartilage  2004
5   272.0    Z273652         Pure hypercholesterolemia  2006
6  729.81    Z273652                  Swelling of limb  2012
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011

Now, define the bins:

import numpy as np
# b = [0, 2004, 2006, 2008, 2010, 2012, np.inf]  # you can write the bin edges by hand (I suggest starting with 0 and finishing with np.inf)
b = list(range(2002, 2020, 2))  # or generate two-year edges over a wider range: 2002, 2004, ..., 2018

Then:

df_cut = df.assign(PopGroup=pd.cut(df.year,bins=b))
print(df_cut)
  dx_code patient_id                           dx_name  year      PopGroup
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012  (2010,2012]
1   H53.9    Z324563                Visual disturbance  2014  (2012,2014]
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009  (2008,2010]
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011  (2010,2012]
4    None    Z273652    Disorder of bone and cartilage  2004  (2002,2004]
5   272.0    Z273652         Pure hypercholesterolemia  2006  (2004,2006]
6  729.81    Z273652                  Swelling of limb  2012  (2010,2012]
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010  (2008,2010]
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011  (2010,2012]
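
Note that the intervals in PopGroup are right-closed, so 2011 and 2012 share a bin while 2010 is grouped with 2009. If the intended intervals are 2004–2005, 2006–2007 and so on, as the column names in the question suggest, one possible variation is to pass right=False together with explicit labels. This is only a sketch: the names edges, labels and groups, the sample years, and the label format are assumptions made for illustration.

```python
import pandas as pd

years = pd.Series([2004, 2005, 2006, 2009, 2011, 2014])

# Left edges of the two-year intervals; the final edge (2016) only closes the last bin.
edges = list(range(2004, 2018, 2))
labels = [f"{start}–{start + 1}_dx" for start in edges[:-1]]

# right=False makes every bin [start, start + 2), so 2004 and 2005 fall together.
groups = pd.cut(years, bins=edges, right=False, labels=labels)
print(groups.astype(str).tolist())
```

With these edges, any year before 2004 or from 2016 onward would come back as NaN, so the range should be widened to cover the data.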

Let's join the dx_code and dx_name columns:

df_cut['DX_code_name'] = df_cut[['dx_code','dx_name']].agg('#'.join,axis=1)
print(df_cut)
  dx_code patient_id  ...      PopGroup                          DX_code_name
0  427.31    Z324563  ...  (2010,2012]  427.31#Atrial fibrillation (CMS/HCC)
1   H53.9    Z324563  ...  (2012,2014]              H53.9#Visual disturbance
2     725    Z324563  ...  (2008,2010]  725#Polymyalgia rheumatica (CMS/HCC)
3     725    Z324563  ...  (2010,2012]  725#Polymyalgia rheumatica (CMS/HCC)
4    None    Z273652  ...  (2002,2004]   None#Disorder of bone and cartilage
5   272.0    Z273652  ...  (2004,2006]       272.0#Pure hypercholesterolemia
6  729.81    Z273652  ...  (2010,2012]               729.81#Swelling of limb
7   446.5    Z273652  ...  (2008,2010]  446.5#Giant cell arteritis (CMS/HCC)
8     725    Z273652  ...  (2010,2012]  725#Polymyalgia rheumatica (CMS/HCC)
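
The row-wise '#'.join above only works when both columns hold strings. As a hedged alternative, assuming missing codes may arrive as real NaN values or as numbers, and that a missing code should show up as the text None as in the output above, the columns can be converted first and then concatenated directly. The frame demo and the variable code are made up for this sketch.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "dx_code": ["427.31", np.nan, 725],  # a string, a missing value, an integer
    "dx_name": [
        "Atrial fibrillation (CMS/HCC)",
        "Disorder of bone and cartilage",
        "Polymyalgia rheumatica (CMS/HCC)",
    ],
})

# Fill missing codes first, then cast to str so the concatenation is purely string-based.
code = demo["dx_code"].fillna("None").astype(str)
demo["DX_code_name"] = code + "#" + demo["dx_name"]
print(demo["DX_code_name"].tolist())
```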

Finally, we use a pivot table:

table = pd.pivot_table(df_cut, values=['DX_code_name'], index=['patient_id'], columns=['year'],
                       aggfunc=lambda x: '; '.join(x),  # entries sharing a cell are joined with "; ", as the question asks
                       fill_value=np.nan)

Let's take a look:

table
DX_code_name
year    2004    2006    2009    2010    2011    2012    2014
patient_id                          
Z273652 None#Disorder of bone and cartilage 272.0#Pure hypercholesterolemia NaN 446.5#Giant cell arteritis (CMS/HCC)    725#Polymyalgia rheumatica (CMS/HCC)    729.81#Swelling of limb NaN
Z324563 NaN NaN 725#Polymyalgia rheumatica (CMS/HCC)    NaN 725#Polymyalgia rheumatica (CMS/HCC)    427.31#Atrial fibrillation (CMS/HCC)    H53.9#Visual disturbance
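
The table above has one column per year, so the two-year grouping is not yet reflected in it. The following is a minimal sketch of one way to get one column per two-year interval with entries joined by "; ". It swaps pd.cut for plain integer arithmetic and pivot_table for groupby plus unstack; the frame records, the column period, and the label format are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

records = pd.DataFrame({
    "patient_id": ["Z273652", "Z273652", "Z273652", "Z324563"],
    "year": [2010, 2011, 2012, 2012],
    "DX_code_name": [
        "446.5#Giant cell arteritis (CMS/HCC)",
        "725#Polymyalgia rheumatica (CMS/HCC)",
        "729.81#Swelling of limb",
        "427.31#Atrial fibrillation (CMS/HCC)",
    ],
})

# Hypothetical labelling: round each year down to the even year that starts its interval.
start = records["year"] - records["year"] % 2
records["period"] = start.astype(str) + "–" + (start + 1).astype(str) + "_dx"

# Join all entries of one patient within one interval, then spread the intervals into columns.
wide = (records.groupby(["patient_id", "period"])["DX_code_name"]
               .agg("; ".join)
               .unstack())
print(wide)
```

Patients with no entry in an interval get NaN in that column; passing fill_value to unstack would replace it if a different placeholder is preferred.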