Problem description
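I have two pandas DataFrames, variants and phenotype. When mapping between them on the GENE column, the result should contain every row of the variants DataFrame plus a new column, HP-ID, whose values are joined with a pipe (|). Here are a few rows of each DataFrame: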
import pandas as pd
# variants
data_var = {'CHROM': ['Chr1','Chr11'],'START': [51937273,56867846],'GENE': ['KCNJ1','NPHS2'],'REF': ['C','G'],'ALT': ['T','A']}
variants = pd.DataFrame(data_var)
CHROM START GENE REF ALT
0 Chr1 51937273 KCNJ1 C T
1 Chr11 56867846 NPHS2 G A
# phenotype
data_phe = {'entrez-id': [3758]*5 + [7827]*4,
            'GENE': ['KCNJ1']*5 + ['NPHS2']*4,
            'HP-ID': ['HP:0002013','HP:0002007','HP:0001561','HP:0000256','HP:0001508','HP:0003774','HP:0003678','HP:0000093','HP:0003073'],
            'phenotype': ['Vomiting','Frontal bossing','Polyhydramnios','Macrocephaly','Failure to thrive','Stage 5 chronic kidney disease','Rapidly progressive','Proteinuria','Hypoalbuminemia']}
phenotype = pd.DataFrame(data_phe)
entrez-id GENE HP-ID phenotype
0 3758 KCNJ1 HP:0002013 Vomiting
1 3758 KCNJ1 HP:0002007 Frontal bossing
2 3758 KCNJ1 HP:0001561 Polyhydramnios
3 3758 KCNJ1 HP:0000256 Macrocephaly
4 3758 KCNJ1 HP:0001508 Failure to thrive
5 7827 NPHS2 HP:0003774 Stage 5 chronic kidney disease
6 7827 NPHS2 HP:0003678 Rapidly progressive
7 7827 NPHS2 HP:0000093 Proteinuria
8 7827 NPHS2 HP:0003073 Hypoalbuminemia
Desired output
CHROM START GENE REF ALT HP-ID
Chr1 51937273 KCNJ1 C T HP:0002013|HP:0002007|HP:0001561|HP:0000256|HP:0001508
Chr11 56867846 NPHS2 G A HP:0003774|HP:0003678|HP:0000093|HP:0003073
What I tried
from functools import reduce
data_frames = [variants, phenotype]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['GENE'], how='outer'), data_frames)
This prints every matching pair of variant and phenotype rows, one below the other, instead of a single row per variant.
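To illustrate why the rows end up stacked, here is a minimal sketch on a cut-down copy of the data (the shortened phenotype frame is an assumption made for brevity). Each gene occurs several times in phenotype, so the merge is one-to-many and emits one output row per matching HP-ID.

```python
import pandas as pd

variants = pd.DataFrame({
    'CHROM': ['Chr1', 'Chr11'],
    'START': [51937273, 56867846],
    'GENE': ['KCNJ1', 'NPHS2'],
    'REF': ['C', 'G'],
    'ALT': ['T', 'A'],
})

# Shortened phenotype table (assumption: only three of the nine rows)
phenotype = pd.DataFrame({
    'GENE': ['KCNJ1', 'KCNJ1', 'NPHS2'],
    'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0003774'],
})

# One-to-many merge: the KCNJ1 variant row is repeated for each of its HP-IDs
merged = pd.merge(variants, phenotype, on='GENE', how='outer')
print(len(variants), len(merged))  # 2 3
```

The row count grows from 2 to 3 here, so the HP-IDs have to be collapsed to one row per gene before merging.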
Solution
First aggregate the HP-ID values per gene with GroupBy.agg and '|'.join, then attach the result to variants with DataFrame.merge:
variants.merge(phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index(),on='GENE')
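The one-liner can be unpacked into a self-contained sketch like the one below. The intermediate name hp_ids and the how='left' argument are assumptions that are not part of the original answer. merge defaults to an inner join, which would drop any variant whose gene has no phenotype entry; a left join keeps such rows with NaN in HP-ID, which seems closer to the stated goal of printing all rows of variants.

```python
import pandas as pd

variants = pd.DataFrame({
    'CHROM': ['Chr1', 'Chr11'],
    'START': [51937273, 56867846],
    'GENE': ['KCNJ1', 'NPHS2'],
    'REF': ['C', 'G'],
    'ALT': ['T', 'A'],
})

phenotype = pd.DataFrame({
    'GENE': ['KCNJ1'] * 5 + ['NPHS2'] * 4,
    'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0001561', 'HP:0000256',
              'HP:0001508', 'HP:0003774', 'HP:0003678', 'HP:0000093',
              'HP:0003073'],
})

# Collapse phenotype to one row per gene, joining its HP-IDs with '|'
hp_ids = phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index()

# how='left' (an assumption, see above) keeps every variant row
result = variants.merge(hp_ids, on='GENE', how='left')
print(result)
```

With the sample data both genes have matches, so the inner and left joins give the same two rows.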