InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
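I am using Pandas and want to remove rows that occur twice, where a row with the same pair in reverse order counts as a repeat. That is, I want to go from this...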
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
to this...
InteractorA InteractorB
AGAP002741 AGAP008026
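because for all intents and purposes they are the same.
Is there a built-in way to handle this?
Solution:
I ended up writing a hacky script that iterates over the rows, builds a key from the two IDs in each row, checks whether that pairing or its reverse has already been seen, and drops the row's index if so.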
import pandas as pd

checklist = set()       # pair keys seen so far, stored in both orders
indexes_to_drop = []    # index labels of rows that repeat an earlier pair

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # Test each key on its own: `(a or b) in checklist` only ever tests `a`.
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    else:
        checklist.add(check_string)
        checklist.add(check_string_rev)

# iterrows() yields index labels, so they can be passed to drop() directly.
no_dups = interactions.drop(indexes_to_drop)
print(no_dups.shape)
no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)
2017 edit: A few years on, and with a bit more experience, here is a more elegant solution for anyone looking for something similar:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
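For anyone who would rather not leave a helper column in the result, here is a rough sketch of the same idea as a standalone function. It is not from the original answer: the function name `drop_reversed_duplicates` and its parameters are made up for illustration, and it swaps the joined string for a two-column key built with `numpy.sort`. Keeping the two IDs as separate key columns also avoids a corner case of joined strings, where IDs of different lengths could concatenate to the same text.

```python
import numpy as np
import pandas as pd


def drop_reversed_duplicates(df, col_a="InteractorA", col_b="InteractorB"):
    """Keep the first row for each unordered (col_a, col_b) pair.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Sort the two values within each row so (x, y) and (y, x) look the same.
    ordered = np.sort(df[[col_a, col_b]].to_numpy(), axis=1)
    # Use the sorted values as a two-column key that shares df's index.
    key = pd.DataFrame(ordered, index=df.index)
    # duplicated() marks every repeat after the first occurrence of a key.
    return df[~key.duplicated()]


df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP028204", "AGAP028200", "AGAP028200",
                    "AGAP028200", "AGAP028194", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP003428", "AGAP011124", "AGAP004335",
                    "AGAP011356", "AGAP008414", "AGAP008026", "AGAP002741"],
})

result = drop_reversed_duplicates(df)
print(result)
```

On the sample data this drops row 7, the reversed copy of row 6, and leaves the original two columns untouched.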