Problem description
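What is the fastest way to count the number of two-letter pairs (i.e. AA, AB, AC, etc.) in a string? Can numpy be used to speed up this calculation?
I am using a list comprehension with str.count(), but it is slow.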
import itertools
import numpy as np
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFlrsNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINgalYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars,chars)]
print(pairs[:10])
print(len(pairs))
['AA','AC','AD','AE','AF','AG','AH','AI','AK','AL']
400
%timeit counts = np.array([seq.count(pair) for pair in pairs])
231 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs,10000 loops each)
print(counts[:10])
[0,1,0]
Solution
If you don't mind getting the counts in a dictionary, the Counter class from collections is 2-3 times faster:
from collections import Counter
chars = set('ACDEFGHIKLMNPQRSTVWY')
counts = Counter( a+b for a,b in zip(seq,seq[1:]) if a in chars and b in chars)
print(counts)
Counter({'RS': 4,'VV': 4,'SI': 4,'MR': 3,'SG': 3,'LL': 3,'LS': 3,'PL': 3,'IE': 3,'DI': 3,'IA': 3,'AN': 3,'VK': 3,'KE': 3,'EV': 3,'TS': 3,'NL': 2,'LA': 2,'IP': 2,'AR': 2,'SK': 2,...
This approach correctly counts runs of the same character repeated 3 or more times (i.e. "WWW" counts as 2 for "WW", whereas seq.count() or re.findall() would count only 1).
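The difference in overlap handling can be checked on a small string. This is a minimal sketch; the helper name `count_overlapping_pairs` is hypothetical and used only for illustration:

```python
import re
from collections import Counter

def count_overlapping_pairs(s):
    # Slide a window of width 2 over the string, one position at a time,
    # so adjacent windows share a character.
    return Counter(a + b for a, b in zip(s, s[1:]))

s = 'WWW'
overlapping = count_overlapping_pairs(s)['WW']  # windows start at positions 0 and 1
non_overlapping = s.count('WW')                 # str.count() resumes after each match
regex_matches = len(re.findall('WW', s))        # re.findall() is also non-overlapping
print(overlapping, non_overlapping, regex_matches)
```

If non-overlapping counts are what you actually need, the two approaches are not interchangeable for sequences that contain such runs.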
Keep in mind that the Counter dictionary returns zero for counts['LC'], but counts.items() will not contain 'LC' or any other pair that does not actually occur in the string.
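A short illustration of that behaviour, using a five-letter prefix of the sequence as an assumed toy input:

```python
from collections import Counter

toy = 'MRNLA'  # toy input for illustration; its pairs are MR, RN, NL, LA
counts = Counter(a + b for a, b in zip(toy, toy[1:]))

missing = counts['LC']        # a lookup on an absent key yields 0, no KeyError
present = 'LC' in counts      # the lookup above does not insert the key
stored_pairs = sorted(counts) # only pairs that were actually seen
print(missing, present, stored_pairs)
```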
If needed, you can get the counts for all theoretical pairs in a second step:
from itertools import product
chars = 'ACDEFGHIKLMNPQRSTVWY'
print([counts[a+b] for a,b in product(chars,chars)][:10])
[1,1,1]
There is a numpy function, np.char.count(), but it appears to be much slower than str.count().
%timeit counts = np.array([np.char.count(seq,pair) for pair in pairs])
1.79 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)
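For reference, numpy can also count pairs without calling a string search once per pair. The following is only a sketch of one such approach, not something taken from the answers here: it maps each character to an index and counts the pair codes with np.bincount. The function name `count_pairs_bincount` is hypothetical, the input is assumed to be ASCII, and the counts are overlapping (like the Counter approach, unlike str.count()). Whether it is actually faster depends on the sequence length and would have to be measured.

```python
import numpy as np

chars = 'ACDEFGHIKLMNPQRSTVWY'

def count_pairs_bincount(seq, chars=chars):
    """Return overlapping pair counts, ordered like itertools.product(chars, chars)."""
    n = len(chars)
    # Lookup table: byte value -> index in chars, or -1 for any other character.
    lookup = np.full(256, -1, dtype=np.int64)
    lookup[np.frombuffer(chars.encode('ascii'), dtype=np.uint8)] = np.arange(n)
    # Translate the whole sequence to indices in one vectorized step.
    idx = lookup[np.frombuffer(seq.encode('ascii'), dtype=np.uint8)]
    first, second = idx[:-1], idx[1:]
    # Keep only windows where both characters belong to the alphabet.
    valid = (first >= 0) & (second >= 0)
    codes = first[valid] * n + second[valid]
    return np.bincount(codes, minlength=n * n)

counts = count_pairs_bincount('WWWAC')
print(counts.reshape(len(chars), len(chars)).sum())
```

The result is a flat array of length 400; reshaping it to 20 x 20 gives a matrix indexed by (first letter, second letter).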
Since speed is critical, here is a comparison of the different approaches:
import numpy as np
import itertools
from collections import Counter
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars,chars)]
def countpairs1():
    return np.array([seq.count(pair) for pair in pairs])
%timeit counts = countpairs1()
144 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs,10000 loops each)
def countpairs2():
    counted = Counter(a+b for a,b in zip(seq,seq[1:]))
    return np.array([counted[pair] for pair in pairs])
%timeit counts = countpairs2()
102 µs ± 729 ns per loop (mean ± std. dev. of 7 runs,10000 loops each)
def countpairs3():
    return np.array([np.char.count(seq,pair) for pair in pairs])
%timeit counts = countpairs3()
1.65 ms ± 4.62 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)
Clearly, the best/fastest approach of the three is Counter.