Improving string-matching performance in Python with FuzzyWuzzy / RapidFuzz

Problem description

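I have a very large dictionary that stores a large number of English sentences together with their Spanish translations. My original code is as follows: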

from fuzzywuzzy import process
sentencePairs = {'How are you?':'¿Cómo estás?','Good morning!':'¡Buenos días!'}
query= 'How old are you?'
match = process.extractOne(query,sentencePairs.keys())[0]
print(match,sentencePairs[match],sep='\n')

I then switched from fuzzywuzzy to RapidFuzz to get more speed. I also tried multithreading, but surprisingly it did not help much. My new code is as follows:

from rapidfuzz import process,utils,fuzz
from concurrent.futures import ThreadPoolExecutor
import time,string,random
random.seed(18)

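def findMatch(query,dictionary):
    # recent RapidFuzz releases return (match, score, index);
    # the [:2] slice keeps the two-value unpacking working
    match,score = process.extractOne(
       utils.default_process(query),dictionary.keys(),processor=None,scorer=fuzz.ratio)[:2]
    return (match,score)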

# make a dictionary for testing
d = {
    ''.join(random.choice(string.ascii_lowercase + string.digits)
       for _ in range(15)
    ): "spanish text"
    for s in range(1000000)
}

d['how are you?'] = '¿Cómo estás?'
# split the dictionary in half for multithreading
d1 = dict(list(d.items())[:len(d)//2])
d2 = dict(list(d.items())[len(d)//2:])

query= 'How old are you?'

# ---with multithreading---
start_time1 = time.time()
print('Start matching with multithreading...')

with ThreadPoolExecutor() as executor:
    future = executor.submit(findMatch,query,d1)
    match1,score1 = future.result()

with ThreadPoolExecutor() as executor:
    future = executor.submit(findMatch,query,d2)
    match2,score2 = future.result()

if score1 >= score2 and score1 > 70:
    print(match1,d[match1],sep=' - ')
elif score2 > score1 and score2 > 70:
    print(match2,d[match2],sep=' - ')
else:
    print('No match found.')

print('Time spent with multithreading: {}\n'.format(time.time() - start_time1))

# ---without multithreading---

start_time2 = time.time()
print('Start matching without multithreading...')

match,score = findMatch(query,d)
if score > 70:
    print(match,d[match],sep=' - ')

print('Time spent without multithreading: {}'.format(time.time() - start_time2))

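I expected multithreading to greatly reduce the matching time, but in practice the opposite happened. Is there a way to substantially reduce the matching time? Or am I using multithreading incorrectly?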
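
On the last question: in the threaded section above, each `future.result()` call comes immediately after its `submit`, and `result()` blocks until that task finishes, so the two halves are searched one after the other rather than at the same time. A minimal sketch of submitting every task before waiting on any of them follows. The names `find_match` and `find_match_threaded` are made up for illustration, and `difflib.SequenceMatcher` is only a standard-library stand-in for `fuzz.ratio` so that the sketch runs without extra packages.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def find_match(query, choices):
    """Return (best_choice, score on a 0-100 scale) for query.

    SequenceMatcher stands in for fuzz.ratio here (an assumption made so
    the sketch has no third-party dependency); it is much slower.
    """
    q = query.lower()
    best, best_score = None, -1.0
    for choice in choices:
        score = 100 * SequenceMatcher(None, q, choice).ratio()
        if score > best_score:
            best, best_score = choice, score
    return best, best_score


def find_match_threaded(query, chunks):
    """Search each chunk in its own task and return the overall best hit."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as executor:
        # Submit every task first so that all of them are in flight ...
        futures = [executor.submit(find_match, query, chunk) for chunk in chunks]
        # ... and only then block on the results.
        results = [future.result() for future in futures]
    return max(results, key=lambda item: item[1])


keys = ['how are you?', 'good morning!', 'see you later', 'how old is he?']
chunks = [keys[:2], keys[2:]]
match, score = find_match_threaded('How old are you?', chunks)
print(match, round(score, 1))
```

Even with both tasks in flight, threads in CPython share the global interpreter lock, so CPU-bound matching only speeds up if the scorer releases the lock while it works. Whether `process.extractOne` does so is not established here, so treat this pattern as a fix for the sequencing issue rather than a guaranteed speedup.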

Solution


Tags: fuzzywuzzy, multithreading, performance, python