Finding the best-matching word in a large vocabulary

Problem description

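I have a Pandas dataframe with two columns, Potential Word and Fixed Word. The Potential Word column contains words in different languages, both misspelled and correctly spelled, and the Fixed Word column contains the correct word for each Potential Word.

Some sample data is shown below.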

Potential Word	Fixed Word
Exemple	Example
pipol	People
pimple	Pimple
Iunik	unique

My vocab dataframe contains 600K unique rows.

My solution:

key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65
for each in df['Potential Word']:
    match_value = match(each,key) # match is a function that returns a 
    # similarity value of two strings
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each

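Problem

My code is slow: because it loops over the whole vocabulary, fixing each word takes a long time. When a word is missing from the vocabulary, fixing a sentence of 10 to 12 words takes almost 5 or 6 seconds. The match function itself performs well, so it is not the target of the optimization.

I need an optimized solution. Please help me here.

Solution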

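From an Information Retrieval (IR) perspective, you need to reduce the search space. Matching given_word (as the key) against every Potential Word is definitely inefficient. Instead, you should match it against a reasonable number of candidates.

To find such candidates, you need to index the potential words and the fixed words.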

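import os

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

# create_in expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)
# Words are indexed as space-separated characters, so each character is a token
ix = create_in("indexdir", Schema(
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
))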
writer = ix.writer()
writer.add_document(potential='E x e m p l e',fixed='Example')
writer.add_document(potential='p i p o l',fixed='People')
writer.add_document(potential='p i m p l e',fixed='Pimple')
writer.add_document(potential='l u n i k',fixed='unique')
writer.commit()

With this index, you can search for candidates.

from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    results = searcher.search(SimpleParser('potential',ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)

The output is

<Hit {'fixed': 'People','potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple','potential': 'p i m p l e'}>

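Now you can run match on given_word against only a few candidates instead of all 600K.

It is not perfect, but this is an unavoidable trade-off and it is how IR fundamentally works. Experiment with different numbers of candidates.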
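
To make the candidate-then-match idea concrete without depending on Whoosh, here is a minimal pure-Python sketch. It swaps the Whoosh index for a simple character-bigram inverted index, and it uses difflib.SequenceMatcher as a stand-in for your match function. The names bigrams, build_index, best_fix and n_candidates are assumptions made for illustration, not part of the original code.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def bigrams(word):
    """Return the set of character bigrams of a lowercased, padded word."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(vocab):
    """Map each bigram to the set of potential words that contain it."""
    index = defaultdict(set)
    for potential in vocab:
        for gram in bigrams(potential):
            index[gram].add(potential)
    return index


def match(w1, w2):
    # Stand-in similarity function (assumption); replace with your own match.
    return SequenceMatcher(None, w1.lower(), w2.lower()).ratio()


def best_fix(given_word, vocab, index, n_candidates=10, threshold=0.65):
    """Score only the n_candidates words sharing the most bigrams with given_word."""
    overlap = Counter()
    for gram in bigrams(given_word):
        for potential in index.get(gram, ()):
            overlap[potential] += 1
    best_word, best_score = "", 0.0
    for potential, _ in overlap.most_common(n_candidates):
        score = match(potential, given_word)
        if score > best_score and score > threshold:
            best_word, best_score = vocab[potential], score
    return best_word


# Potential Word -> Fixed Word, using the sample rows from the question
vocab = {"Exemple": "Example", "pipol": "People", "pimple": "Pimple", "lunik": "unique"}
index = build_index(vocab)
print(best_fix("pipal", vocab, index))
```

The index is built once, so each query costs a few dictionary lookups plus n_candidates calls to match instead of 600K. A larger n_candidates trades speed for recall.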

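I won't change your implementation much, since I think iterating over the list of potential words for each word is necessary to some extent.

My aim here is not to optimize the match function itself, but to use multiple threads to search in parallel.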

import concurrent.futures
import time
from concurrent.futures.thread import ThreadPoolExecutor
from typing import Any,Union,Iterator

import pandas as pd

# Replace your dataframe here for testing this

df = pd.DataFrame({'Potential Word': ["a","b","c"],"Fixed Word": ["a","c","b"]})

# Replace by your match function

def match(w1,w2):
    # Simulate some work is happening here
    time.sleep(1)
    return 1

# This is mostly your function itself
# Using index to recreate the sentence from the returned values
def matcher(idx,given_word):
    key = given_word
    glob_match_value = 0
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        match_value = match(each,key)  # match is a function that returns a
        # similarity value of two strings
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    # Return only after the whole vocabulary has been scanned; an empty
    # string is the default when nothing passes the threshold (you might
    # want to change this)
    return idx,potential_fixed_word


sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a threadpool executor 
# You can increase or decrease the max_workers based on your machine
executor: Union[ThreadPoolExecutor,Any]
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures: Iterator[Union[str,Any]] = executor.map(matcher,list(range(len(sentence.split()))),sentence.split())

# Joining back the input sentence
out_sentence = " ".join(x[1] for x in sorted(futures,key=lambda x: x[0]))
print(out_sentence)
print(time.time() - start)

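Note that the running time of this approach depends on

the longest time taken by a single match call
the number of words in the sentence
the number of worker threads (hint: try using as many as there are words in the sentence)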
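
One detail that may simplify the code above: Executor.map yields results in the order of its inputs rather than in completion order, so the index bookkeeping and the final sort are not strictly required. The sketch below illustrates this with a hypothetical slow_upper function standing in for the per-word fix.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_upper(word):
    # Hypothetical stand-in for a per-word fix. Shorter words sleep for less
    # time, so they finish before longer words submitted earlier.
    time.sleep(0.01 * len(word))
    return word.upper()


words = "similarity value of two strings".split()

# One worker per word, as suggested in the hint above
with ThreadPoolExecutor(max_workers=len(words)) as executor:
    # map returns results in input order, not completion order
    fixed = list(executor.map(slow_upper, words))

print(" ".join(fixed))
```

Keep in mind that threads mainly help when match waits on I/O or releases the GIL. If match is CPU-bound pure Python, concurrent.futures.ProcessPoolExecutor offers the same map interface and may be worth trying.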

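I would use the sortedcollections module. In general, access to a SortedList or SortedDict takes O(log(n)) time rather than O(n); in your case, that is about 19 if/then checks (log2(600,000) ≈ 19.1946) versus 600,000 if/then checks.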

from sortedcollections import SortedDict
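
As a rough illustration of the O(log(n)) lookup, here is a sketch that uses the standard-library bisect module on a sorted list instead of sortedcollections, so it runs without extra dependencies. The exact_lookup helper and the sample pairs are assumptions for illustration. Note that a sorted lookup like this finds exact matches only, so it would serve as a fast path for words already in the vocabulary, not as a replacement for fuzzy matching.

```python
import math
from bisect import bisect_left

# Sample (potential word, fixed word) pairs, sorted by potential word
pairs = [("pipol", "People"), ("Exemple", "Example"),
         ("pimple", "Pimple"), ("lunik", "unique")]
pairs.sort(key=lambda pair: pair[0])
potentials = [potential for potential, _ in pairs]


def exact_lookup(word):
    """Binary-search the sorted potential words; return the fixed word or None."""
    pos = bisect_left(potentials, word)
    if pos < len(potentials) and potentials[pos] == word:
        return pairs[pos][1]
    return None


print(exact_lookup("pipol"))   # found by binary search
print(exact_lookup("pipal"))   # not in the vocabulary
# Number of halving steps needed for 600,000 sorted entries
print(round(math.log2(600_000), 4))
```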
