Parallelizing phonetic distance across all pairwise word combinations in a document

Problem description

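I am trying to compute the phonetic distance between every pair of words in a document. A typical document has about 30,000 unique words, which works out to roughly 500,000,000 pairwise combinations (n*(n-1)/2). I am using pyphonetics with RefinedSoundex to compute the distance; it is fairly quick and computes a single distance in a few microseconds. A full document therefore takes around five hours. I have been trying to parallelize this, but I cannot figure out why it isn't working. By default, without multiprocessing, I use a list comprehension. For multiprocessing, I have tried futures.ProcessPoolExecutor and ray, and both seem to do much worse than the list comprehension. I don't understand why they don't do better. My code is below; it uses only the text of this post as the benchmark input.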
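As a quick sanity check on those figures (a back-of-the-envelope sketch, separate from the benchmark code; the variable names are purely illustrative), the exact pair count and the per-pair time implied by a five-hour run can be computed directly:

```python
import math

n_words = 30_000
n_pairs = math.comb(n_words, 2)      # same as n * (n - 1) / 2
total_seconds = 5 * 3600             # the "around five hours" quoted above
per_pair_us = total_seconds / n_pairs * 1e6

print(f"{n_pairs:,} pairs")
print(f"{per_pair_us:.1f} microseconds per pair")
```

That is about 450 million pairs, and a five-hour run corresponds to roughly 40 microseconds per pair.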

import string
import time
from concurrent import futures
from itertools import combinations

import ray
from pyphonetics import RefinedSoundex

ray.init(num_cpus=6)  # start Ray with six CPUs
rs = RefinedSoundex()  # shared encoder used by both distance functions


def distance(a):
    # time.sleep(1)
    return rs.distance(a[0],a[1])


@ray.remote
def distance2(a):
    # time.sleep(1)
    return rs.distance(a[0],a[1])


def bench(f):
    s = "I am trying to calculate the phonetic distance between every word in a document. \
        My typical document is on the order of thirty thousand words,pairwise that is on \
        the order of five hundred million combinations to compute. I am using pyphonetics \
        and RefinedSoundex to compute the distance,which is rather quick and computes a \
        single distance in a few microseconds. A full document would then take around \
        five hours. I've been trying to parallelize this,but I can't figure out why it \
        isn't working. For default,no multiprocessing,I'm using list comprehension. \
        For multiprocessing,I've tried futures.ProcessPoolExecutor and ray,both seem \
        to do much worse than list comprehension. I don't understand why they don't do \
        better. My code is below,which uses just the text of this post to benchmark."

    s = s.translate(str.maketrans('','',string.punctuation)).split()
    c = combinations(s,2)
    start = time.time()
    f(c)
    elapsed = time.time() - start
    print(f"{f.__name__} completed in {elapsed} seconds")


def listcomp(c):
    [distance(a) for a in c]


def rayray(c):
    # ray.init(num_cpus=6)
    ray.get([distance2.remote(a) for a in c])


def concurrent(c):
    with futures.ProcessPoolExecutor() as executor:
        list(executor.map(distance,c))


bench(listcomp)
bench(concurrent)
bench(rayray)

Output

listcomp completed in 16.1796932220459 seconds
concurrent completed in 102.94504928588867 seconds
rayray completed in 163.66783547401428 seconds
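
One possible reading of these numbers, offered as an assumption rather than a measured diagnosis: both parallel versions submit every pair as its own task, so the cost of pickling the arguments and passing them between processes may outweigh a computation that takes only microseconds. The sketch below batches pairs so that each worker call handles many of them. `toy_distance` is a hypothetical stand-in for `rs.distance` so that the example runs without pyphonetics, and `chunked`, `distances_for_chunk` and `parallel_distances` are illustrative names, not part of any library.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations, islice


def toy_distance(a, b):
    """Hypothetical stand-in for rs.distance: mismatched positions plus length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def distances_for_chunk(pairs):
    """Compute the distances for one whole batch inside a single worker call."""
    return [toy_distance(a, b) for a, b in pairs]


def parallel_distances(words, workers=2, chunk_size=1000):
    """Send batches of pairs to worker processes; results keep the input order."""
    batches = chunked(combinations(words, 2), chunk_size)
    results = []
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for part in executor.map(distances_for_chunk, batches):
            results.extend(part)
    return results


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    print(len(parallel_distances(words, workers=2, chunk_size=10)))
```

A lighter variant of the same idea is the `chunksize` argument that `ProcessPoolExecutor.map` already accepts, for example `executor.map(distance, c, chunksize=1000)`. Whether either approach beats the list comprehension would need to be measured on a full-size document, since the text of this post yields only a few thousand pairs.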
