How can I speed up this R code that uses stringdist?

Problem description

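I am trying to clean up our customer database by identifying customer records that are similar enough to be treated as the same customer (and therefore given the same customer ID). I have concatenated the relevant customer data into a single column named customerdata. I found the R package stringdist, and I use the following code to compute the distance between every pair of records: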

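library(stringdist)

output <- df$id
n <- nrow(df)

# Compare every pair of records (i, j) with i < j
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    # Only call stringdist() when the lengths differ by less than 10
    if (abs(df$customerdataLEN[i] - df$customerdataLEN[j]) < 10 &&
        stringdist(df$customerdata[i], df$customerdata[j]) < 10) {
      # Record j is treated as the same customer as record i
      output[j] <- df$id[i]
    }
  }
}

df$newcustomerid <- output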

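So here I first initialize a vector called output with the customer IDs. Then I loop over all pairs of customers. I have a column holding the length of customerdata (customerdataLEN in the code). To reduce computation time, I first check whether the two lengths differ by 10 or more; if they do, I don't bother computing stringdist. Otherwise, if the distance between the two customers is less than 10, the second customer is given the first customer's ID.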
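One possible first step, offered as a sketch rather than a tested fix for the real data: stringdist() is vectorised, so for each record i all candidate partners j can be scored in one call instead of one call per pair. The function name assign_ids and the sample data frame are made up for illustration, and the length column is assumed to equal nchar(customerdata).

```r
library(stringdist)

# Same pairwise rule as the double loop, but every candidate partner j of a
# given i is scored in a single vectorised stringdist() call.
# assign_ids is a hypothetical helper name, not part of the stringdist package.
assign_ids <- function(df, max_dist = 10, max_len_diff = 10) {
  n <- nrow(df)
  output <- df$id
  len <- nchar(df$customerdata)  # assumed equivalent to df$customerdataLEN
  for (i in seq_len(max(n - 1, 0))) {
    js <- (i + 1):n
    # Cheap length filter first, so stringdist() only sees plausible pairs
    js <- js[abs(len[js] - len[i]) < max_len_diff]
    if (length(js) == 0) next
    d <- stringdist(df$customerdata[i], df$customerdata[js])
    # As in the loop, a later i overwrites an earlier match for the same j
    output[js[d < max_dist]] <- df$id[i]
  }
  output
}

# Made-up sample records, for illustration only
df <- data.frame(
  id = c(101, 102, 103, 104),
  customerdata = c(
    "John Smith 12 Main Street Springfield",
    "Jon Smith 12 Main St Springfield",
    "Maria Garcia 7 Oak Avenue Shelbyville",
    "John Smith 12 Main Street Springfeld"
  )
)
df$newcustomerid <- assign_ids(df)
```

This removes the per-pair function call overhead, but the number of comparisons is unchanged, so it is unlikely to be enough on its own for a very large table.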

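However, I would like to speed this up. With 2,000 rows the loop takes 2 minutes; with 7,400 rows it takes 32 minutes. I want to run it on roughly 1,000,000 rows. Does anyone know how to make this loop faster?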
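The timings are roughly what all-pairs comparison predicts: n records give n(n-1)/2 pairs, which is 1,999,000 pairs for 2,000 rows, 27,376,300 for 7,400 rows, and about 5 × 10^11 for 1,000,000 rows. A common way to cut the pair count is blocking, which is a different technique from the loop above: records are compared only when they share a cheap key. The sketch below assumes a hypothetical key (the first three characters, lower-cased); block_key and assign_ids_blocked are illustrative names, and whether such a key suits the real data is an open question.

```r
library(stringdist)

# Number of unordered pairs an all-pairs comparison has to visit
n_pairs <- function(n) n * (n - 1) / 2

# Hypothetical blocking key: first k characters, lower-cased
block_key <- function(x, k = 3) tolower(substr(x, 1, k))

# Compare records only within the same block (illustrative helper)
assign_ids_blocked <- function(df, key = block_key(df$customerdata),
                               max_dist = 10) {
  output <- df$id
  # split() keeps row indices in ascending order inside each block
  for (rows in split(seq_len(nrow(df)), key)) {
    if (length(rows) < 2) next
    # Distance matrix for this block only
    d <- stringdistmatrix(df$customerdata[rows], df$customerdata[rows])
    for (b in 2:length(rows)) {
      hits <- which(d[seq_len(b - 1), b] < max_dist)
      # Keep the loop's rule: the last matching earlier record wins
      if (length(hits) > 0) output[rows[b]] <- df$id[rows[max(hits)]]
    }
  }
  output
}

# Made-up sample records, for illustration only
df <- data.frame(
  id = c(101, 102, 103, 104),
  customerdata = c(
    "John Smith 12 Main Street Springfield",
    "Jon Smith 12 Main St Springfield",
    "Maria Garcia 7 Oak Avenue Shelbyville",
    "John Smith 12 Main Street Springfeld"
  )
)
df$newcustomerid <- assign_ids_blocked(df)
```

The trade-off is recall: with this key, "John ..." and "Jon ..." land in different blocks and are never compared, so the key has to be chosen to tolerate the kinds of typos that actually occur in the data. The explicit length check is dropped here because an edit distance can never be smaller than the difference in string lengths.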


data-analysis data-cleaning levenshtein-distance r r stringdist