Problem description

I have two CSV files that I need to compare. The first is called SAP.csv and the second is called SAPH.csv.

SAP.csv has the following cells:

Notification    Description
5000000001      Detailed inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...and SAPH.csv has the following cells:

Notification   Description
4000000015     Detailed inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

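The two files are similar, but some entries differ slightly, for example the third data row (HPA-Carry out 4000 hour service routine versus HPA-Carry out 8000 hour service routine).

I want to compare each value in SAP.csv with each value in SAPH.csv and use cosine similarity to find the most similar row, so that the output looks like this (the similarity percentages are only examples, not the actual values):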

Description
Detailed inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

Edit after the answer was posted

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py',wdir='C:/Users/andrew.stillwell2/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(),filename,'exec'),namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in <module>
    similarity_score = similar(job,description) # Get their similarity

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod,1.0 / len(sequences))

ZeroDivisionError: float division by zero
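
The last frame divides by pow(prod, 1.0 / len(sequences)), so the error means that product was zero for some pair of descriptions. One plausible cause (an assumption, the traceback does not say which pair failed) is a description with fewer than two characters, which yields no two-character n-grams at all. The sketch below re-implements the same formula with the standard library only to show how that happens; bigrams and cosine_similarity are illustrative names, not part of textdistance.

```python
from collections import Counter
from math import sqrt

def bigrams(text):
    """Count the overlapping two-character substrings of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    """Shared bigrams divided by the geometric mean of the two bigram totals."""
    ca, cb = bigrams(a), bigrams(b)
    shared = sum((ca & cb).values())           # multiset intersection size
    prod = sum(ca.values()) * sum(cb.values())
    return shared / sqrt(prod)                 # fails when prod == 0

print(cosine_similarity("Overhaul valves", "Represerve valves"))

try:
    cosine_similarity("A", "Overhaul valves")  # "A" has no bigrams, so prod is 0
except ZeroDivisionError as err:
    print("reproduced:", err)
```

Checking the input for very short descriptions before scoring would confirm or rule out this explanation.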

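Second edit, following the solution to the problem above

The original request had only two outputs: Description and Similarity score.

Description comes from SAP; Similarity comes from the textdistance calculation.

Could the solution be modified to produce the following?

Notification (the 10-digit number in the SAP file)
Description (as it currently is)
Similarity (as it currently is)
Notification (the number from the SAPH file that produced the similarity score)

So an example output row would look like this:

80000115360 Other materials FWD rope guard 86.24% 7123456789

These would go in columns A, B, C and D:

A and B come from SAP, C is calculated, and D comes from SAPH.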
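
A minimal sketch of that four-column layout, using only the standard library rather than pandas and textdistance. The similar function here is a stand-in that applies the bigram formula shown in the traceback, and the sample rows are taken from the tables in the question; writing to io.StringIO instead of a real file is an assumption made to keep the example self-contained.

```python
import csv
import io
from collections import Counter
from math import sqrt

def similar(a, b):
    """Bigram cosine similarity as a percentage (0 if either side has no bigrams)."""
    ca = Counter(a[i:i + 2] for i in range(len(a) - 1))
    cb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    prod = sum(ca.values()) * sum(cb.values())
    return 100 * sum((ca & cb).values()) / sqrt(prod) if prod else 0.0

# Sample rows from the question: (Notification, Description)
sap = [("5000000003", "HPA-Carry out 4000 hour service routine"),
       ("5000000005", "Overhaul valves")]
saph = [("4000000017", "HPA-Carry out 8000 hour service routine"),
        ("4000000019", "Represerve valves"),
        ("4000000020", "STW System")]

rows = []
for notification, description in sap:
    # Pick the SAPH row whose description scores highest against this one
    best_notification, best_description = max(
        saph, key=lambda row: similar(description, row[1]))
    score = similar(description, best_description)
    rows.append([notification, description, f"{score:.2f}%", best_notification])

out = io.StringIO()  # stands in for open('scores.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(["Notification (SAP)", "Description", "Similarity", "Notification (SAPH)"])
writer.writerows(rows)
print(out.getvalue())
```

Keeping the notification numbers as strings throughout avoids any reformatting of long digit sequences.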

Edit 3

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'})

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer,**kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f,**self.options)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src,**kwds)

  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type 'string' not understood
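
The 'string' alias refers to the dedicated string dtype that pandas added in version 1.0, so an older installation rejects it with this TypeError; that the install here is older is an assumption based on the traceback. On such versions, dtype={'Notification': str} is the usual spelling. The underlying goal is to keep the notification numbers as text, and the sketch below shows that goal with the standard library csv module, which always yields strings. The inline sample and the row with leading zeros are hypothetical.

```python
import csv
import io

# Inline sample standing in for SAP.csv (assumed to be comma-delimited)
sample = io.StringIO(
    "Notification,Description\n"
    "5000000001,Detailed inspection of Masts (2100mm) (3\n"
    "0000000042,Overhaul valves\n"  # hypothetical row with leading zeros
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Every field arrives as str, so digits and leading zeros are preserved
    print(repr(row["Notification"]), row["Description"])
```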

Edit 4 - 25/10/20

Hi, so I am getting the same error, I think.

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line ..., in runfile
    execfile(filename,dtype={'Notification':'string'},delimiter=",",engine="python")

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv,clean_dtypes)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals,cast_type,c)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True,skipna=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type 'string' not understood

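I had a thought about the delimiter, so I uploaded one of the CSV files to repl.it, and it looks as though "," is the delimiter.

So I changed the code to suit. When I run it on repl.it, it works.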
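
The delimiter can also be checked locally without uploading the file anywhere. A minimal sketch using csv.Sniffer from the standard library; the inline sample stands in for the first lines of the real file, and the candidate delimiter list is an assumption.

```python
import csv

# First few lines of the file, inlined here; in practice they could be read
# with open(path, newline='').read(2048). Rows are sample data from the question.
sample = (
    "Notification,Description\n"
    "5000000001,Detailed inspection of Masts (2100mm) (3\n"
    "5000000005,Overhaul valves\n"
)

# Restrict the guess to a few likely delimiters so spaces in the text are ignored
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(repr(dialect.delimiter))
```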

This is the code I am using:

import textdistance
import pandas as pd

def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
  return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'},delimiter=",",engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv',dtype={'Notification':'string'},delimiter=",",engine="python")

# Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'],columns = ['Notification (SAP)','Description','Similarity','Notification (SAPH)'])

# Temporary variables to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = 0 # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job,description) # Get their similarity

    if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so,update highest_score with the new values
      highest_score = similarity_score
      desc = str(description)
    if(similarity_score == 100): # If it's a perfect match,don't bother continuing to search.
      break
  # Update the dataframe 'scores' with the highest score and the other values
  print(SAPH['Description'][SAPH['Description'] == desc])
  scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
  scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
  scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to scores.csv without the index column
with open('./scores.csv','w') as file:
  file.write(scores.__repr__())

This is running on Spyder (Python 3.7).

Solution

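@George_Pipas's answer to this question demonstrates an example using the textdistance library (I am quoting part of his answer here):

One solution is to use the textdistance library. I will provide an example of Cosine Similarity:
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple','Appel')
and we get:
0.5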
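
To see where 0.5 comes from, the calculation can be worked by hand. This sketch uses only the standard library and assumes the formula visible in the traceback in the question (shared n-grams divided by the square root of the product of the two n-gram totals); bigram_counts is an illustrative name, not a textdistance function.

```python
from collections import Counter
from math import sqrt

def bigram_counts(text):
    """Count the overlapping two-character substrings of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

a = bigram_counts("Apple")   # Ap, pp, pl, le
b = bigram_counts("Appel")   # Ap, pp, pe, el
shared = a & b               # multiset intersection: Ap and pp
score = sum(shared.values()) / sqrt(sum(a.values()) * sum(b.values()))
print(sorted(shared), score)  # 2 shared bigrams out of 4 on each side
```

Two of the four bigrams on each side are shared, so the score is 2 / sqrt(4 * 4).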

So we can create a function that finds the similarity:

def similar(a,b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)     
    return similarity

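This function returns a number closer to 1 the more similar a and b are, and closer to 0 the less similar they are. So if a == b the output is 1, and if a != b the output is generally less than 1.

To get a percentage, you just need to multiply the output by 100: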

def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b) 
    return similarity * 100

Reading the CSV files is easy with pandas:

# Read the CSVs
SAP = pd.read_csv('SAP.csv') 
SAPH = pd.read_csv('SAPH.csv')

We create another pandas dataframe to store the results we are about to compute:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']},columns = ['SAP','SAPH','Similarity'])

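Now we iterate over SAP['Description'] and SAPH['Description'], compare every element of one with every element of the other, compute their similarity, and save the highest score to scores.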

# Temporary variable to store both the highest similarity score,and the 'SAPH' value the score was computed with
highest_score = {"score": 0,"description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0,"description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job,description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so,update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match,don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score (.loc avoids chained assignment)
  scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
  scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']

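Here is a breakdown:

A temporary variable highest_score is created to store the highest computed score.
Next we iterate over SAP['Description'] and, inside that loop, over SAPH['Description']. This lets us compare each value of SAP['Description'] (job) with each value of SAPH['Description'] (description).
While iterating over SAPH['Description'], we:
1. Compute the similarity score of job and description.
2. If it is higher than the score saved in highest_score, update highest_score accordingly; otherwise carry on.
3. If similarity_score equals 100, we know it is a perfect match and there is no need to keep looking, so we break out of the loop.
Outside the SAPH['Description'] loop, once job has been compared with every element of SAPH['Description'] (or a perfect match has been found), we save the values to scores.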
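
The three steps can be sketched as a small reusable function. This is an illustration rather than the code from the answer: best_match and ratio_percent are hypothetical names, and difflib.SequenceMatcher from the standard library is swapped in as the scoring function so that the example runs without textdistance.

```python
from difflib import SequenceMatcher

def ratio_percent(a, b):
    """Stand-in score in the range 0..100 (difflib ratio, not cosine similarity)."""
    return SequenceMatcher(None, a, b).ratio() * 100

def best_match(job, candidates, score):
    """Return (best_candidate, best_score) for job among candidates."""
    best, best_score = None, 0
    for candidate in candidates:
        s = score(job, candidate)      # step 1: compute the similarity
        if s > best_score:             # step 2: keep it if it is the best so far
            best, best_score = candidate, s
        if s == 100:                   # step 3: perfect match, stop searching
            break
    return best, best_score

saph = ["Represerve valves", "STW System"]
print(best_match("Overhaul valves", saph, ratio_percent))
print(best_match("STW System", saph, ratio_percent))
```

Passing the scoring function as an argument makes it easy to switch back to the textdistance-based similar function.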

This is repeated for every element of SAP['Description'].

scores looks like this:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

And after outputting it to a CSV file with the following:

# Output it to Scores.csv without the index column (0,1,2,3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv',index=False)

... Scores.csv looks like this:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488

View the full code, and run and edit it online

Note that textdistance and pandas are the libraries required for this. If you have not installed them yet, use:

pip install textdistance pandas

Notes:

You can round the percentages by replacing f'{highest_score}%' with f'{round(highest_score,NUMBER_OF_PLACES_TO_ROUND_TO)}%'
Here's a formatted version, and here's the code

Edit: (for the issue mentioned in the comments)

Here is an error-catching version of the similarity function:
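
The answer's own error-catching code is not reproduced here. A minimal sketch of what such a version might look like, assuming the aim is to return 0 instead of raising the ZeroDivisionError from the question: safe_similar and demo_metric are hypothetical names, and demo_metric is only a stand-in so the example runs without textdistance. With textdistance installed, the metric would be a function returning 1-textdistance.Cosine(qval=2).distance(a,b).

```python
def safe_similar(a, b, metric):
    """Return metric(a, b) as a percentage, or 0 if it cannot be computed."""
    try:
        return metric(a, b) * 100
    except ZeroDivisionError:
        # Raised when an input gives the metric nothing to compare
        return 0

def demo_metric(a, b):
    # Hypothetical stand-in: shared characters over all characters;
    # divides by zero when both inputs are empty
    return len(set(a) & set(b)) / len(set(a) | set(b))

print(safe_similar("valves", "valves", demo_metric))
print(safe_similar("", "", demo_metric))
```

Returning 0 treats an uncomparable pair as "no similarity", which keeps the search loop running; logging the offending pair instead would be a reasonable alternative.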

Compare each element of a CSV file with each element of a different CSV file and find the most similar element