Compare each element of a CSV file with each element of a different CSV file and find the most similar element

Question

I have two CSV files that I need to compare. The first is called SAP.csv and the second is called SAPH.csv.

SAP.csv has the following cells:

Notification    Description
5000000001      Detailed inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...and SAPH.csv has the following cells:

Notification   Description
4000000015     Detailed inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

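They are similar, but some entries differ slightly, for example the HPA row (HPA-Carry out 4000 hour service routine versus HPA-Carry out 8000 hour service routine).

I want to compare each value of SAP.csv with each value of SAPH.csv and use cosine similarity to find the most similar row, so that the output looks something like this (the similarity percentages are only examples, not what the real values would be):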

Description
Detailed inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

Edit after the answer was posted

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py',wdir='C:/Users/andrew.stillwell2/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(),filename,'exec'),namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in <module>
    similarity_score = similar(job,description) # Get their similarity

  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)

  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod,1.0 / len(sequences))

ZeroDivisionError: float division by zero

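Second edit, following the solution to the problem above

The original request had only two outputs: Description and Similarity score.

Description comes from SAP; Similarity comes from the textdistance calculation.

Could the solution be modified to produce the following?

Notification (the 10-digit number from the SAP file)    Description (as it is currently)    Similarity (as it is currently)    Notification (the number from the SAPH file for the row that gave the similarity score)

So an example output row would look like this:

80000115360    Other materials FWD rope guard    86.24%    7123456789

This would run across columns A, B, C and D.

A and B come from SAP, C is calculated, and D comes from SAPH.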

Edit 3

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)

  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'})

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer,**kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f,**self.options)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src,**kwds)

  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type 'string' not understood

Edit 4 - 25/10/20

Hi, so I am getting what I think is the same error:

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line ..., in runfile
    execfile(filename,dtype={'Notification':'string'},delimiter=",",engine="python")

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv,clean_dtypes)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals,cast_type,c)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True,skipna=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))

TypeError: data type 'string' not understood

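I read up on delimiters, so I uploaded one of the CSV files to repl.it, and it looks as though "," is the delimiter.

So I changed the code to suit. When I ran it on repl.it, it worked.

Here is the code I am using: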

import textdistance
import pandas as pd

def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
  return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'},delimiter=",",engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv',dtype={'Notification':'string'},delimiter=",",engine="python")

# Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'],columns=['Notification(SAP)','Description','Similarity','Notification(SAPH)'])

# Temporary variables to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = 0 # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job,description) # Get their similarity

    if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so,update highest_score with the new values
      highest_score = similarity_score
      desc = str(description)
    if(similarity_score == 100): # If it's a perfect match,don't bother continuing to search.
      break
  # Update the dataframe 'scores' with the highest score and the other values
  print(SAPH['Description'][SAPH['Description'] == desc])
  scores['Notification(SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
  scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
  scores['Notification(SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to scores.csv without the index column
with open('./scores.csv','w') as file:
  file.write(scores.__repr__())

This is being run in Spyder (Python 3.7).

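Answer

@George_Pipas's answer to this question shows an example that uses the textdistance library (I am paraphrasing part of his answer here):

One solution is to use the textdistance library. I will provide an example of Cosine Similarity:
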
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple','Appel')

We get:

0.5
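
To see where the 0.5 comes from, here is a minimal pure-Python sketch of a bigram cosine similarity. It assumes that Cosine(qval=2) splits each string into overlapping two-character pieces and divides the number of shared pieces by the geometric mean of the two piece counts, which is consistent with the last frame of the traceback in the question. The helper names bigrams and bigram_cosine are made up for this illustration and are not part of textdistance.

```python
from collections import Counter
from math import sqrt

def bigrams(text):
    """Count the overlapping two-character substrings of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def bigram_cosine(a, b):
    """Shared bigrams divided by the geometric mean of the two bigram counts."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total_a, total_b = sum(grams_a.values()), sum(grams_b.values())
    return shared / sqrt(total_a * total_b)

# 'Apple' -> Ap, pp, pl, le and 'Appel' -> Ap, pp, pe, el: 2 shared out of 4 each.
print(bigram_cosine('Apple', 'Appel'))  # 0.5
```

Like the library call in the traceback, this sketch divides by zero when either string is shorter than two characters, because such a string has no bigrams.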

So we can create a function that finds the similarity:

def similar(a,b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)     
    return similarity

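This outputs a number closer to 1 the more similar a and b are, and closer to 0 the less similar they are. So if a == b the output is 1, and if they differ the output will generally be less than 1.

To get a percentage, you just need to multiply the output by 100: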

def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b) 
    return similarity * 100

Reading the CSV files is easy with pandas:

import pandas as pd
# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')
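
If the Notification numbers should stay as text rather than be parsed as integers, a dtype can be passed to read_csv. The following is a minimal sketch that uses inline sample data in place of the real files (the csv_text variable is made up for illustration). Passing str as the dtype is accepted by old and new pandas alike, whereas the 'string' alias is only understood from pandas 1.0 onward, which would explain the TypeError quoted in the question if an older pandas is installed.

```python
import io
import pandas as pd

# Two rows shaped like SAP.csv, inlined so the sketch is self-contained.
csv_text = "Notification,Description\n5000000001,Overhaul valves\n5000000002,STW System\n"

# dtype=str keeps the column as text on any pandas version.
sap = pd.read_csv(io.StringIO(csv_text), dtype={'Notification': str})

first_id = sap['Notification'].iloc[0]
print(type(first_id).__name__, first_id)
```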

We create another pandas dataframe to store the results we are about to compute:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']},columns = ['SAP','SAPH','Similarity']) 

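Now we iterate over SAP['Description'] and SAPH['Description'], compare every element of one with every element of the other, compute their similarity, and save the highest to scores: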

# Temporary variable to store both the highest similarity score,and the 'SAPH' value the score was computed with
highest_score = {"score": 0,"description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0,"description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job,description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so,update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match,don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score (.loc avoids chained assignment, which newer pandas may silently ignore)
  scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
  scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']

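Here is a breakdown:

  1. A temporary variable, highest_score, is created to store the highest score computed so far.
  2. We iterate over SAP['Description'] and, inside that loop, over SAPH['Description']. This lets us compare every value of SAP['Description'] (job) with every value of SAPH['Description'] (description).
  3. While iterating over SAPH['Description'], we:
    1. Compute the similarity score of job and description.
    2. If it is higher than the score saved in highest_score, update highest_score accordingly; otherwise, carry on.
    3. If similarity_score equals 100, we know it is a perfect match and need not look any further, so we break out of the loop.
  4. Outside the SAPH['Description'] loop, now that job has been compared with every element of SAPH['Description'] (or a perfect match has been found), we save the values to scores.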
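
The steps above can also be sketched without pandas. The version below is an illustration rather than the code from this answer: it swaps textdistance for a small pure-Python bigram cosine so that it runs on its own, and it carries the Notification numbers along so that each result row has the four columns asked for in the second edit of the question. The function name best_matches and the sample rows are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def similar(a, b):
    """Bigram cosine similarity as a percentage (pure-Python stand-in for textdistance)."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    denominator = sqrt(sum(grams_a.values()) * sum(grams_b.values()))
    if denominator == 0:  # at least one string has no bigrams
        return 0.0
    return 100 * sum((grams_a & grams_b).values()) / denominator

def best_matches(sap_rows, saph_rows):
    """For each (notification, description) in sap_rows, find the closest saph_rows entry."""
    results = []
    for sap_id, job in sap_rows:
        best_score, best_id = 0.0, None  # reset for every SAP row
        for saph_id, description in saph_rows:
            score = similar(job, description)
            if score > best_score:
                best_score, best_id = score, saph_id
            if score == 100:  # perfect match, stop searching
                break
        results.append((sap_id, job, round(best_score, 2), best_id))
    return results

sap = [('5000000003', 'HPA-Carry out 4000 hour service routine'),
       ('5000000005', 'Overhaul valves')]
saph = [('4000000017', 'HPA-Carry out 8000 hour service routine'),
        ('4000000019', 'Represerve valves'),
        ('4000000020', 'STW System')]

for row in best_matches(sap, saph):
    print(row)
```

Returning plain tuples keeps the sketch independent of any dataframe layout; the rows could be written out with the csv module or turned into a dataframe afterwards.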

This is repeated for every element of SAP['Description'].

scores then looks like this:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

And after outputting it to a CSV file with:

# Output it to Scores.csv without the index column (0,1,2,3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv',index=False)

... Scores.csv looks like this:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488

Note that pandas and textdistance are the libraries required for this. If you have not installed them yet, use:

pip install pandas textdistance

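Notes:

To round the similarity to a fixed number of decimal places when formatting it, use:

f'{round(highest_score,NUMBER_OF_PLACES_TO_ROUND_TO)}%'

Edit: (for the problem mentioned in the comments)

Here is an error-catching version of the similarity function: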
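
A minimal sketch of an error-catching wrapper follows. It is an assumption about the approach, not the answer's original listing. The names safe_similar, score_fn, fallback and demo_score are hypothetical: score_fn stands in for the similar function, and demo_score is a toy scorer used only so that the sketch runs without textdistance. Judging by the traceback in the question, the ZeroDivisionError arises when one description yields no bigrams (for example an empty or one-character cell), so the wrapper returns a fallback score in that case.

```python
def safe_similar(score_fn, a, b, fallback=0.0):
    """Call score_fn(a, b) and return fallback if it cannot produce a score."""
    try:
        return score_fn(a, b)
    except (ZeroDivisionError, TypeError):
        # ZeroDivisionError: one input gave the scorer nothing to compare.
        # TypeError: one input was not a string (for example a NaN from a blank cell).
        return fallback

def demo_score(a, b):
    """Toy scorer (hypothetical): shared characters over the geometric mean of set sizes."""
    return 100 * len(set(a) & set(b)) / (len(set(a)) * len(set(b))) ** 0.5

print(safe_similar(demo_score, 'abc', 'abc'))         # normal case
print(safe_similar(demo_score, '', 'abc'))            # empty cell -> fallback
print(safe_similar(demo_score, float('nan'), 'abc'))  # non-string cell -> fallback
```

In the loop above, the call similar(job,description) would then become safe_similar(similar,job,description), so a single bad row no longer stops the whole run.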