Problem description
I have two CSV files that I need to compare. The first is called SAP.csv and the second is called SAPH.csv.
SAP.csv has these cells:
Notification Description
5000000001 Detailed inspection of Masts (2100mm) (3
5000000002 Ceremonial Awnings-Survey and Load Test
5000000003 HPA-Carry out 4000 hour service routine
5000000004 UxE 8 in Number Temperature Probs for C
5000000005 Overhaul valves
...and SAPH.csv has these cells:
Notification Description
4000000015 Detailed inspection of Masts (2100mm) (3
4000000016 Ceremonial Awnings-Survey and Load Test
4000000017 HPA-Carry out 8000 hour service routine
4000000018 UxE 8 in Number Temperature Probs for C
4000000019 Represerve valves
4000000020 STW System
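They are similar, but some rows differ slightly, for example line four ("HPA-Carry out 4000 hour service routine" versus "HPA-Carry out 8000 hour service routine").
I want to compare every value of SAP.csv with every value of SAPH.csv and use cosine similarity to find the most similar row, so that the output looks something like this (the similarity percentages are only examples, not the real values):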
Description
Detailed inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test - 100%
HPA-Carry out 4000 hour service routine - 85%
UxE 8 in Number Temperature Probs for C - 90%
Overhaul valves - 0%
Edit after the answer was posted
runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py',wdir ='C:/Users/andrew.stillwell2/.spyder-py3')
Traceback (most recent call last):
  File "", line 1, in <module>
    runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py',wdir='C:/Users/andrew.stillwell2/.spyder-py3')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(),filename,'exec'),namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in <module>
    similarity_score = similar(job,description) # Get their similarity
  File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
    return self.maximum(*sequences) - self.similarity(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
    return self(*sequences)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in __call__
    return intersection / pow(prod,1.0 / len(sequences))
ZeroDivisionError: float division by zero
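Second edit, made as a result of the solution to the above issue
So, the original request only had two outputs: Description and Similarity score.
Description comes from SAP; Similarity comes from the textdistance calculation.
Notification (this is the 10-digit number from the SAP file) | Description (as it is currently) | Similarity (as it is currently) | Notification (this number comes from the SAPH file, the one that gave the similarity score)
So an example output row would look like this:
80000115360 | Other material FWD rope guard | 86.24% | 7123456789
This would run along columns A, B, C, D
A and B come from SAP, C is calculated, D comes from SAPH
Edit 3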
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename,namespace)
  File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in <module>
    SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'})
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer,**kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f,**self.options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src,**kwds)
  File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.__cinit__
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
Post edit 4 - 25/10/20
Hi, so I am getting what I think is the same error
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line
    execfile(filename,dtype={'Notification':'string'},delimiter=",",engine="python")
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer,kwds)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
    data = self._convert_data(data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
    clean_conv,clean_dtypes)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
    cvals = self._cast_types(cvals,cast_type,c)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
    copy=True,skipna=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
    dtype = pandas_dtype(dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
    dtype))
TypeError: data type 'string' not understood
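I had a thought about the delimiter, so I uploaded one of the csv files to repl.it, and it looks like "," is the delimiter.
So I have changed the code to suit. When I ran this on repl.it, it worked.
This is the code I am using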
import textdistance
import pandas as pd

def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
    return similarity * 100

# Read the CSVs
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv',dtype={'Notification':'string'},delimiter=",",engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv',dtype={'Notification':'string'},delimiter=",",engine="python")

# Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']
scores = pd.DataFrame(SAP['Description'],columns = ['Notification(SAP)','Description','Similarity','Notification(SAPH)'])

# Temporary variables to store the highest similarity score
highest_score = 0
desc = 0

# Iterate through SAP['Description']
for job in SAP['Description']:
    highest_score = 0 # Reset highest_score at each iteration
    for description in SAPH['Description']: # Iterate through SAPH['Description']
        similarity_score = similar(job,description) # Get their similarity
        if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so,update highest_score with the new values
            highest_score = similarity_score
            desc = str(description)
        if(similarity_score == 100): # If it's a perfect match,don't bother continuing to search.
            break
    # Update the dataframe 'scores' with highest_score and the other values
    print(SAPH['Description'][SAPH['Description'] == desc])
    scores['Notification(SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
    scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
    scores['Notification(SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

# Output it to scores.csv without the index column
with open('./scores.csv','w') as file:
    file.write(scores.__repr__())
Which is being run on Spyder (Python 3.7)
Solution
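@George_Pipas's answer to this question demonstrates an example using the library textdistance (I'm paraphrasing part of his answer here):
A solution is to use the textdistance library. I will provide an example of Cosine Similarity:
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple','Appel')
and we get:
0.5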
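To see where the 0.5 comes from, here is a minimal pure-Python sketch of a bigram cosine score. It is a stand-in written for illustration (the helper names `qgrams` and `bigram_cosine` are made up here), assuming that `Cosine(qval=2)` splits each string into overlapping two-character chunks and divides the number of shared chunks by the geometric mean of the two chunk counts. It reproduces the value quoted above, but it is not textdistance's own code.

```python
from collections import Counter
from math import sqrt

def qgrams(text, q=2):
    """Count the overlapping q-character chunks of text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def bigram_cosine(a, b):
    """Shared bigrams divided by the geometric mean of both bigram counts."""
    grams_a, grams_b = qgrams(a), qgrams(b)
    shared = sum((grams_a & grams_b).values())
    return shared / sqrt(sum(grams_a.values()) * sum(grams_b.values()))

# 'Apple' -> Ap, pp, pl, le   and   'Appel' -> Ap, pp, pe, el
# Two of the four bigrams are shared, so the score is 2 / sqrt(4 * 4).
print(bigram_cosine('Apple', 'Appel'))  # 0.5
```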
So, we can create a function that finds the similarity:
def similar(a,b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
    return similarity
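In terms of similarity, this outputs a number closer to 1 when a and b are more similar, and closer to 0 when they are less similar. So if a == b, the output will be 1, and if a != b, the output will generally be less than 1.
To get a percentage, you just need to multiply the output by 100.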
def similar(a,b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a,b)
    return similarity * 100
Reading CSV files can be done pretty easily with pandas:
import pandas as pd
# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')
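If the Notification numbers should stay as text rather than being parsed as integers, a dtype can be passed to `read_csv`. The following is a minimal sketch, assuming the CSV has the Notification and Description columns shown in the question; the inline sample data is made up so the snippet runs on its own. The `'string'` dtype alias only exists in pandas 1.0 and later, so older installations reject it with "data type 'string' not understood", whereas plain `str` is accepted by old and new versions alike.

```python
import io
import pandas as pd

# Stand-in for SAP.csv (sample rows only, so the snippet is self-contained)
csv_text = "Notification,Description\n5000000001,Overhaul valves\n5000000002,STW System\n"

# dtype=str keeps the 10-digit numbers as text instead of integers
SAP = pd.read_csv(io.StringIO(csv_text), dtype={'Notification': str})
print(SAP['Notification'].tolist())
```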
We create another pandas dataframe to store the results we are going to compute:
# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']},columns = ['SAP','SAPH','Similarity'])
Now we iterate through SAP['Description'] and SAPH['Description'], compare every element against each other, compute their similarity, and save the highest to scores.
# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0,"description": ""}

# Iterate through SAP['Description']
for job in SAP['Description']:
    highest_score = {"score": 0,"description": ""} # Reset highest_score at each iteration
    for description in SAPH['Description']: # Iterate through SAPH['Description']
        similarity_score = similar(job,description) # Get their similarity
        if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
            highest_score['score'] = similarity_score
            highest_score['description'] = description
        if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
            break
    # Update the dataframe 'scores' with highest_score (.loc is used to avoid pandas' chained-assignment pitfall)
    scores.loc[scores['SAP'] == job, 'SAPH'] = highest_score['description']
    scores.loc[scores['SAP'] == job, 'Similarity'] = highest_score['score']
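Here's a breakdown:
- A temporary variable, highest_score, is created to store the highest computed score.
- Now we iterate through SAP['Description'], and inside that loop we iterate through SAPH['Description']. This lets us compare every value of SAP['Description'] (job) with every value of SAPH['Description'] (description).
- While iterating through SAPH['Description'], we:
  - Compute the similarity score of job and description
  - If it is higher than the score saved in highest_score, we update highest_score accordingly; otherwise we carry on
  - If similarity_score equals 100, we know it is a perfect match and don't have to keep looking. In that case we break out of the loop.
- Outside the SAPH['Description'] loop, now that we have compared job with every element of SAPH['Description'] (or found a perfect match), we save the values to scores.
This is repeated for every element of SAP['Description'].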
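The same search can be sketched without pandas, which makes the logic easy to try in isolation. This is an illustrative variant rather than the answer's code: `bigram_similarity` is a pure-Python stand-in for the textdistance call, `best_match` is a hypothetical helper, and the two lists are copied from the question's sample data.

```python
from collections import Counter
from math import sqrt

def bigram_similarity(a, b):
    """Stand-in for similar(): bigram cosine score as a percentage."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    denominator = sqrt(sum(grams_a.values()) * sum(grams_b.values()))
    if denominator == 0:  # a string shorter than two characters has no bigrams
        return 0.0
    return 100 * sum((grams_a & grams_b).values()) / denominator

def best_match(job, candidates):
    """Return (description, score) for the candidate most similar to job."""
    best = max(candidates, key=lambda description: bigram_similarity(job, description))
    return best, bigram_similarity(job, best)

sap = ['HPA-Carry out 4000 hour service routine', 'Overhaul valves']
saph = ['Ceremonial Awnings-Survey and Load Test',
        'HPA-Carry out 8000 hour service routine',
        'Represerve valves',
        'STW System']

for job in sap:
    description, score = best_match(job, saph)
    print(f'{job} -> {description} ({score:.2f}%)')
```

Using `max` with a key function replaces the manual bookkeeping of the highest score, at the cost of losing the early exit on a perfect match.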
Here's what scores looks like:
   SAP                                       SAPH                                      Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3  100
1  Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test   100
2  HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine   94.7368
3  UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C   100
4  Overhaul valves                           Represerve valves                         53.4522
And after outputting it to a CSV file with the following:
# Output it to Scores.csv without the index column (0,1,2,3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv',index=False)
... Scores.csv looks like this:
SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488
View the full code, and run and edit it online
Note that textdistance and pandas are the libraries required for this. If you haven't installed them already, use:
pip install textdistance pandas
Notes:
- You can round the percentages by replacing f'{highest_score}%' with f'{round(highest_score,NUMBER_OF_PLACES_TO_ROUND_TO)}%'
- Here's a formatted version and here's the code
EDIT: (for the problems mentioned in the comments)
Here's an error-catching version of the similarity function:
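As a minimal sketch of one way to guard the similarity function (an illustration, not necessarily the answer's own version): the wrapper below returns 0 instead of raising. It rests on two assumptions: that the ZeroDivisionError in the question appears when a description is too short to produce any bigram, and that blank CSV cells arrive as non-string values such as NaN. The names `similar_safe` and `naive_bigram_cosine` are hypothetical; the latter is only a pure-Python stand-in scorer so that the snippet runs without textdistance.

```python
from collections import Counter
from math import sqrt

def naive_bigram_cosine(a, b):
    """Unprotected stand-in scorer: raises ZeroDivisionError if a string has no bigrams."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    shared = sum((grams_a & grams_b).values())
    return shared / sqrt(sum(grams_a.values()) * sum(grams_b.values()))

def similar_safe(a, b, score):
    """Return score(a, b) as a percentage, or 0.0 when it cannot be computed."""
    if not isinstance(a, str) or not isinstance(b, str):
        return 0.0  # e.g. NaN read from an empty CSV cell
    try:
        return score(a, b) * 100
    except ZeroDivisionError:
        return 0.0  # e.g. a one-character description with no bigrams

print(similar_safe('A', 'Overhaul valves', naive_bigram_cosine))
print(similar_safe(float('nan'), 'Overhaul valves', naive_bigram_cosine))
print(similar_safe('Apple', 'Appel', naive_bigram_cosine))
```

With textdistance installed, the scorer from the answer could be passed in instead, for example `lambda a, b: 1 - textdistance.Cosine(qval=2).distance(a, b)`.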