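Problem description
I have two CSV files: profile.csv and data.csv. Both files hold their data in two columns, company_name and job_description.
I want to compare the descriptions (qualifications) in profile.csv with the descriptions in data.csv, and get output that says whether each individual description (qualification) matches or not...
In my view, the output should look like this:
company ----------------------
PPD GLOBAL LTD
job_description ---------
Education to Bachelors/advanced degree level in a scientific discipline - matching
Experience in regulatory medical writing - matching
Excellent grammatical, editorial and proofreading skills - matching
Effective organisational and planning skills - matching
Motivation, initiative and adaptability. Ability to work effectively in a team - matching
This is what I have tried so far.
It only matches the whole job_description, not each sentence...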
import csv

with open('C:\\Users\\Izzath Ali\\Desktop\\Data Mining\\profile.csv','rt',encoding='utf-8') as csvfile1:
    csvfile1_indices = dict((r[1],i) for i,r in enumerate(csv.reader(csvfile1)))

with open('C:\\Users\\Izzath Ali\\Desktop\\Data Mining\\data.csv',encoding='utf-8') as csvfile2:
    with open('outputText-mining.csv','w') as results:
        reader = csv.reader(csvfile2)
        writer = csv.writer(results)
        writer.writerow(next(reader,[]) + ['status'])
        for row in reader:
            index = csvfile1_indices.get(row[1])
            if index is not None:
                message = '-- matching'
                writer.writerow(row + [message])
            else:
                message = '-- not matching'
                writer.writerow(row + [message])
        results.close()
Solution
I'll simplify your data structure so that the approach can be demonstrated:
file1 = """company1,"sent1. sent2. sent3"
company2,"sent4. sent5. sent6"
company3,"sent7. sent8. sent9"
"""
file2 = """companyA,"sent1. sent20. sent3"
companyB,"sent40. sent5."
companyC,"sent5. sent1. sent60"
"""
First, I load the data into a data structure: a list of dictionaries, one dictionary per company.
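import csv
import io

# csv.reader strips the quotes around the description column;
# io.StringIO lets it read from an in-memory string.
list_of_file1_company_dicts = []
for col in csv.reader(io.StringIO(file1)):
    if len(col) < 2:
        continue  # skip blank or incomplete lines
    company_dict = {}
    print('company:', col[0])
    # Split into sentences, trim whitespace, drop empty fragments.
    list_of_sent = [s.strip() for s in col[1].split('.') if s.strip()]
    company_dict[col[0]] = list_of_sent
    list_of_file1_company_dicts.append(company_dict)

list_of_file2_company_dicts = []
for col in csv.reader(io.StringIO(file2)):
    if len(col) < 2:
        continue  # skip blank or incomplete lines
    company_dict = {}
    print('company:', col[0])
    list_of_sent = [s.strip() for s in col[1].split('.') if s.strip()]
    company_dict[col[0]] = list_of_sent
    list_of_file2_company_dicts.append(company_dict)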
Then loop over both data structures to find the intersection of the dictionary values:
for file1_company_dict in list_of_file1_company_dicts:
    for company1_name,list_of_sent1 in file1_company_dict.items():
        for sent1 in list_of_sent1:
            for file2_company_dict in list_of_file2_company_dicts:
                for company2_name,list_of_sent2 in file2_company_dict.items():
                    for sent2 in list_of_sent2:
                        if sent1==sent2:
                            print(company1_name,company2_name)
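To get the per-sentence status that the question asks for, one option is to collect every sentence from the second file into a set and look up each sentence from the first file in it. What follows is only a minimal sketch under stated assumptions, not a definitive implementation: the helper names split_sentences, load_rows and sentence_report are hypothetical, the sample strings stand in for profile.csv and data.csv, and it assumes that sentences end with a period and that "matching" means identical text once case and surrounding whitespace are ignored.

```python
import csv
import io


def split_sentences(description):
    """Split a description on '.', trim whitespace, drop empty pieces."""
    return [s.strip() for s in description.split('.') if s.strip()]


def load_rows(text):
    """Return a list of (company, [sentences]) pairs parsed from CSV text."""
    return [(row[0], split_sentences(row[1]))
            for row in csv.reader(io.StringIO(text))
            if len(row) >= 2]  # skip blank or incomplete rows


def sentence_report(profile_text, data_text):
    """Label every profile sentence as 'matching' or 'not matching'."""
    # Assumption: a sentence matches if it appears anywhere in the data file.
    known = {s.lower() for _, sents in load_rows(data_text) for s in sents}
    report = []
    for company, sents in load_rows(profile_text):
        for s in sents:
            status = 'matching' if s.lower() in known else 'not matching'
            report.append((company, s, status))
    return report


# Stand-in data; real code would read the two CSV files instead.
profile_text = 'company1,"sent1. sent2. sent3"\n'
data_text = 'companyA,"sent1. sent20. sent3"\ncompanyB,"sent40. sent5."\n'

for company, sentence, status in sentence_report(profile_text, data_text):
    print(company, '|', sentence, '--', status)
```

The set lookup avoids the six nested loops, but it only finds exact sentence matches. Descriptions that use abbreviations such as "e.g." or that differ slightly in wording would need a proper sentence tokenizer or a fuzzy comparison. The two arguments can be swapped if the data.csv sentences are the ones that should be labelled.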