Compare two CSV files and write matching / not matching output to a new CSV file

Problem description

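I have two CSV files, profile.csv and data.csv. profile.csv has data in two columns, company_name and job_description. data.csv has data in the same two columns, company_name and job_description.

What I want is for each description (qualification) in profile.csv to be compared against the descriptions in data.csv, with output that says whether each description (qualification) matches or not...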

As I see it, the output should look like this:

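company ----------------------

PPD GLOBAL LTD

job_description ---------

Bachelor's/advanced degree in a scientific discipline - matching

Experience in regulatory medical writing - matching

Excellent grammar, editing and proofreading skills - matching

Effective organizational and planning skills - matching

Motivation, initiative and adaptability; ability to work effectively in a team - matching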

This is what I have tried so far.

It only matches the entire job_description, not each sentence...

import csv

with open('C:\\Users\\Izzath  Ali\\Desktop\\Data Mining\\profile.csv','rt',encoding='utf-8') as csvfile1:
    csvfile1_indices = dict((r[1],i) for i,r in enumerate(csv.reader(csvfile1)))

with open('C:\\Users\\Izzath  Ali\\Desktop\\Data Mining\\data.csv',encoding='utf-8') as csvfile2:
    with open('outputText-mining.csv','w') as results:
        reader = csv.reader(csvfile2)
        writer = csv.writer(results)

        writer.writerow(next(reader,[]) + ['status'])

        for row in reader:
            index = csvfile1_indices.get(row[1])
            if index is not None:
               message = '-- matching'
               writer.writerow(row + [message])

            else:
               message = '-- not matching'
               writer.writerow(row + [message])


Solution

I'll simplify your data structure so that I can demonstrate the idea:

file1 = """company1,"sent1. sent2. sent3"
company2,"sent4. sent5. sent6"
company3,"sent7. sent8. sent9"
"""

file2 = """companyA,"sent1. sent20. sent3"
companyB,"sent40. sent5."
companyC,"sent5. sent1. sent60"
"""

First, I load the data into a data structure: a list of dictionaries, one dictionary per company.

import csv
import io

def load_company_dicts(text):
    """Parse CSV text into a list of {company: [sentence, ...]} dicts."""
    company_dicts = []
    # csv.reader strips the quotes around the second column for us
    for col in csv.reader(io.StringIO(text)):
        if len(col) < 2:
            continue  # skip blank or incomplete rows
        print('company:', col[0])
        # split the description into sentences, dropping whitespace and empty pieces
        list_of_sent = [s.strip() for s in col[1].split('.') if s.strip()]
        company_dicts.append({col[0]: list_of_sent})
    return company_dicts

list_of_file1_company_dicts = load_company_dicts(file1)
list_of_file2_company_dicts = load_company_dicts(file2)

Then loop over both data structures to find the intersection of the dictionary values:

for file1_company_dict in list_of_file1_company_dicts:
    for company1_name,list_of_sent1 in file1_company_dict.items():
        for sent1 in list_of_sent1:
            for file2_company_dict in list_of_file2_company_dicts:
                for company2_name,list_of_sent2 in file2_company_dict.items():
                    for sent2 in list_of_sent2:
                        if sent1==sent2:
                            print(company1_name,company2_name)
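
The loops above only print the company pairs that share a sentence. To get closer to the output asked for in the question (one status per sentence, written to a new CSV), here is a minimal sketch. It assumes sentences are separated by '.', and that a sentence counts as "matching" when the identical text (after stripping whitespace) appears anywhere in the second file. The helper names split_sentences, sentence_status and write_status_csv are my own, made up for illustration.

```python
import csv
import io

file1 = """company1,"sent1. sent2. sent3"
company2,"sent4. sent5. sent6"
company3,"sent7. sent8. sent9"
"""

file2 = """companyA,"sent1. sent20. sent3"
companyB,"sent40. sent5."
companyC,"sent5. sent1. sent60"
"""


def split_sentences(description):
    """Split a description on '.' and drop surrounding whitespace and empty pieces."""
    return [s.strip() for s in description.split('.') if s.strip()]


def sentence_status(profile_text, data_text):
    """Return (company, sentence, status) tuples for every sentence in profile_text."""
    # Collect every sentence from the second file into a set for fast lookup.
    known = set()
    for row in csv.reader(io.StringIO(data_text)):
        if len(row) >= 2:
            known.update(split_sentences(row[1]))

    rows = []
    for row in csv.reader(io.StringIO(profile_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        for sent in split_sentences(row[1]):
            status = 'matching' if sent in known else 'not matching'
            rows.append((row[0], sent, status))
    return rows


def write_status_csv(rows, path):
    """Write the status rows to a CSV file with a header line (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for writer files.
    with open(path, 'w', newline='', encoding='utf-8') as results:
        writer = csv.writer(results)
        writer.writerow(['company_name', 'job_description', 'status'])
        writer.writerows(rows)


rows = sentence_status(file1, file2)
for company, sent, status in rows:
    print(company, '-', sent, '-', status)
```

With your real files you would read profile.csv and data.csv with csv.reader directly instead of io.StringIO, then call something like write_status_csv(rows, 'outputText-mining.csv'). Note that exact string comparison is strict: differences in case or punctuation will show up as "not matching", so you may want to normalize the sentences first.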