Python record linkage confusion matrix

Problem description

I am trying to generate a confusion matrix for an ECM model I have just fitted with the recordlinkage package. Since it is difficult to obtain training data without manually labeling records, I decided to use the unsupervised ECM model. The problem I have run into is that I cannot produce the "true" matches that the documentation at https://recordlinkage.readthedocs.io/en/latest/ref-evaluation.html says I need.

The best example of using this package that I have found is here: https://github.com/J535D165/recordlinkage/blob/master/examples/unsupervised_learning_prob.py#L24

Here is the code I have been running:

import pandas as pd
import recordlinkage as rl
import numpy as np
import re
import string
#from nltk.corpus import stopwords
from collections import Counter
from recordlinkage.index import Block

# load supplemental frame
supp_df = pd.read_csv(path,sep = ",",header = [0])
supp_df.head(10)

# load UI file
ui_df = pd.read_csv(path,header = [0])
ui_df.head(10)

ui_df = ui_df.rename(columns = {'name_primary': 'company',
                                'address_street': 'address',
                                'address_city': 'city',
                                'address_state': 'state',
                                'zip_5': 'zip'})
#ui_df_copy = pd.DataFrame.copy(ui_df,deep = True)

# load file of abbreviations
abbrev_df = pd.read_csv(path,header = [0])
# convert all column names and columns to lowercase strings
abbrev_df.columns = abbrev_df.columns.str.lower()
abbrev_df = abbrev_df.apply(lambda x: x.astype(str).str.lower())

ui_df.columns = ui_df.columns.str.lower()
ui_df = ui_df.apply(lambda x: x.astype(str).str.lower())

supp_df.columns = supp_df.columns.str.lower()
supp_df = supp_df.apply(lambda x: x.astype(str).str.lower())

# Now remove punctuation, special characters and extra white space
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation,'')
    return text
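
# A functionally equivalent variant using str.translate, which is usually
# faster on large frames -- just a sketch of an alternative, not something
# recordlinkage requires:
punct_table = str.maketrans('', '', string.punctuation)
def remove_punctuation_fast(text):
    return text.translate(punct_table)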

# remove punctuation
ui_df[["company","name_secondary","address","city","state"]] = ui_df[["company","state"]].applymap(remove_punctuation)
supp_df[["company","state"]] = supp_df[["company","state"]].applymap(remove_punctuation)

# We will also need to remove stop words. This list is from the python
# package nltk, but was manually pasted due to issues with the download.
# It's been supplemented with a smaller list from the R version with
# 'company' etc.

# Find the most common words in ui_df and supp_df and remove them.
common_words = Counter(" ".join(ui_df["company"]).split()).most_common(100)
common_words = pd.DataFrame(common_words,columns = ['word','frequency'])
print(common_words)

# inspect common_words and remove most common but least specific
stop = ['inc','company','co','corporation','corp','incorporated','llc','llp','ltd','and','the']
# remove stopwords
ui_df[['company','name_secondary','address','city','state']] = ui_df[['company','name_secondary','address','city','state']].applymap(lambda x: ' '.join([word for word in x.split() if word not in stop]))
supp_df[['company','state']] = supp_df[['company','state']].applymap(lambda x: ' '.join([word for word in x.split() if word not in stop]))


# Now remove the white space between single characters,
# e.g. spaced-out initials: 'a b c corp' -> 'abc corp'
rx = r'(?<=\b[^\W\d_])\s+(?=[^\W\d_]\b)'
ui_df = ui_df.replace(rx,'',regex = True)
supp_df = supp_df.replace(rx,'',regex = True)

# Now substitute usps abbreviations for street addresses to make them uniform

abbrev_df['common_abbrev'] = abbrev_df['common_abbrev'].str.strip()
abbrev_df['usps_abbrev'] = abbrev_df['usps_abbrev'].str.strip()

# anchor each abbreviation at word boundaries so that replace() cannot
# rewrite the inside of longer words (re is imported above)
d = {r'\b' + re.escape(k) + r'\b': v
     for k, v in zip(abbrev_df.common_abbrev, abbrev_df.usps_abbrev)}

ui_df['address'] = ui_df['address'].replace(d, regex = True)
supp_df['address'] = supp_df['address'].replace(d, regex = True)

# Create 'block' variables of the first character of company name
ui_df['block'] = ui_df['company'].astype(str).str[0]
supp_df['block'] = supp_df['company'].astype(str).str[0]

# create blocking index
indexer = Block(left_on = 'block', right_on = 'block')
candidate_pairs = indexer.index(ui_df,supp_df)
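
# Sanity check (a sketch): blocking on the first letter should shrink the
# search space well below the full cross product of the two frames
print(len(candidate_pairs), 'candidate pairs out of',
      len(ui_df) * len(supp_df), 'possible pairs')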

# Now we need to compare our pairs
comp = rl.Compare()

comp.string('company','company',method = 'jarowinkler',threshold = 0.85,label = 'company')
comp.string('address','address',threshold = 0.85,label = 'address')
comp.string('city','city',label = 'city')
comp.string('state','state',label = 'state')
comp.string('zip','zip',label = 'zip')

# return dataframe with feature vectors:

comp_df = comp.compute(candidate_pairs,ui_df,supp_df)
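
# Quick sanity check (a sketch): the row sum of each comparison vector is a
# rough overall agreement score for that candidate pair
print(comp_df.sum(axis = 1).describe())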

# Now we must classify our record pairs as matches or not. To do that,
# first binarize comp_df so every feature is 0 or 1 (features built with a
# threshold above are already binary)
comp_df = (comp_df >= 0.5).astype(int)


ecm = rl.ECMClassifier()
# fit_predict fits the unsupervised ECM model and returns the predicted
# matches as a pandas MultiIndex, so a separate predict() call is not needed
links_pred = ecm.fit_predict(comp_df)
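
# Optional sketch: inspect what the unsupervised model learned. Recent
# recordlinkage versions expose the match-class prior and the per-feature
# m- and u-probabilities on the fitted classifier (attribute names taken
# from the classifier API docs; treat them as an assumption on old versions)
print('P(match) prior:', ecm.p)
print('m probabilities:', ecm.m_probs)
print('u probabilities:', ecm.u_probs)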

Now I try to evaluate the model, and this is where I run into problems:

cm = rl.confusion_matrix(links_pred,total = len(comp_df))

What happens is that I get the error

`TypeError: confusion_matrix() missing 1 required positional argument: 'links_pred'`

But I have already created links_pred, so according to the documentation what I am missing is links_true. My questions: how do I obtain the "true" matches if I have no labeled data at all? And why would I need to know the true match status of the records in order to use an unsupervised learning algorithm?
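For reference, the evaluation API is documented as recordlinkage.confusion_matrix(links_true, links_pred, total=None): the true links come first and the predicted links second, both as pandas MultiIndex objects, which is why passing only links_pred raises the TypeError above. A minimal sketch of the intended call, assuming a small hand-labeled sample (the index pairs below are purely hypothetical):

# hypothetical hand-labeled true matches: (ui_df index, supp_df index) pairs
links_true = pd.MultiIndex.from_tuples([(0, 12), (3, 7), (5, 41)])

cm = rl.confusion_matrix(links_true, links_pred, total = len(comp_df))
print(cm)
print('precision:', rl.precision(links_true, links_pred))
print('recall:', rl.recall(links_true, links_pred))

A common approach is to hand-label a small random sample of candidate pairs just for evaluation, even when the training itself is unsupervised.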

Any advice would be greatly appreciated.
