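Blocking in record linkage with Python

Problem description

I have two dataframes, df1 and df2, that share several columns. A summary of each dataframe is shown below.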

df1

ps_number   ps_name cons_type   cons_number province    cons_type_num   district    ps_name_clean
0   1   Government Girls Primary School Comboo (Male)   NA  1   KPK NA-1    peshawar-i  government girls primary school comboo male
1   2   Government Girls Middle School Sarbaland Pura ...   NA  1   KPK NA-1    peshawar-i  government girls middle school sarbaland pura ...
2   3   Government Primary School No.1 Sarbaland Pura ...   NA  1   KPK NA-1    peshawar-i  government primary school no1 sarbaland pura m...
3   4   Government Girls Primary School Sarbaland Pura...   NA  1   KPK NA-1    peshawar-i  government girls primary school sarbaland pura...
4   5   Government Primary School Pahari Pura (Male) PS-1   NA  1   KPK NA-1    peshawar-i  government primary school pahari pura male ps 1
... ... ... ... ... ... ... ... ...
68346   304 Govt Girls Meddal School Musi Wali (Female) (P) NA  72  PUNJAB  NA-72   mianwali-ii govt girls meddal school musi wali female p
68347   305 Govt Boys High School Musi Wali Part-II (P) NA  72  PUNJAB  NA-72   mianwali-ii govt boys high school musi wali part ii p
68348   306 Govt GirlsMeddal School Tibba Mehr Ban Shah (P) NA  72  PUNJAB  NA-72   mianwali-ii govt girlsmeddal school tibba mehr ban shah p
68349   307 Govt Boys Pramiery School Murid Abbas Shah (P)  NA  72  PUNJAB  NA-72   mianwali-ii govt boys pramiery school murid abbas shah p
68350   308 Total Votes recorded on postal ballot for the ...   NA  72  PUNJAB  NA-72   mianwali-ii total Votes recorded on postal ballot for the ...

df2
    ps_name ps_number   province    cons_number type    lat lng description district    ps_name_stripped    ps_name_clean
0   Govt; High School (GHS) Arrandu 1   KPK 1   Combined    35.31008    71.55092    Polling Station: 1 - Govt; High School (GHS) A...   chitral Govt; High School (GHS) Arrandu govt high school ghs arrandu
1   Govt Girls Primary School (GGPS) Arrandu    2   KPK 1   Combined    35.31202    71.55158    Polling Station: 2 - Govt Girls Primary School...   chitral Govt Girls Primary School (GGPS) Arrandu    govt girls primary school ggps arrandu
2   Govt: Primary School Marakabat (PS No.1)    3   KPK 1   Combined    35.29211    71.56821    Polling Station: 3 - Govt: Primary School Mara...   chitral Govt: Primary School Marakabat (PS No.1)    govt primary school marakabat ps no1
3   Govt; Primary School Akroy  4   KPK 1   Combined    35.34268    71.61663    Polling Station: 4 - Govt; Primary School Akro...   chitral Govt; Primary School Akroy  govt primary school akroy
4   Govt Primary School Langoorbat  5   KPK 1   Combined    35.32656    71.59922    Polling Station: 5 - Govt Primary School Lango...   chitral Govt Primary School Langoorbat  govt primary school langoorbat
... ... ... ... ... ... ... ... ... ... ... ...
46814   Govt: Primary School Basool Ormara. (Comb)  358 BALOCHISTAN 272 Combined    25.43741    64.38839    Polling Station: 358 - Govt: Primary School Ba...   lasbela-cum-gwadar  Govt: Primary School Basool Ormara. (Comb)  govt primary school basool ormara comb
46815   Govt: Primary School Kahordan (Comb)    359 BALOCHISTAN 272 Combined    25.42091    64.41979    Polling Station: 359 - Govt: Primary School Ka...   lasbela-cum-gwadar  Govt: Primary School Kahordan (Comb)    govt primary school kahordan comb
46816   Govt: Primary School Thussak (Comb) 360 BALOCHISTAN 272 Combined    25.45371    64.48357    Polling Station: 360 - Govt: Primary School Th...   lasbela-cum-gwadar  Govt: Primary School Thussak (Comb) govt primary school thussak comb
46817   Govt:Boys Meddle School Chill Hud Ormara. (Comb)    364 BALOCHISTAN 272 Combined    25.24679    64.60839    Polling Station: 364 - Govt:Boys Meddle School...   lasbela-cum-gwadar  Govt:Boys Meddle School Chill Hud Ormara. (Comb)    govtboys meddle school chill hud ormara comb
46818   Govt: Boys Primary School Hud (Comb)    365 BALOCHISTAN 272 Combined    25.28109    64.64944    Polling Station: 365 - Govt: Boys Primary Sch...    lasbela-cum-gwadar  Govt: Boys Primary School Hud (Comb)    govt boys primary school hud comb

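I am trying to string-match the ps_name_clean columns of the two dataframes while blocking on district.

The problem is that the two dataframes do not share exactly the same districts: df1 has districts that df2 lacks, and vice versa. My question is what happens, during string matching, to the rows of df1 whose district does not appear in df2. Are they still compared against all rows from the other districts?

After running the code below, I manually checked what happened to these rows, and they do seem to have been matched, inaccurately, with rows that are not in the same district. I don't understand why, because I blocked on district, which as far as I know means that only rows in the same district are compared with each other.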
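To make the question concrete, here is a minimal sketch of what exact-match blocking does conceptually. It emulates blocking with a plain pandas inner join, which is an assumption about the idea behind blocking and not recordlinkage's own code. The frames `left` and `right` and their rows are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; these rows are hypothetical.
left = pd.DataFrame(
    {
        "district": ["peshawar-i", "mianwali-ii", "chitral"],
        "ps_name_clean": ["school a", "school b", "school c"],
    }
)
right = pd.DataFrame(
    {
        "district": ["chitral", "chitral", "lasbela-cum-gwadar"],
        "ps_name_clean": ["school c", "school d", "school e"],
    }
)

# Districts that appear in only one of the two frames.
only_left = set(left["district"]) - set(right["district"])
only_right = set(right["district"]) - set(left["district"])

# Emulate exact blocking: an inner join keeps only pairs with equal keys.
pairs = left.reset_index().merge(
    right.reset_index(), on="district", suffixes=("_1", "_2")
)[["index_1", "index_2", "district"]]

# Rows of `left` that ended up in at least one candidate pair.
paired_left_rows = set(pairs["index_1"])

print(only_left, only_right)
print(pairs)
```

Under this emulation, a row whose district has no counterpart in the other frame produces no candidate pair at all, so it is never compared with anything.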

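import recordlinkage

# Build candidate pairs only between rows whose 'district' values are equal.
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# Score each candidate pair: 1.0 if the name similarity is >= 0.75, else 0.0.
c = recordlinkage.Compare()
c.string('ps_name_clean', 'ps_name_clean', method='damerau_levenshtein', threshold=0.75)
c_vectors = c.compute(candidate_links, df1, df2)

# Keep matching pairs; level_0 / level_1 hold the index labels of df1 / df2.
matches = c_vectors[c_vectors[0] == 1.0].reset_index()
# drop_duplicates returns a new frame, so the result has to be assigned.
matches = matches.drop_duplicates(subset=['level_0'], keep='last')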
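The string comparison above turns a similarity score into 1 or 0 using the threshold. The pure-Python sketch below illustrates why a threshold of 0.75 can accept names of different schools. It assumes the score is normalised as 1 - distance / (length of the longer string), which is an assumption about the library's behaviour, and it uses the optimal string alignment variant of Damerau-Levenshtein, which may differ slightly from the library's implementation. The helper names `osa_distance`, `similarity` and `is_match` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


def is_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Binarise the similarity the way a threshold comparison would."""
    return similarity(a, b) >= threshold


# Two different schools taken from the df2 sample in the question.
name_a = "govt primary school kahordan comb"
name_b = "govt primary school thussak comb"
print(similarity(name_a, name_b), is_match(name_a, name_b))
```

The two names differ only in one word, and the shared words "govt primary school" and "comb" are enough to lift the score above 0.75. Under these assumptions, a higher threshold, or removing common words before comparing, would reduce such false matches.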

Can someone explain what is going on here?

Solution

No working solution to this problem has been found yet.
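One way to narrow this down is to check directly whether any candidate pair crosses a district boundary. The sketch below is a hedged diagnostic, not a confirmed fix. The helper `cross_district_pairs` is hypothetical. It assumes the pairs are a pandas MultiIndex of index labels (first level for the first frame, second level for the second) and that both frames have unique index labels.

```python
import pandas as pd


def cross_district_pairs(
    links: pd.MultiIndex, a: pd.DataFrame, b: pd.DataFrame, key: str = "district"
) -> pd.MultiIndex:
    """Return the pairs whose `key` values differ between the two frames."""
    if not (a.index.is_unique and b.index.is_unique):
        raise ValueError("index labels must be unique to identify rows")
    # Look rows up by label (.loc), not by position (.iloc).
    left_keys = a.loc[links.get_level_values(0), key].to_numpy()
    right_keys = b.loc[links.get_level_values(1), key].to_numpy()
    return links[left_keys != right_keys]


# Toy data with non-default index labels; the values are made up.
frame_a = pd.DataFrame({"district": ["x", "y"]}, index=[10, 20])
frame_b = pd.DataFrame({"district": ["x", "z"]}, index=[0, 1])
toy_links = pd.MultiIndex.from_tuples([(10, 0), (20, 1)])

bad_pairs = cross_district_pairs(toy_links, frame_a, frame_b)
print(list(bad_pairs))
```

In the question's setting the call would be `cross_district_pairs(candidate_links, df1, df2)`. If the result is empty, blocking itself behaved as expected, and the apparent cross-district matches may come from how the rows were inspected afterwards, for example looking rows up by position when the pairs hold index labels, or from duplicate index labels in either frame.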

dataframe pandas python record-linkage string-matching