Problem description
I have two dataframes, df1 and df2, that share several columns. A summary of each dataframe is shown below.
df1
ps_number ps_name cons_type cons_number province cons_type_num district ps_name_clean
0 1 Government Girls Primary School Comboo (Male) NA 1 KPK NA-1 peshawar-i government girls primary school comboo male
1 2 Government Girls Middle School Sarbaland Pura ... NA 1 KPK NA-1 peshawar-i government girls middle school sarbaland pura ...
2 3 Government Primary School No.1 Sarbaland Pura ... NA 1 KPK NA-1 peshawar-i government primary school no1 sarbaland pura m...
3 4 Government Girls Primary School Sarbaland Pura... NA 1 KPK NA-1 peshawar-i government girls primary school sarbaland pura...
4 5 Government Primary School Pahari Pura (Male) PS-1 NA 1 KPK NA-1 peshawar-i government primary school pahari pura male ps 1
... ... ... ... ... ... ... ... ...
68346 304 Govt Girls Meddal School Musi Wali (Female) (P) NA 72 PUNJAB NA-72 mianwali-ii govt girls meddal school musi wali female p
68347 305 Govt Boys High School Musi Wali Part-II (P) NA 72 PUNJAB NA-72 mianwali-ii govt boys high school musi wali part ii p
68348 306 Govt GirlsMeddal School Tibba Mehr Ban Shah (P) NA 72 PUNJAB NA-72 mianwali-ii govt girlsmeddal school tibba mehr ban shah p
68349 307 Govt Boys Pramiery School Murid Abbas Shah (P) NA 72 PUNJAB NA-72 mianwali-ii govt boys pramiery school murid abbas shah p
68350 308 Total Votes recorded on postal ballot for the ... NA 72 PUNJAB NA-72 mianwali-ii total Votes recorded on postal ballot for the ...
df2
ps_name ps_number province cons_number type lat lng description district ps_name_stripped ps_name_clean
0 Govt; High School (GHS) Arrandu 1 KPK 1 Combined 35.31008 71.55092 Polling Station: 1 - Govt; High School (GHS) A... chitral Govt; High School (GHS) Arrandu govt high school ghs arrandu
1 Govt Girls Primary School (GGPS) Arrandu 2 KPK 1 Combined 35.31202 71.55158 Polling Station: 2 - Govt Girls Primary School... chitral Govt Girls Primary School (GGPS) Arrandu govt girls primary school ggps arrandu
2 Govt: Primary School Marakabat (PS No.1) 3 KPK 1 Combined 35.29211 71.56821 Polling Station: 3 - Govt: Primary School Mara... chitral Govt: Primary School Marakabat (PS No.1) govt primary school marakabat ps no1
3 Govt; Primary School Akroy 4 KPK 1 Combined 35.34268 71.61663 Polling Station: 4 - Govt; Primary School Akro... chitral Govt; Primary School Akroy govt primary school akroy
4 Govt Primary School Langoorbat 5 KPK 1 Combined 35.32656 71.59922 Polling Station: 5 - Govt Primary School Lango... chitral Govt Primary School Langoorbat govt primary school langoorbat
... ... ... ... ... ... ... ... ... ... ... ...
46814 Govt: Primary School Basool Ormara. (Comb) 358 BALOCHISTAN 272 Combined 25.43741 64.38839 Polling Station: 358 - Govt: Primary School Ba... lasbela-cum-gwadar Govt: Primary School Basool Ormara. (Comb) govt primary school basool ormara comb
46815 Govt: Primary School Kahordan (Comb) 359 BALOCHISTAN 272 Combined 25.42091 64.41979 Polling Station: 359 - Govt: Primary School Ka... lasbela-cum-gwadar Govt: Primary School Kahordan (Comb) govt primary school kahordan comb
46816 Govt: Primary School Thussak (Comb) 360 BALOCHISTAN 272 Combined 25.45371 64.48357 Polling Station: 360 - Govt: Primary School Th... lasbela-cum-gwadar Govt: Primary School Thussak (Comb) govt primary school thussak comb
46817 Govt:Boys Meddle School Chill Hud Ormara. (Comb) 364 BALOCHISTAN 272 Combined 25.24679 64.60839 Polling Station: 364 - Govt:Boys Meddle School... lasbela-cum-gwadar Govt:Boys Meddle School Chill Hud Ormara. (Comb) govtboys meddle school chill hud ormara comb
46818 Govt: Boys Primary School Hud (Comb) 365 BALOCHISTAN 272 Combined 25.28109 64.64944 Polling Station: 365 - Govt: Boys Primary Sch... lasbela-cum-gwadar Govt: Boys Primary School Hud (Comb) govt boys primary school hud comb
I am trying to string-match ps_name_clean between the two dataframes while blocking on district.
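The problem is that the two dataframes do not share exactly the same set of districts: df1 has districts that df2 lacks, and vice versa. So my question is: during string matching, what happens to the rows in df1 whose district does not appear in df2? Are they still compared against all rows from other districts?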
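As a rough illustration of this question, the sketch below emulates blocking with a plain pandas inner merge on district. It is a stand-in, not recordlinkage itself: it assumes that blocking behaves like an equality join on the key, and the toy frames `left` and `right` and the helper `block_pairs` are hypothetical. Under that assumption, a row whose district has no counterpart on the other side produces no candidate pairs at all.

```python
import pandas as pd

# Hypothetical toy data; the district values are made up for illustration.
left = pd.DataFrame({
    "district": ["peshawar-i", "mianwali-ii", "only-in-left"],
    "ps_name_clean": ["govt primary school a", "govt boys high school b", "govt school c"],
})
right = pd.DataFrame({
    "district": ["peshawar-i", "peshawar-i", "only-in-right"],
    "ps_name_clean": ["govt primary school a", "govt girls school d", "govt school e"],
})

def block_pairs(a, b, key):
    """Emulate blocking: pair index labels of a and b whose `key` values are equal."""
    a_keys = a[[key]].reset_index().rename(columns={"index": "left_id"})
    b_keys = b[[key]].reset_index().rename(columns={"index": "right_id"})
    merged = a_keys.merge(b_keys, on=key, how="inner")
    return pd.MultiIndex.from_frame(merged[["left_id", "right_id"]])

pairs = block_pairs(left, right, "district")

# Rows of `left` that take part in no candidate pair
unpaired = left.index.difference(pairs.get_level_values("left_id"))
print(list(pairs))
print(list(unpaired))
```

In this toy case only the peshawar-i row of `left` receives candidates; the other two rows are never compared with anything.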
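After running the recordlinkage code below, I checked manually what happened to these rows, and they do seem to have been matched, somehow, with rows that are not in the same district (although the matches are inaccurate). I don't understand why, because I blocked on district, which as far as I understand means that only rows from the same district are compared with each other.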
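import recordlinkage
# Build candidate pairs: only rows with equal 'district' values are paired
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)
# Compare names; the feature is 1.0 when similarity >= 0.75, otherwise 0.0
c = recordlinkage.Compare()
c.string('ps_name_clean', 'ps_name_clean', method='damerau_levenshtein', threshold=0.75)
c_vectors = c.compute(candidate_links, df1, df2)
# Keep matching pairs; level_0 / level_1 hold the index labels of df1 / df2
matches = c_vectors[c_vectors[0] == 1.0].reset_index()
# drop_duplicates returns a new frame, so the result must be assigned
matches = matches.drop_duplicates(subset=['level_0'], keep='last')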
Can someone explain what is happening here?
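One way to investigate is to look up the district on each side of every matched pair and list the pairs where the two differ. The sketch below is only a diagnostic under assumptions: `matches` is taken to have the columns level_0 and level_1 holding index labels of df1 and df2 (as after reset_index), the tiny frames are hypothetical stand-ins for the real data, and the helper `cross_district_pairs` is not part of any library. Because those columns hold labels rather than positions, the lookup uses .loc; inspecting rows by position on a frame whose index is not 0..n-1 would show the wrong rows. That is one possible cause of apparent cross-district matches, not a confirmed one.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2 with non-default index labels.
df1 = pd.DataFrame({"district": ["chitral", "peshawar-i"]}, index=[10, 11])
df2 = pd.DataFrame({"district": ["peshawar-i", "chitral"]}, index=[20, 21])

# Hypothetical matched pairs, shaped like `matches` after reset_index().
matches = pd.DataFrame({"level_0": [10, 11], "level_1": [21, 20]})

def cross_district_pairs(matches, left, right, key="district"):
    """Return the matched pairs whose `key` differs between the two frames."""
    # Label-based lookup: level_0 / level_1 are index labels, not positions.
    left_key = left.loc[matches["level_0"].to_numpy(), key].to_numpy()
    right_key = right.loc[matches["level_1"].to_numpy(), key].to_numpy()
    out = matches.assign(left_district=left_key, right_district=right_key)
    return out[out["left_district"] != out["right_district"]]

# Label-based lookup is only unambiguous when index labels are unique.
assert df1.index.is_unique and df2.index.is_unique

bad = cross_district_pairs(matches, df1, df2)
print(len(bad))
```

If the returned frame is empty on the real data, blocking held for every matched pair and the discrepancy more likely lies in how the rows were inspected afterwards.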