python-比较DF中两列的(Sub)字符串

我有一个DF,如下所示:

DF =
id  token      argument1             argument2 
1   Tza        Tuvia Tza             Moscow  
2   perugia    umbria                perugia    
3   associated the associated press  Nelson

我现在想比较argumentX和token列的值,并相应地为新列ARG选择值.

DF =
id  token      argument1             argument2    ARG
1   Tza        Tuvia Tza             Moscow       ARG1
2   perugia    umbria                perugia      ARG2
3   associated the associated press  Nelson       ARG1

这是我尝试过的:

conditions = [
(DF["token"] == (DF["Argument1"])),
 DF["token"] == (DF["Argument2"])]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=nan)

这只会比较整个String,如果匹配则匹配. .isin,.contains等结构或使用诸如DF [“ ARG_cat”] = DF.apply(lambda row:row [‘token’] in row [‘argument2’],axis = 1)之类的辅助列无效.有任何想法吗?

解决方法:

str.contains与正则表达式一起使用-通过|连接令牌中的所有值用于正则表达式或具有单词边界的检查子字符串:

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in DF["token"])
conditions = [ DF["argument1"].str.contains(pat), DF["argument2"].str.contains(pat)]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id       token            argument1 argument2   ARG
0   1         Tza            Tuvia Tza    Moscow  ARG1
1   2     perugia               umbria   perugia  ARG2
2   3  associated  the associated ress    Nelson  ARG1

编辑:

如果要比较每一行:

d = {'id': [1, 2, 3], 
     'token': ["Tza","perugia","israel"], 
     "argument1": ["Tuvia Tza","umbria","Tuvia Tza"], 
     "argument2": ["israel","perugia","israel"]} 
DF = pd.DataFrame(data=d) 
print (DF)
   id    token  argument1 argument2
0   1      Tza  Tuvia Tza    israel
1   2  perugia     umbria   perugia
2   3   israel  Tuvia Tza    israel

conditions = [[x[0] in x[1] for x in zip(DF['token'], DF['argument1'])], 
              [x[0] in x[1] for x in zip(DF['token'], DF['argument2'])]]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id    token  argument1 argument2   ARG
0   1      Tza  Tuvia Tza    israel  ARG1
1   2  perugia     umbria   perugia  ARG2
2   3   israel  Tuvia Tza    israel  ARG2

相关文章

转载:一文讲述Pandas库的数据读取、数据获取、数据拼接、数...
Pandas是一个开源的第三方Python库,从Numpy和Matplotlib的基...
整体流程登录天池在线编程环境导入pandas和xrld操作EXCEL文件...
 一、numpy小结             二、pandas2.1为...
1、时间偏移DateOffset对象DateOffset类似于时间差Timedelta...
1、pandas内置样式空值高亮highlight_null最大最小值高亮背景...