Python: problems matching various keywords from a dictionary

Problem description

I have a complex text that I am categorizing against different keywords stored in a dictionary:

    text = 'data-ls-static="1">Making Bio Implants,Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'

    sector = {"med tech": ['Drug Delivery','3D printing','medicine','medical technology','bio cell']}

The following finds my keywords and categorizes them, but with some limitations:

    import re

    pattern = r'[a-zA-Z0-9]+'

    [cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]

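The limitations I cannot solve are:

Keywords separated by a space, such as "Drug Delivery", are not identified and therefore not categorized.
I cannot make the pattern case-insensitive, so words such as MEDICINE are not recognized. I tried adding (?i) to the pattern, but it does not work.
The categorized keywords go into a pandas df, but they are printed inside []. I tried looping over the result again to pull them out, but the brackets are still there.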

Data to pandas df:

    ind_list = []
    for site in url_list:
        ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
        ind_list.append(ind)

    websites['Indication'] = ind_list

Current output:

   Website       Sector               Sub-sector  Therapeutical Area           Focus  URL status
0  url3.com      [med tech]           []          []                           []     []
1  www.url1.com  [med tech,services]  []          [oncology,gastroenterology]  []     []
2  www.url2.com  [med tech,services]  []          [orthopedy]                  []     []

In the output I get [], which I want to avoid.

Can you help me solve these problems?

Thanks!

Solution

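findall is very wasteful here, because you break the string into tokens again for every keyword.

If you want to test whether a keyword is in the string: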

[cat for cat in sector if any(re.search(word,text,re.I) for word in sector[cat])]
# Output: ['med tech']
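
One caveat with passing keywords straight to re.search is that each keyword is treated as a regex pattern. The sketch below wraps the same test in a helper (the name categories_in is made up for illustration) and assumes the keywords are meant as literal text, so it escapes them first:

```python
import re

text = 'data-ls-static="1">Making Bio Implants,Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

def categories_in(text, mapping):
    """Return every category that has at least one keyword somewhere in text."""
    return [
        cat for cat, words in mapping.items()
        # re.escape keeps characters such as '+' or '.' in a keyword literal;
        # re.I makes the search case-insensitive.
        if any(re.search(re.escape(word), text, re.I) for word in words)
    ]

print(categories_in(text, sector))  # ['med tech']
```

Without re.escape, a keyword such as 'C++' would raise re.error because '++' is not a valid quantifier sequence.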

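Here are some hints for the issues that are easy to spot:

Why can't keywords separated by a space, such as "Drug Delivery", be matched? Because the regex pattern r'[a-zA-Z0-9]+' does not match spaces. If you also want to match a space, you can change it to r'[a-zA-Z0-9 ]+' (add a space after the 9). If you want to support other kinds of whitespace (e.g. \t, \n), the pattern needs further changes.
Why is case-insensitive matching not supported? Your snippet any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have exactly the same case, because `in` compares the strings returned by re.findall with the strings in sector[cat] for equality. Setting flags=re.I in the re.findall() call cannot get around this either. I suggest converting both sides to the same case before checking, for example to lowercase: any(x.lower() in re.findall(pattern,text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.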

With the above 2 changes, you should be able to capture some of the categorizing keywords.
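
To see what the two changes do and do not catch, it may help to look at the tokens findall actually returns for the question's text. This is only a sketch; the variable names words and runs are made up for illustration:

```python
import re

text = 'data-ls-static="1">Making Bio Implants,Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'

# Original pattern on lowercased text: every token is a single word.
words = re.findall(r'[a-zA-Z0-9]+', text.lower())
print('medicine' in words)        # True: MEDICINE was lowercased
print('drug delivery' in words)   # False: 'drug' and 'delivery' are separate tokens

# Pattern with a space added: tokens become whole runs of words.
runs = re.findall(r'[a-zA-Z0-9 ]+', text.lower())
print(runs)
# ['data', 'ls', 'static', '1', 'making bio implants',
#  'drug delivery and 3d printing in medicine', 'medicine', 'h3']
print('drug delivery' in runs)    # False: it is only part of a longer token
```

Because `in` on a list tests whole tokens for equality, a multi-word keyword is found only when it makes up a complete token by itself.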

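In fact, for this particular case you may not need regex or re.findall at all. You can simply check, for example, sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows: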

[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Edit

Test run with a 2-word phrase:

text = 'drug delivery'
sector = {"med tech": ['Drug Delivery','3D printing','medicine','medical technology','bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Output:       # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']

text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Output:    # Correctly doesn't match with extra words in between

[]
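
The brackets in the DataFrame come from storing Python lists in the column. One possible way around that, sketched here under the assumption that a comma-separated string per row is acceptable, is to join each list before assigning it. The names pages and categorize are hypothetical; pages stands in for the text scraped from each site in the question's loop:

```python
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

# Hypothetical stand-in for the text scraped from each site.
pages = {
    'url3.com': 'Drug delivery platform',
    'www.url1.com': 'Accounting services',
}

def categorize(text, mapping):
    """Case-insensitive substring test, as in the list comprehension above."""
    lowered = text.lower()
    return [cat for cat, words in mapping.items()
            if any(word.lower() in lowered for word in words)]

# Join each list into one string; an empty list becomes an empty string.
ind_list = [', '.join(categorize(site_text, sector)) for site_text in pages.values()]
print(ind_list)  # ['med tech', '']
```

Assigning ind_list to a DataFrame column as in the question would then store plain strings, so no [] is printed.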

Could you try an approach other than regex?
When you have two similar words to match, I would suggest difflib.
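
A minimal sketch of what the difflib suggestion could look like, using difflib.get_close_matches from the standard library. The helper name closest_keyword and the cutoff of 0.8 are assumptions chosen for illustration, not values from the question:

```python
import difflib

keywords = ['drug delivery', '3d printing', 'medicine', 'medical technology', 'bio cell']

def closest_keyword(candidate, keywords, cutoff=0.8):
    """Return the keyword most similar to candidate, or None if none is close enough."""
    matches = difflib.get_close_matches(candidate.lower(), keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_keyword('Medicin', keywords))       # medicine
print(closest_keyword('drug delivry', keywords))  # drug delivery
print(closest_keyword('finance', keywords))       # None
```

This compares whole strings by similarity, so it suits misspelled or slightly varying keywords rather than finding a keyword inside a long text.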
