Python - Check whether keywords from a list occur as whole words in a string and return the keyword found

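Problem description

I haven't found a solution aimed at this specific idea, so here is my question.

I have a list of keywords that I want to match against strings scraped from a website. The list is stored in its own Python file, "Keywords", and looks like this: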

keywords = [
    "FDA","Contract","Vaccine","Efficacy","SARS","COVID-19","Cancer","Exclusive","Explosive","Hydrogen","Positive","Phase"
]

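The file is imported, and I can access the list with Keywords.keywords.

#1 Matching keywords against a string:

I want to check whether the scraped string article_title = item.select_one('h3 small').find_next_sibling(text=True).strip() contains one of these keywords. If it does, I want to scrape more content (I already have the code for that). Otherwise, I go back to the start of the for loop and check the next title.

Here are some example values of the string article_title: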

Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery
Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine
Statement from Judy R. McReynolds on Signing of HR1319,the American Rescue Plan Act of 2021

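What is the best way to match the keyword list against a string while matching whole words only? I found several approaches on SO, but each seems to have flaws that other people point out, which confuses me.

#2 Storing the found keyword in a variable:

When a keyword matches, I store article_title and some other variables in a database. I also want to store the keyword that led to the entry, so that I can see how often each keyword is found. The variable holding the found keyword should be called article_keyword. Is there a way not only to match the keywords against the string but also to store the keyword that was found? If so, I would appreciate help on how to do this.

If the information provided is not enough, let me know in a comment and I will add the full code. I left it out only to keep the question short.

Solution

Here is one approach using regex: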

import re

keywords = [
    "FDA","Contract","Vaccine","Efficacy","SARS","COVID-19","Cancer","Exclusive","Explosive","Hydrogen","Positive","Phase"
]

titles = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
    "Statement from Judy R. McReynolds on Signing of HR1319,the American Rescue Plan Act of 2021",
]

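# Whole words only; re.escape guards against keywords that contain regex metacharacters
pattern = '|'.join(rf"\b{re.escape(k)}\b" for k in keywords)
matches = {k: 0 for k in keywords}  # keyword -> number of times it was found
for title in titles:
    for match in re.findall(pattern, title):
        matches[match] += 1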
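
For part #2 of the question, the same regex idea can return the keyword itself, so it can be stored as article_keyword. The following is a minimal sketch, not a definitive implementation: the helper name find_keyword is hypothetical, matching is case-sensitive, and "first" means the keyword that appears leftmost in the title.

```python
import re

keywords = [
    "FDA", "Contract", "Vaccine", "Efficacy", "SARS", "COVID-19",
    "Cancer", "Exclusive", "Explosive", "Hydrogen", "Positive", "Phase",
]

# One alternation of all keywords, wrapped in a single pair of word boundaries.
keyword_re = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
)


def find_keyword(title):
    """Return the leftmost keyword found in title as a whole word, or None."""
    match = keyword_re.search(title)
    return match.group(0) if match else None


titles = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
]

for article_title in titles:
    article_keyword = find_keyword(article_title)
    if article_keyword is None:
        continue  # no keyword in this title: move on to the next one
    # At this point article_title and article_keyword could be written to the database.
    print(article_keyword, "->", article_title)
```

With the two sample titles, only the first one reaches the print call, with article_keyword set to COVID-19. Compiling the pattern once outside the loop avoids rebuilding it for every scraped title.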

You can also loop over the list and use the 'in' operator to check whether each keyword is present in the string:

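strings = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
    "Statement from Judy R. McReynolds on Signing of HR1319,the American Rescue Plan Act of 2021",
]

keywords = [
    "FDA","Contract","Vaccine","Efficacy","SARS","COVID-19","Cancer","Exclusive","Explosive","Hydrogen","Positive","Phase"
]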

article_keywords = {}

for string in strings:
    for word in keywords:
        if word in string:
            article_keywords[string] = word
            break

print(article_keywords)

In the dictionary (article_keywords), the keys are the strings and the values are the first keyword found.
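
One caveat with the 'in' operator is that it tests for substrings, so "Phase" is also found inside "Phased". Since the question asks for whole words only, here is a hedged sketch of a whole-word variant of the same loop. The helper name whole_word_keywords and the shortened keyword list are assumptions made for illustration.

```python
import re


def whole_word_keywords(text, keywords):
    """Return every keyword that occurs in text as a whole word, in list order."""
    return [k for k in keywords if re.search(rf"\b{re.escape(k)}\b", text)]


keywords = ["FDA", "Phase", "COVID-19"]  # shortened list, for illustration only

# The plain substring test reports a hit inside a longer word.
print("Phase" in "Phased rollout begins")  # True

# The word-boundary test does not.
print(whole_word_keywords("Phased rollout begins", keywords))  # []
print(whole_word_keywords("FDA clears Phase 3 COVID-19 trial", keywords))
# ['FDA', 'Phase', 'COVID-19']
```

Returning a list rather than only the first hit keeps every matching keyword available, which helps if each keyword's count should be tracked separately.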