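Problem description
I am trying to write a program that takes a list of keywords and prints the sentences containing those words from a website the user enters. Right now my output prints a lot of extra material, such as symbols; I want it to print only the sentence for each occurrence. How can I do that? The code so far: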
# Import packages
import requests
from bs4 import BeautifulSoup
import urllib.request as ul
url = input('Enter URL:')
req = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(req.content, "lxml")
words = ["technology", "wireless"]
for word in words:
    print(word, soup.find(text=lambda text: text and word in text))
Solution
You can use NLTK (the Natural Language Toolkit, http://www.nltk.org/) to split the text into sentences. Install it with:
pip install nltk
Then run the following lines in Python to download the Punkt sentence tokenizer data:
import nltk.data
nltk.download('punkt')
The code then looks like this:
import requests
from bs4 import BeautifulSoup
import nltk.data
words = ["technology", "wireless", "people"]
url = 'https://marketbusinessnews.com/financial-glossary/wireless-technology/'
reg = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(reg.content, "lxml")
# get rid of unwanted tags
for unwanted_tag in soup(['script', 'style', 'head', 'title', 'meta']):
    unwanted_tag.decompose()
# get the text from the soup
texts = " ".join(soup.stripped_strings)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
for word in words:
    print("######", word, "######")
    for text in tokenizer.tokenize(texts):
        if word in text.lower():  # cast to lower to make search case insensitive
            print(text)
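This solution is not perfect: joining the stripped strings with a space can leave spaces where you do not expect them, but the alternative is having no spaces where you do expect them.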
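One more thing to be aware of: the `word in text.lower()` check is a plain substring match, so "wireless" would also match a sentence that only contains "wirelessly". If that matters, a word-boundary regular expression is one option. The following is a minimal sketch, assuming the sentences have already been produced by the tokenizer; the helper name `sentences_with_keyword` and the sample sentences are made up for illustration.

```python
import re

def sentences_with_keyword(sentences, word):
    """Return the sentences that contain `word` as a whole word, ignoring case."""
    # \b anchors the match at word boundaries; re.escape guards against
    # keywords that happen to contain regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Hypothetical sample sentences standing in for tokenizer.tokenize(texts).
sentences = [
    "Wireless technology is everywhere.",
    "The device charges wirelessly.",
    "People rely on technology.",
]

for word in ["technology", "wireless"]:
    print("######", word, "######")
    for sentence in sentences_with_keyword(sentences, word):
        print(sentence)
```

With this check, the second sample sentence is not reported for "wireless", whereas the substring test would report it.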
It is better to select the <p> tags, each of which defines a paragraph.
for word in words:
    for p in soup.select('p'):
        if word in p.text:
            print(word, p.text)
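To try the paragraph approach without fetching a live page, here is a small self-contained sketch. It assumes an inline HTML string (hypothetical sample content), uses the built-in "html.parser" backend instead of lxml, and adds a case-insensitive comparison that the snippet above does not have. The helper name `paragraphs_with_keyword` is made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a user-supplied URL.
html = """
<html><head><title>Wireless demo</title></head>
<body>
<p>Wireless <b>technology</b> is everywhere.</p>
<p>People use it daily.</p>
<script>var technology = 1;</script>
</body></html>
"""

def paragraphs_with_keyword(html_text, word):
    """Return the text of every <p> element that contains `word`, ignoring case."""
    soup = BeautifulSoup(html_text, "html.parser")
    matches = []
    for p in soup.select("p"):
        # Collapse runs of whitespace so line breaks inside a paragraph
        # do not show up in the printed text.
        text = " ".join(p.get_text().split())
        if word.lower() in text.lower():
            matches.append(text)
    return matches

for word in ["technology", "wireless"]:
    for paragraph in paragraphs_with_keyword(html, word):
        print(word, "->", paragraph)
```

Because only <p> elements are inspected, the script body and the page title are ignored even though they contain the keywords.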