newspaper3k: find the author name in the visible text after the first "by"

Problem description

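Newspaper3k is a good Python library for news content extraction, and it works well for the most part. I want to extract the name that follows the first "by" in the visible text. Here is my code; it does not work properly. Could someone please help?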

import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10 
html1='https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(html1.strip(),config=config)
article.download()
article.parse()
soup = BeautifulSoup(article)
## I want to take only visible text
[s.extract() for s in soup(['style','script','[document]','head','title'])]
visible_text = soup.getText()
for line in visible_text:
    # Capture one-or-more words after first (By or by) the initial match
    match = re.search(r'By (\S+)',line)

    # Did we find a match?
    if match:
        # Yes,process it to print 
        By = match.group(1)
        print('By {}'.format(By))

Solution

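This is not a comprehensive answer, but it is one you can build on. You will need to extend this code as you add other sources. As I said before, my Newspaper3k overview document has many extraction examples, so please review it carefully.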

Regular expressions should be a last resort, used only after trying these extraction methods with newspaper3k:

article authors
meta tags
JSON
soup (BeautifulSoup)

from newspaper import Config
from newspaper import Article
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = [
    'https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
    'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
    'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
    'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water-quality',
    'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021',
]

for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()
    authors = article.authors
    if authors:
        # newspaper3k found the author(s) on its own
        print(authors)
    else:
        # Fall back to the byline markup in the raw HTML
        soup = BeautifulSoup(article.html, 'html.parser')
        byline = soup.find(True, {'class': ['td-post-author-name', 'byline']})
        # Guard against a missing byline instead of swallowing AttributeError
        author_tag = byline.find(['a', 'span']) if byline else None
        if author_tag:
            print(author_tag.get_text().replace('By', '').strip())
        else:
            print('no author found')
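
The loop above covers the article authors and soup steps from the list. As a rough sketch of the two remaining ideas, the JSON step and the last-resort regular expression could look something like the following. Both helper names are hypothetical and not part of newspaper3k. The first helper assumes the page embeds schema.org JSON-LD with a top-level "author" field, which many sites do not (nested "@graph" layouts are not handled). The second assumes the name is written as capitalized words on the same line as "By".

```python
import json
import re

# Matches the body of each <script type="application/ld+json"> element.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

# "By" or "by" as a whole word, then one to three capitalized words on one line.
BYLINE_RE = re.compile(r"\b[Bb]y\s+([A-Z][\w'-]*(?:[ \t]+[A-Z][\w'-]*){0,2})")


def authors_from_json_ld(html):
    """Hypothetical helper: collect author names from JSON-LD blocks in raw HTML."""
    names = []
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON rather than fail the whole page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            author = item.get('author', [])
            # "author" may be a string, an object, or a list of either
            for entry in author if isinstance(author, list) else [author]:
                name = entry.get('name') if isinstance(entry, dict) else entry
                if isinstance(name, str) and name.strip():
                    names.append(name.strip())
    return names


def author_from_text(text):
    """Hypothetical last resort: name after the first byline-like "By", or None."""
    match = BYLINE_RE.search(text)
    return match.group(1) if match else None
```

With a parsed article, these could be tried in order after article.authors comes back empty, for example authors_from_json_ld(article.html) first and author_from_text(article.text) only if that also returns nothing. The regular expression is a heuristic: it skips phrases such as "stand by me" because the following word is not capitalized, but it will miss names with initials like "J. K." and can be fooled by any capitalized phrase after "by".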
