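Extracting images from HTML with Newspaper

Problem description

I cannot download the article the usual way, by instantiating an Article object, as shown below: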

from newspaper import Article
url = 'http://fox13Now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.top_image

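However, I can get the HTML with requests. Can I take this raw HTML and somehow pass it to Newspaper to extract the image from it? (My attempt is below, but it does not work.) Thanks.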

from newspaper import Article
import requests
url = 'http://fox13Now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url,verify=False,proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image

Solution

The Python module Newspaper supports proxies, but this feature is not listed in the module's documentation.


Newspaper with a proxy

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number",
}

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url,config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg

Requests with a proxy, then Newspaper

import requests
from newspaper import Article

# same placeholder proxy settings as in the previous example
proxy = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number",
}

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
# pass the already-fetched HTML to Newspaper instead of letting it download
article.download(raw_html.content)
article.parse()
print(article.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
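If Newspaper is unavailable, or you only need the lead image from HTML you already hold, a fallback is to read the page's og:image meta tag yourself. The sketch below uses only the standard library and does not reproduce Newspaper's own heuristics. The names OgImageParser and og_image_from_html are made up for illustration, and it assumes the page declares an og:image tag, which not every site does.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:image tag is kept; later ones are ignored.
        if tag != "meta" or self.og_image is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.og_image = attributes.get("content")


def og_image_from_html(html):
    """Return the og:image URL found in an HTML string, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


if __name__ == "__main__":
    # Tiny hand-written page standing in for a real response body.
    sample = (
        '<html><head>'
        '<meta property="og:image" content="https://example.com/lead.jpg">'
        '</head><body><p>Story text</p></body></html>'
    )
    print(og_image_from_html(sample))
```

With a requests response, you would pass the decoded string (for example raw_html.text) rather than the response object itself.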

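First, make sure you are using python3 and have already run pip3 install newspaper3k.

Then, if you get an SSL warning with your first version (like the one below):

/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(

you can disable these warnings by adding: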

import urllib3
urllib3.disable_warnings()
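Calling urllib3.disable_warnings() silences the warning for the whole process. If you would rather silence it only around a single call, the standard warnings module can scope the filter to a with-block. This is a minimal sketch: suppressed and noisy_call are hypothetical names, and noisy_call merely stands in for a request that emits a warning.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def suppressed(category=Warning):
    """Ignore warnings of `category` only while the with-block runs."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category)
        yield


def noisy_call():
    # Stand-in for a request that emits a warning (hypothetical helper).
    warnings.warn("unverified HTTPS request", UserWarning)
    return "done"


if __name__ == "__main__":
    with suppressed(UserWarning):
        print(noisy_call())
```

In real code you would presumably pass urllib3.exceptions.InsecureRequestWarning as the category and wrap the download call. Note that warnings.catch_warnings changes global filter state, so it is not suited to multi-threaded code.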

This should work:

from newspaper import Article
import urllib3
urllib3.disable_warnings()


url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)

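Run it with python3 <yourfile>.py


Setting the HTML on the Article yourself will not do you much good, because you then get no help from Newspaper with the other fields. Let me know whether this solves the problem, or whether any other errors pop up!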
