BeautifulSoup返回“无”

问题描述

我正在尝试使用BeautifulSoup从G2网站获得评论列表。但是,由于某种原因,当我运行下面的代码时,它说'reviews'是'NoneType'。我无法弄清楚,因为它清楚地在网站的HTML中显示了类名(请参见下图)。我已经使用这种确切的语法从其他站点进行webscrape,并且有效,所以我不知道为什么它返回NoneType。我尝试使用'find_all'并返回列表的长度(评论数),但这也显示了nonetype。我很困惑。请帮忙!

response = requests.get('https://www.g2.com/products/mailchimp/reviews?filters%5Bcomment_answer_values%5D=&order=most_recent&page=1')
text = BeautifulSoup(response.text,'html.parser')


num_reviews = 500

reviews = text.find('div',attrs={'class': 'paper paper--white paper--box mb-2 position-relative border-bottom '})
print(reviews)

Attached div screenshot

解决方法

您需要将标头传递给HTTP请求。它会检测到您不是浏览器,如果您打印出可变文本,则会看到它。

您解析的HTML

...
<h1>Pardon Our Interruption...</h1>
<p>
                        As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
                    </p>
<ul>
<li>You're a power user moving through this website with super-human speed.</li>
<li>You've disabled JavaScript and/or cookies in your web browser.</li>
<li>A third-party browser plugin,such as Ghostery or NoScript,is preventing JavaScript from running. Additional information is available in this <a href="http://ds.tl/help-third-party-plugins" target="_blank" title="Third party browser plugins that block javascript">support article</a>.</li>
...

因此,传递标题就足以模仿浏览器的活动。 抢头

代码示例

import requests

headers = {
    'authority': 'www.g2.com','cache-control': 'max-age=0','upgrade-insecure-requests': '1','user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/84.0.4147.105 Safari/537.36','accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9','sec-fetch-site': 'none','sec-fetch-mode': 'navigate','sec-fetch-user': '?1','sec-fetch-dest': 'document','accept-language': 'en-US,en;q=0.9','cookie': '__cfduid=df6514ad701b86146978bf17180a5e6f01597144255; events_distinct_id=822bbff7-912d-4a5e-bd80-4364690b2e06; amplitude_session=1597144258387; _g2_session_id=424bfbe09b254b1a9484f50b70c3381c; reese84=3:BJ8QXTaIa+brQNrbReKzww==:n5v0tg/Q590u2q44+xAi7rnSO1i2Kn7Lp1Ar+2SCMJF5HiBJNqLVR3IPzPF0qIqgxpWjZ9veyhywY4JNSbBOtz5sJOwEecGJE9tT+NInof+vlP3hKTb6bqA3cvAf6cfDIrtEmhI0Dsjoe3ct3NtwvvcA9p8FXHPR7PAFP42nWqAAfDH88vj0hQwWlIjio/fT4g5iDsT1qZH3alC8ZbUhOURKNk9JUz2sBz+RjgkRyctO0VTGzjxmHCd2r40WJqWjVDwRmBl+/msW+/V0PW93vjFs45bMD63D5Q4JeRreBxkAN9ufIajaV0MmkYbxlFnwIZ3cEBHi/X76n+PvAobd5/UgCwgUIvt/P4pl7NEcDWR/ORaZ8gLPl4HbuQaRhEVd23Ez5OBnYFP1wjqLT/ECDkRzQq0Nn8U6qVbMO25Hp6U=:/JrPeXs0AKDQw5FlG3vKQX1dPIsF/TEXTLgQ+mktyAo=; ue-event-segment-983a43a0-1c10-4dfb-96d7-60049c0dcd62=W1siL3VzZXJzL2NvbnNlbnQvc2VsZWN0ZWQiLHsiY29uc2VudF90eXBlIjoi%0AY29va2llcyIsImdyYW50ZWQiOiJ0cnVlIn0sIjk4M2E0M2EwLTFjMTAtNGRm%0AYi05NmQ3LTYwMDQ5YzBkY2Q2MiIsIlVzZXIgQ29uc2VudCBTZWxlY3RlZCIs%0AWyJhbXBsaXR1ZGUiXV1d%0A','if-none-match': 'W/"3658e5098c91c183288fd70e6cfd9028"',}

response = requests.get('https://www.g2.com/products/mailchimp/reviews',headers=headers)

text = BeautifulSoup(response.text,'html.parser')


num_reviews = 500

reviews = text.select('div[class*="paper paper--white paper--box"]')

print(len(reviews))

输出

25 

解释

有时为了发出HTTP请求,必须传递标头,用户代理,cookie,参数。您可以尝试一下,我必须承认我很懒,只是发送了整个标题。本质上,您正在尝试通过使用请求包来模仿浏览器请求。有时,它在检测漫游器方面会更加细微。

在这里,我检查了页面并转到了网络工具。有一个名为doc的标签。然后,通过右键单击请求并单击“ COPY curl(bash)”,我复制了该请求。正如我说的那样,我很懒,所以我将其粘贴到curl.trillworks.com中,它将转换为漂亮的python格式以及请求的样板。

我已经稍微修改了您的脚本,因为这是一个很长的属性

CSS选择器div[class*=""]捕获您指定的类""的任何元素。

相关问答

依赖报错 idea导入项目后依赖报错,解决方案:https://blog....
错误1:代码生成器依赖和mybatis依赖冲突 启动项目时报错如下...
错误1:gradle项目控制台输出为乱码 # 解决方案:https://bl...
错误还原:在查询的过程中,传入的workType为0时,该条件不起...
报错如下,gcc版本太低 ^ server.c:5346:31: 错误:‘struct...