find_all in BeautifulSoup returns an empty ResultSet

Problem description

I am trying to scrape data from a website, but find_all() returns an empty set. How can I fix this?

#importing required modules

import requests,bs4

#sending request to the server

req = requests.get("https://www.udemy.com/courses/search/?q=python")

# checking the status on the request

print(req.status_code)
req.raise_for_status()

#converting using BeautifulSoup

soup = bs4.BeautifulSoup(req.text,'html.parser')

#Trying to scrape the particular div with the class but returning 0

container = soup.find_all('div',class_='popover--popover--t3rNO popover--popover-hover--14ngr')

#trying to print the number of container returned.
print(len(container))

Output:

200
0

Solution

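See my comment: the content is entirely JavaScript-driven. Modern websites often use JavaScript to make HTTP requests to the server and fetch data on demand. You can check this by disabling JavaScript, which is easy to do in Chrome under More settings while you inspect the page. With JavaScript disabled, you will see that no text is available on this site. As you point out, this is probably quite different from IMDb. If you check the HTML that BeautifulSoup parsed, you will find that it contains none of the page content that is generated by JavaScript.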
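One way to confirm this diagnosis is to count how many elements in the raw HTML carry the class you are searching for. The sketch below uses only the standard library. The `ClassCounter` helper and the two sample HTML strings are made up for illustration and are not Udemy's actual markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains every wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_elements_with_classes(html, wanted):
    parser = ClassCounter(wanted)
    parser.feed(html)
    return parser.count


# Hypothetical samples: a server-rendered page and an empty JavaScript shell
static_html = '<div class="card popover"><p>Course</p></div>'
js_shell_html = '<div id="root"></div><script src="app.js"></script>'

print(count_elements_with_classes(static_html, "popover"))
print(count_elements_with_classes(js_shell_html, "popover"))
```

If the same check on `req.text` from the question returns 0, the element is most likely added by JavaScript after the page loads, so no selector will find it in the downloaded HTML.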

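There are two ways to get data from a JavaScript-rendered website:

  1. Mimic the HTTP requests made to the server
  2. Use a browser automation package such as Selenium

The first approach is better and more efficient; the second is more brittle and does not scale well to larger data sets.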

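Fortunately, Udemy serves the data you want from an API endpoint: the page's JavaScript makes HTTP requests to that endpoint, and the responses are returned to the browser.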

Code example

import requests

cookies = {
    '__udmy_2_v57r': '4f711b308da548b49394854a189d3179','ud_firstvisit': '2020-05-29T13:48:56.584511+00:00:1jefNY:9F1BJVEUJpv7gmNPgYNini76UaE','existing_user': 'true','optimizelyEndUserId': 'oeu1590760136407r0.2130390415126655','EUCookieMessageShown': 'true','_ga': 'GA1.2.1359933509.1590760142','_pxvid': '26d89ed1-a1b3-11ea-9179-cb750fa4136b','_ym_uid': '1585144165890161851','_ym_d': '1590760145','__ssid': 'd191bc02a1063fd2c75fbab525ededc','stc111655': 'env:1592304425%7C20200717104705%7C20200616111705%7C1%7C1014616:20210616104705|uid:1590760145861.374775813.04725504.111655.1839745362:20210616104705|srchist:1069270%3A1%3A20200629134905%7C1014624%3A1592252104%3A20200716201504%7C1014616%3A1592304425%3A20200717104705:20210616104705|tsa:0:20200616111705','ki_t': '1590760146239%3B1592304425954%3B1592304425954%3B3%3B5','ki_r': 'aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8%3D','IR_PI': '00aea1e6-9da9-11ea-af3a-42010a24660a%7C1592390825988','_gac_UA-12366301-1': '1.1592304441.CjwKCAjw26H3BRB2EiwAy32zhfcltNEr_HHFK5JRaJar5qxUn4ifG9FVFctWyTUXigNZvKeOCz7PgxoCAfAQAvD_BwE','csrftoken': 'pPOdtdbH0HPaHvDfAZMzEOdvWqKZuQWufu8dUrEeXuy5mOOrnFRbWZ9vq8Dfd2ts','__cfruid': 'f1963d736e3891a2e307ebc9f918c89065ffe40f-1596962093','__cfduid': 'df4d951c87bc195c73b2f12b5e29568381597085850','ud_cache_price_country': 'GB','ud_cache_device': 'desktop','ud_cache_language': 'en','ud_cache_logged_in': '0','ud_cache_release': '0804b40d37e001f97dfa','ud_cache_modern_browser': '1','ud_cache_marketplace_country': 'GB','ud_cache_brand': 'GBen_US','ud_cache_version': '1','ud_cache_user': '','seen': '1','eventing_session_id': '66otW5O9TQWd5BYq1_etrA-1597087737933','ud_cache_campaign_code': '','exaff': '%7B%22start_date%22%3A%222020-08-09T08%3A52%3A04.083577Z%22%2C%22code%22%3A%22_7fFXpljNdk-m3_OJPaWBwAQc5gVKutaSg%22%2C%22merchant_id%22%3A39197%2C%22aff_type%22%3A%22LS%22%2C%22aff_id%22%3A60680%7D:1k5D3W:2PemPLTm4xaHixBYRvRyBaAukL4','evi': 
'SlFfLh4RBzwTSVBjXFdHehNJUGMYQE99HVFdIExYQ3gARVY8QkAWIEEDCXsVQEd0BEsJexVAA24LQgdjGANXdgZBG3ETH1luRBdHKBoHV3ZKURl5XVBXdkpRXWNUU1luRxIJe1lTQXhMDgdjHRAFbgsICXNWVk1uCwgJN0xYRGATBUpjVFVEdAEOB2NcWkR+E0lQYxhAT30dUV0gTFhCfAhDVm1MUEJ0B1EROkwUV3YAXwk3D0BPewFAHzxCQEd0BUcJexVAA24LQgdjGANXdgZCHHETTld+BkUdY1QZVzoTSRptTBQUbgtFEnleHwhgEwBcY1QZV34HShtjVBlXOhNJE21MFBRuC0UceV4fWW4DSxh3TFgObkdREXBCQAMtE0kccFtUCGATQR54VkBPNxMFCXtfTlc6UFERd1tUTTEdURlzX1JXdkpRXWNUU1luRxIJe1tXQnpMXwlzVldDbgsICTdMWEdgEwVKY1RVRHUJDgdjXFdCdBNJUGMYQE99HVFdIExYQ3kCQ1Y8Ew==','ud_rule_vars': 'eJyFjkuOwyAQBa9isZ04agyYz1ksIYxxjOIRGmhPFlHuHvKVRrPItvWqus4EXT4EDJP9jSViyobPktKRgZqc4GrkmmmuBHdU6YlRqY1P6RgDMQ05D2SOueCDtZPDMNT7QDrooAXRdrqhzHBlRL8XUjPgXwAGYCC7ulpdRX3acglPA8bvPwbVgm6g4p0Bvqeyhsh_BkybXyxmN8_R21J9vvpcjm5cn7ZDTidc7G2xxnvlm87hZwvlU7wE2VP1en0hlyuoG10j:1k5D3W:nxRv-tyLU7lxhsF2jRYvkJA53uM',}

headers = {
    'authority': 'www.udemy.com','x-udemy-cache-release': '0804b40d37e001f97dfa','x-udemy-cache-language': 'en','x-udemy-cache-user': '','x-udemy-cache-modern-browser': '1','user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/84.0.4147.105 Safari/537.36','accept': 'application/json,text/plain,*/*','x-udemy-cache-brand': 'GBen_US','x-udemy-cache-version': '1','x-requested-with': 'XMLHttpRequest','x-udemy-cache-logged-in': '0','x-udemy-cache-price-country': 'GB','x-udemy-cache-device': 'desktop','x-udemy-cache-marketplace-country': 'GB','x-udemy-cache-campaign-code': '','sec-fetch-site': 'same-origin','sec-fetch-mode': 'cors','sec-fetch-dest': 'empty','referer': 'https://www.udemy.com/courses/search/?q=python','accept-language': 'en-US,en;q=0.9',}

params = (
    ('q','python'),('skip_price','false'),)

response = requests.get('https://www.udemy.com/api-2.0/search-courses/',headers=headers,params=params,cookies=cookies)

# Collect the fields we need from each course in the search response
ids = []
titles = []
durations = []
ratings = []
for course in response.json()['courses']:
    titles.append(course['title'])
    # estimated_content_length is in minutes; convert it to hours
    durations.append(int(course['estimated_content_length']) / 60)
    ratings.append(course['rating'])
    # Keep ids as strings: they are joined into a query parameter and used as keys below
    ids.append(str(course['id']))

# Second request: look up the prices for all course ids at once
clean_ids = ','.join(ids)
params2 = (
    ('course_ids', clean_ids),
    ('fields[pricing_result]', 'price,discount_price,list_price,price_detail,price_serve_tracking_id'),
)

response = requests.get('https://www.udemy.com/api-2.0/pricing/', params=params2)
pricing = response.json()['courses']
prices = [pricing[course_id]['price']['amount'] for course_id in ids]

# Combine the lists row by row and print each course
for row in zip(titles, durations, ratings, prices):
    print(row)

Output

('Learn Python Programming Masterclass',56.53333333333333,4.54487,14.99)
('The Python Mega Course: Build 10 Real World Applications',25.3,4.51476,16.99)
('Python for Beginners: Learn Python Programming (Python 3)',2.8833333333333333,4.4391,17.99)
('The Python Bible™ | Everything You Need to Program in Python',9.15,4.64238,17.99)
('Python for Absolute Beginners',3.066666666666667,4.42209,14.99)
('The Modern Python 3 Bootcamp',30.3,4.64714,16.99)
('Python for Finance: Investment Fundamentals & Data Analytics',8.25,4.52908,12.99)
('The Complete Python Course | Learn Python by Doing',35.31666666666667,4.58885,17.99)
('REST APIs with Flask and Python',17.033333333333335,4.61233,12.99)
('Python for Financial Analysis and Algorithmic Trading',16.916666666666668,4.53173,12.99)
('Python for Beginners with Examples',4.25,4.27316,12.99)
('Python OOP : Four Pillars of OOP in Python 3 for Beginners',2.6166666666666667,4.46451,12.99)
('Python Bootcamp 2020 Build 15 working Applications and Games',32.13333333333333,4.2519,14.99)
('The Complete Python Masterclass: Learn Python From Scratch',32.36666666666667,4.39151,16.99)
('Learn Python MADE EASY : A Concise Python Course in Python 3',2.1166666666666667,4.76601,12.99)
('Complete Python Web Course: Build 8 Python Web Apps',15.65,4.37577,13.99)
('Python for Excel: Use xlwings for Data Science and Finance',16.116666666666667,4.92293,12.99)
('Python 3 Network Programming - Build 5 Network Applications',12.216666666666667,4.66143,12.99)
('The Complete Python & PostgreSQL Developer Course',21.833333333333332,4.5664,12.99)
('The Complete Python Programmer Bootcamp 2020',13.233333333333333,4.63859,12.99)
    

Explanation

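There are two ways to do this; re-engineering the requests, as done here, is the more efficient solution. To get the necessary information, inspect the page and see which HTTP requests deliver which data. You can do this in the browser's developer tools under Network -> XHR. You will see two requests that supply the information. My advice is to select a request and preview its response on the right-hand side. The first request gives you the title, duration and rating; the second needs the course ids and returns the course prices.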

I usually copy the cURL command for the HTTP request that the JavaScript makes into curl.trillworks.com, which converts the necessary headers, parameters and cookies into Python.
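To give a feel for what such a converter does, here is a minimal sketch that pulls the URL, headers and cookies out of a copied cURL command. It is not the converter's actual implementation: `parse_curl` is a hypothetical helper, it only understands the `-H`/`--header` flag, and the sample command uses a placeholder URL.

```python
import shlex


def parse_curl(command):
    """Extract (url, headers, cookies) from a simple 'curl ... -H ...' command."""
    tokens = shlex.split(command)
    url, headers, cookies = None, {}, {}
    i = 1  # tokens[0] is 'curl'
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            name, value = name.strip().lower(), value.strip()
            if name == "cookie":
                # A Cookie header looks like 'a=1; b=2'
                for pair in value.split(";"):
                    key, sep, val = pair.strip().partition("=")
                    if sep:
                        cookies[key] = val
            else:
                headers[name] = value
            i += 2
        else:
            if token.startswith(("http://", "https://")):
                url = token
            i += 1  # every other flag is ignored in this sketch
    return url, headers, cookies


# Hypothetical command, shaped like what a browser's "Copy as cURL" produces
sample = (
    "curl 'https://example.com/api/search?q=python' "
    "-H 'Accept: application/json' -H 'Cookie: a=1; b=two' --compressed"
)
url, headers, cookies = parse_curl(sample)
print(url, headers, cookies)
```

The three results map directly onto the `url`, `headers=` and `cookies=` arguments of `requests.get`.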

The first request needs the headers, cookies and parameters. The second request only needs the parameters.

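The response you get is a JSON object, and response.json() converts it into a Python dictionary. You have to dig around in this dictionary to find what you want, but each item in response.json()['courses'] holds all the data for one "card" on the website. So we loop over that part of the dictionary. I would play around with response.json() until you get a feel for what the object contains and can follow the code.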
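When exploring an unfamiliar response like this, it can help to print an outline of its keys instead of the whole payload. The following is a small sketch; `outline` is a hypothetical helper, and the sample dictionary only imitates the few fields used in the code above, not the full API response.

```python
def outline(obj, prefix="", max_depth=3):
    """Return lines describing the keys and value types of nested JSON data."""
    lines = []
    if max_depth == 0:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{prefix}{key}: {type(value).__name__}")
            lines.extend(outline(value, prefix + "  ", max_depth - 1))
    elif isinstance(obj, list) and obj:
        # Use the first item as a representative of the whole list
        lines.extend(outline(obj[0], prefix + "  ", max_depth - 1))
    return lines


# Made-up stand-in for response.json()
sample = {
    "count": 1,
    "courses": [
        {"id": 1, "title": "Example course", "rating": 4.5,
         "estimated_content_length": 90},
    ],
}
print("\n".join(outline(sample)))
```

Calling `outline(response.json())` on the real response would show where fields such as `title` or `rating` are nested before you write the loop.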

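The duration is given in minutes, so I did a quick conversion to hours here. The id is also converted to a string, because in the second request the ids are passed as a parameter to fetch the course prices.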
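In isolation, the two conversions look roughly like this. The helper names and the sample values are illustrative only.

```python
def minutes_to_hours(minutes):
    """Convert a duration in minutes to hours."""
    return int(minutes) / 60


def join_ids(course_ids):
    """Turn course ids into the comma-separated string used as a query parameter."""
    return ",".join(str(course_id) for course_id in course_ids)


print(minutes_to_hours(90))   # 1.5
print(join_ids([101, 202]))   # 101,202
```

Converting each id with `str` matters twice: `','.join` only accepts strings, and the pricing response in the code above is indexed by the id as a string.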

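The second request then gives us the prices. Again, you have to dig through the dictionary object, and I suggest you confirm for yourself where the price is nested.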

The zipped data combines all the lists, and a for loop then prints each row. You could feed this into pandas if you wanted to.
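As a sketch of that last step, the zipped columns can be turned into a list of dictionaries, which is a convenient shape for writing a CSV or for building a table. The helper names and column labels below are assumptions; the two sample rows are taken from the output above.

```python
import csv
import io

FIELDS = ["title", "hours", "rating", "price"]


def rows_from_columns(titles, durations, ratings, prices):
    """Zip the parallel lists into one dictionary per course."""
    return [dict(zip(FIELDS, row)) for row in zip(titles, durations, ratings, prices)]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = rows_from_columns(
    ["Learn Python Programming Masterclass", "The Modern Python 3 Bootcamp"],
    [56.53333333333333, 30.3],
    [4.54487, 4.64714],
    [14.99, 16.99],
)
print(to_csv(rows))
```

The same `rows` list can be passed straight to `pandas.DataFrame(rows)` if pandas is installed.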


To get the required data, you need to send requests to the appropriate API. To do that, create a session:

import requests

s = requests.Session()
# Visit the home page first so the session picks up the site's cookies
cookies = s.get('https://www.udemy.com').cookies
headers = {"Referer": "https://www.udemy.com/courses/search/?q=python&skip_price=false"}

for page_counter in range(1, 500):
    # Request one page of search results
    search_params = {'p': page_counter, 'q': 'python', 'skip_price': 'false'}
    data = s.get('https://www.udemy.com/api-2.0/search-courses/', params=search_params, cookies=cookies, headers=headers).json()
    for course in data['courses']:
        course_id = str(course['id'])
        # Ask the pricing endpoint for this course's price
        params = {'course_ids': course_id, 'fields[pricing_result]': 'price'}
        price = s.get('https://www.udemy.com/api-2.0/pricing/', params=params, cookies=cookies).json()['courses'][course_id]['price']['amount']
        print({'title': course['title'], 'price': price})
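The loop above always asks for pages 1 to 499, whether or not the search has that many. A possible refinement, sketched here under the assumption that a page past the end comes back with an empty `courses` list (this has not been verified against the real API), is to stop at the first empty page. `iter_courses` and `fake_fetch` are hypothetical names; in real use `fetch_page` would wrap the `s.get(...).json()` call.

```python
def iter_courses(fetch_page, max_pages=500):
    """Yield courses page by page, stopping at the first page with no courses."""
    for page in range(1, max_pages + 1):
        courses = fetch_page(page).get("courses", [])
        if not courses:
            break
        yield from courses


# Stand-in for the API so the sketch runs offline: two pages of made-up data
fake_pages = {
    1: {"courses": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]},
    2: {"courses": [{"id": 3, "title": "C"}]},
}


def fake_fetch(page):
    return fake_pages.get(page, {"courses": []})


for course in iter_courses(fake_fetch):
    print(course["title"])
```

Passing the fetch function in as an argument keeps the stopping logic separate from the HTTP details, so it can be tried out without touching the network.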
