Scraping IMDb reviews: only the first 25 reviews are extracted

Problem description

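I am trying to extract all the reviews for the movie Spider-Man: Homecoming, but I only get the first 25. IMDb initially shows only the first 25 reviews, and I am able to click "Load More" until all of them are displayed, but for some reason I cannot scrape all of the reviews once they have loaded. Does anyone know what I am doing wrong?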

Below is the code I am running:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")

#Go to the IMDb reviews page
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")

#Loop load more button
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source,'lxml')
    except Exception:
        break


#Scrape IMDb reviews
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content,"html.parser")
all = soup.find(id="main")

#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')

#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n","") for t in title_rev]
print('getting title of reviews and saving into a list')

#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')

#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,"Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')

print(title)
print(review)

Solution

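Actually, there is no need for Selenium here. Your script calls requests.get on the same URL after the clicking loop, which downloads a fresh copy of the page containing only the initial 25 reviews. The data is available by sending GET requests to the site's API in the following format: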

https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY

You must supply a value for paginationKey in the URL (...&paginationKey=MY-KEY).
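As a minimal sketch of how such a request URL could be assembled, the snippet below builds it with the standard library's urllib.parse.urlencode. The names BASE_URL and build_reviews_url are illustrative assumptions, not part of IMDb's interface; the query parameters are the ones in the URL shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above; BASE_URL is a hypothetical name.
BASE_URL = "https://www.imdb.com/title/tt6320628/reviews/_ajax"


def build_reviews_url(pagination_key=""):
    """Return the reviews URL for one page (hypothetical helper).

    An empty key asks for the first page of reviews.
    """
    query = urlencode({"ref_": "undefined", "paginationKey": pagination_key})
    return f"{BASE_URL}?{query}"


print(build_reviews_url("MY-KEY"))
```

Using urlencode instead of plain string formatting means a key containing reserved characters would still be escaped correctly.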

The key is stored in the data-key attribute of the element with class load-more-data:

<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
            </div>
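To illustrate, here is a small sketch that reads the key out of the fragment above with BeautifulSoup. The variable names are illustrative, and the fragment is used as offline sample input; a real response may not contain the element at all, which is why the sketch checks for None before indexing.

```python
from bs4 import BeautifulSoup

# The HTML fragment shown above, used here as offline sample input.
html = (
    '<div class="load-more-data" '
    'data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
load_more = soup.find("div", class_="load-more-data")

# find() returns None when no such element exists, so guard before indexing.
key = load_more["data-key"] if load_more is not None else None
print(key)
```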

So, to collect all the reviews into a DataFrame, try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [],"review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content,"html.parser")

    # Collect the reviews on this page first, so a final page without a key is still scraped
    for title,review in zip(
        soup.find_all(class_="title"),soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

    # Find the pagination key; when it is absent there are no more pages
    pagination_key = soup.find("div",class_="load-more-data")
    if not pagination_key:
        break

    # Update the `key` variable in order to request the next page
    key = pagination_key["data-key"]

df = pd.DataFrame(data)
print(df)

Output (truncated):

                                                title                                             review
0                              Terrific entertainment  Spiderman: Far from Home is not intended to be...
1         THe illusion of the identity of Spider man.  Great story in continuation of spider man home...
2                       What Happened to the Bad Guys  I believe that Quinten Beck/Mysterio got what ...
3                                         Spectacular  One of the best if not the best Spider-Man mov...

...
...
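The question also wrote its results to a CSV file. Below is a minimal sketch of that last step, using two placeholder rows rather than scraped data. The row contents and the in-memory buffer are assumptions for illustration; to write to disk, pass a file name such as the question's 'Spiderman_Reviews.csv' instead of the buffer.

```python
import io

import pandas as pd

# Placeholder rows standing in for the scraped `data` dictionary.
data = {
    "title": ["Example title 1", "Example title 2"],
    "review": ["Example review text.", "Another example, with a comma."],
}
df = pd.DataFrame(data)

# index=False leaves out the row numbers; a StringIO buffer keeps the demo in memory.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain commas are quoted by to_csv, so review text survives the round trip through a CSV reader.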
