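How to scrape and extract the same specific information from multiple URLs in a list

Problem description

I want to scrape the genre and length (runtime) of each movie in a list of 250 movies. A list named "links" holds the URLs of those 250 movie pages. I wrote code that extracts the genre and length from a single URL taken from "links".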

links=['https://www.imdb.com/title/tt0093603/','https://www.imdb.com/title/tt8176054/','https://www.imdb.com/title/tt0367495/','https://www.imdb.com/title/tt0048473/','https://www.imdb.com/title/tt0079221/','https://www.imdb.com/title/tt7391996/','https://www.imdb.com/title/tt0052572/','https://www.imdb.com/title/tt0237376/','https://www.imdb.com/title/tt0214915/','https://www.imdb.com/title/tt5311546/','https://www.imdb.com/title/tt7019842/','https://www.imdb.com/title/tt0105575/','https://www.imdb.com/title/tt0400234/','https://www.imdb.com/title/tt8413338/','https://www.imdb.com/title/tt12361178/','https://www.imdb.com/title/tt4991384/','https://www.imdb.com/title/tt1187043/','https://www.imdb.com/title/tt8948790/','https://www.imdb.com/title/tt0986264/','https://www.imdb.com/title/tt10189514/','https://www.imdb.com/title/tt0101649/','https://www.imdb.com/title/tt5074352/','https://www.imdb.com/title/tt9477520/','https://www.imdb.com/title/tt7060344/','https://www.imdb.com/title/tt9900782/','https://www.imdb.com/title/tt0291855/','https://www.imdb.com/title/tt0048956/','https://www.imdb.com/title/tt0085743/','https://www.imdb.com/title/tt0050870/','https://www.imdb.com/title/tt7738784/','https://www.imdb.com/title/tt5959980/','https://www.imdb.com/title/tt0059246/','https://www.imdb.com/title/tt4987556/','https://www.imdb.com/title/tt0312859/','https://www.imdb.com/title/tt0072783/','https://www.imdb.com/title/tt0119385/','https://www.imdb.com/title/tt0292246/','https://www.imdb.com/title/tt10214826/','https://www.imdb.com/title/tt7019942/','https://www.imdb.com/title/tt3417422/','https://www.imdb.com/title/tt7465992/','https://www.imdb.com/title/tt5867800/','https://www.imdb.com/title/tt6148156/','https://www.imdb.com/title/tt8239946/','https://www.imdb.com/title/tt0466460/','https://www.imdb.com/title/tt0459516/','https://www.imdb.com/title/tt4679210/','https://www.imdb.com/title/tt0376127/','https://www.imdb.com/title/tt0066763/','https://www.imdb.com/title/tt
3973410/','https://www.imdb.com/title/tt3668162/','https://www.imdb.com/title/tt0220656/','https://www.imdb.com/title/tt6380520/','https://www.imdb.com/title/tt0195231/','https://www.imdb.com/title/tt8108198/','https://www.imdb.com/title/tt4429128/','https://www.imdb.com/title/tt2877108/','https://www.imdb.com/title/tt2181831/','https://www.imdb.com/title/tt3569782/','https://www.imdb.com/title/tt0376076/','https://www.imdb.com/title/tt1954470/','https://www.imdb.com/title/tt1620933/','https://www.imdb.com/title/tt5312232/','https://www.imdb.com/title/tt2356180/','https://www.imdb.com/title/tt0242519/','https://www.imdb.com/title/tt4934950/','https://www.imdb.com/title/tt0367110/','https://www.imdb.com/title/tt0073707/','https://www.imdb.com/title/tt2218988/','https://www.imdb.com/title/tt0871510/','https://www.imdb.com/title/tt0375611/','https://www.imdb.com/title/tt0104561/','https://www.imdb.com/title/tt0054098/','https://www.imdb.com/title/tt1562872/','https://www.imdb.com/title/tt4430212/','https://www.imdb.com/title/tt4851630/','https://www.imdb.com/title/tt5005684/','https://www.imdb.com/title/tt10324144/','https://www.imdb.com/title/tt1639426/','https://www.imdb.com/title/tt0057935/','https://www.imdb.com/title/tt7060460/','https://www.imdb.com/title/tt1280558/','https://www.imdb.com/title/tt3322420/','https://www.imdb.com/title/tt4635372/','https://www.imdb.com/title/tt0242256/','https://www.imdb.com/title/tt0200087/','https://www.imdb.com/title/tt0374887/','https://www.imdb.com/title/tt0139876/','https://www.imdb.com/title/tt0292490/','https://www.imdb.com/title/tt0105271/','https://www.imdb.com/title/tt9052870/','https://www.imdb.com/title/tt2283748/','https://www.imdb.com/title/tt0405508/','https://www.imdb.com/title/tt0364647/','https://www.imdb.com/title/tt0169102/','https://www.imdb.com/title/tt1821480/','https://www.imdb.com/title/tt0109117/','https://www.imdb.com/title/tt8291224/','https://www.imdb.com/title/tt2338151/','https://www.imdb.com/title/t
t2358592/','https://www.imdb.com/title/tt0453729/','https://www.imdb.com/title/tt0319736/','https://www.imdb.com/title/tt0843326/','https://www.imdb.com/title/tt2082197/','https://www.imdb.com/title/tt5571734/','https://www.imdb.com/title/tt0112553/','https://www.imdb.com/title/tt0379370/','https://www.imdb.com/title/tt8144834/','https://www.imdb.com/title/tt0488414/','https://www.imdb.com/title/tt0116630/','https://www.imdb.com/title/tt13299890/','https://www.imdb.com/title/tt0456144/','https://www.imdb.com/title/tt7822438/','https://www.imdb.com/title/tt5824826/','https://www.imdb.com/title/tt4849438/','https://www.imdb.com/title/tt0072860/','https://www.imdb.com/title/tt1695800/','https://www.imdb.com/title/tt2564144/','https://www.imdb.com/title/tt1261047/','https://www.imdb.com/title/tt0063404/','https://www.imdb.com/title/tt0471571/','https://www.imdb.com/title/tt7392212/','https://www.imdb.com/title/tt3390572/','https://www.imdb.com/title/tt0112870/','https://www.imdb.com/title/tt6315524/','https://www.imdb.com/title/tt5906392/','https://www.imdb.com/title/tt0213969/','https://www.imdb.com/title/tt2882328/','https://www.imdb.com/title/tt0050188/','https://www.imdb.com/title/tt1821317/','https://www.imdb.com/title/tt2377938/','https://www.imdb.com/title/tt7838252/','https://www.imdb.com/title/tt10919240/','https://www.imdb.com/title/tt1180583/','https://www.imdb.com/title/tt1773764/','https://www.imdb.com/title/tt3394420/','https://www.imdb.com/title/tt7725596/','https://www.imdb.com/title/tt2395469/','https://www.imdb.com/title/tt1327035/','https://www.imdb.com/title/tt3863552/','https://www.imdb.com/title/tt1649431/','https://www.imdb.com/title/tt0051792/','https://www.imdb.com/title/tt0220832/','https://www.imdb.com/title/tt1857670/','https://www.imdb.com/title/tt3614516/','https://www.imdb.com/title/tt7180544/','https://www.imdb.com/title/tt0296574/','https://www.imdb.com/title/tt7294534/','https://www.imdb.com/title/tt3449292/','https://www.imdb.com/title
/tt11581174/','https://www.imdb.com/title/tt2585562/','https://www.imdb.com/title/tt1188996/','https://www.imdb.com/title/tt5082014/','https://www.imdb.com/title/tt3124456/','https://www.imdb.com/title/tt8110330/','https://www.imdb.com/title/tt0347304/','https://www.imdb.com/title/tt1093370/','https://www.imdb.com/title/tt2924472/','https://www.imdb.com/title/tt1609168/','https://www.imdb.com/title/tt6167894/','https://www.imdb.com/title/tt0118751/','https://www.imdb.com/title/tt7485048/','https://www.imdb.com/title/tt2325915/','https://www.imdb.com/title/tt0375878/','https://www.imdb.com/title/tt1417299/','https://www.imdb.com/title/tt7218518/','https://www.imdb.com/title/tt0323013/','https://www.imdb.com/title/tt8108200/','https://www.imdb.com/title/tt2631186/','https://www.imdb.com/title/tt0455829/','https://www.imdb.com/title/tt0824316/','https://www.imdb.com/title/tt0222012/','https://www.imdb.com/title/tt11322920/','https://www.imdb.com/title/tt3848892/','https://www.imdb.com/title/tt10717738/','https://www.imdb.com/title/tt4387040/','https://www.imdb.com/title/tt5764096/','https://www.imdb.com/title/tt0366840/','https://www.imdb.com/title/tt2181931/','https://www.imdb.com/title/tt1517561/','https://www.imdb.com/title/tt0373856/','https://www.imdb.com/title/tt2926068/','https://www.imdb.com/title/tt2350496/','https://www.imdb.com/title/tt1077248/','https://www.imdb.com/title/tt0402014/','https://www.imdb.com/title/tt13206926/','https://www.imdb.com/title/tt8130968/','https://www.imdb.com/title/tt0816258/','https://www.imdb.com/title/tt6108090/','https://www.imdb.com/title/tt4169250/','https://www.imdb.com/title/tt0291376/','https://www.imdb.com/title/tt2317337/','https://www.imdb.com/title/tt0093578/','https://www.imdb.com/title/tt7098658/','https://www.imdb.com/title/tt4434004/','https://www.imdb.com/title/tt1907761/','https://www.imdb.com/title/tt7758160/','https://www.imdb.com/title/tt0077451/','https://www.imdb.com/title/tt4432480/','https://www.imdb.com/t
itle/tt1230165/','https://www.imdb.com/title/tt0420332/','https://www.imdb.com/title/tt3822396/','https://www.imdb.com/title/tt1851988/','https://www.imdb.com/title/tt5121000/','https://www.imdb.com/title/tt1288638/','https://www.imdb.com/title/tt0499375/','https://www.imdb.com/title/tt0431619/','https://www.imdb.com/title/tt2187153/','https://www.imdb.com/title/tt0196069/','https://www.imdb.com/title/tt2213054/','https://www.imdb.com/title/tt3801314/','https://www.imdb.com/title/tt1292703/','https://www.imdb.com/title/tt4981966/','https://www.imdb.com/title/tt1266583/','https://www.imdb.com/title/tt1839596/','https://www.imdb.com/title/tt0422320/','https://www.imdb.com/title/tt7998242/','https://www.imdb.com/title/tt2258337/','https://www.imdb.com/title/tt0110222/','https://www.imdb.com/title/tt0109555/','https://www.imdb.com/title/tt6484982/','https://www.imdb.com/title/tt4900716/','https://www.imdb.com/title/tt3320542/','https://www.imdb.com/title/tt7142506/','https://www.imdb.com/title/tt1241195/','https://www.imdb.com/title/tt8108268/','https://www.imdb.com/title/tt0150433/','https://www.imdb.com/title/tt2855648/','https://www.imdb.com/title/tt0098999/','https://www.imdb.com/title/tt0432047/','https://www.imdb.com/title/tt3447364/','https://www.imdb.com/title/tt1014672/','https://www.imdb.com/title/tt1926313/','https://www.imdb.com/title/tt5286444/','https://www.imdb.com/title/tt2980794/','https://www.imdb.com/title/tt8042292/','https://www.imdb.com/title/tt1447500/','https://www.imdb.com/title/tt0106333/','https://www.imdb.com/title/tt2140465/','https://www.imdb.com/title/tt0920464/','https://www.imdb.com/title/tt5310090/','https://www.imdb.com/title/tt7212754/','https://www.imdb.com/title/tt1324059/','https://www.imdb.com/title/tt3767372/','https://www.imdb.com/title/tt2375559/','https://www.imdb.com/title/tt6027478/','https://www.imdb.com/title/tt8590896/','https://www.imdb.com/title/tt0172684/','https://www.imdb.com/title/tt6206564/','https://www.imdb.com/t
itle/tt0449994/']

Now I have to do this for all 250 URLs in that list. When I loop over this process, I only get the information for the last URL.

This is the code I wrote for one URL:

import requests
from bs4 import BeautifulSoup


def get_movie_info(a_tag, div_tag):
  # Return the genre and the running time for one movie page
  span_tags1 = a_tag.find_all('span')
  genre = span_tags1[0].text.strip()
  li_tags = div_tag.find_all('li')
  length_of_film = li_tags[1].text.strip()
  return genre, length_of_film


movie_page_url = links[0]  # 1st URL in the list
response = requests.get(movie_page_url)
# Parse the page; this step was missing from the posted snippet
movie_doc = BeautifulSoup(response.content, 'html.parser')

# get a tags
a_tags = movie_doc.find_all('a', attrs={'class': "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"})

# get div tags
div_tags = movie_doc.find_all('div', attrs={'class': "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"})

movie_dict = {'genre1': [], 'length_of_movie': []}

a_tag = a_tags[0]
div_tag = div_tags[0]

movie_info = get_movie_info(a_tag, div_tag)
movie_dict['genre1'].append(movie_info[0])
movie_dict['length_of_movie'].append(movie_info[1])

Output

movie_dict = {'genre1': ['Crime'],'length_of_movie': ['2h 25min']}

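The output should be a dataframe with 'genre1' and 'length_of_movie' columns and 250 rows, holding each movie's genre and length.

Solution

Iterate over your list of movie URLs and append each result to the dictionary values. As the last step, create the dataframe: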

import pandas as pd
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.imdb.com/title/tt0093603/","https://www.imdb.com/title/tt8176054/","https://www.imdb.com/title/tt0367495/",# ... rest of your URLs
]


def get_movie_info(a_tag,div_tag):
    span_tags1 = a_tag.find_all("span")
    genre = span_tags1[0].text.strip()
    li_tag = div_tag.find(lambda tag: tag.name == "li" and "min" in tag.text)
    length_of_film = li_tag.text.strip()
    return genre,length_of_film


movie_dict = {"genre1": [],"length_of_movie": []}
for movie_page_url in links:
    response = requests.get(movie_page_url)
    movie_doc = BeautifulSoup(response.content,"html.parser")

    # get a tags
    a_tags = movie_doc.find_all(
        "a",attrs={
            "class": "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"
        },)

    # get div tags
    div_tags = movie_doc.find_all(
        "div",attrs={
            "class": "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"
        },)

    a_tag = a_tags[0]
    div_tag = div_tags[0]

    movie_info = get_movie_info(a_tag,div_tag)
    movie_dict["genre1"].append(movie_info[0])
    movie_dict["length_of_movie"].append(movie_info[1])

df = pd.DataFrame(movie_dict)
print(df)

Prints:

      genre1 length_of_movie
0      Crime        2h 25min
1      Drama        2h 34min
2  Adventure        2h 40min
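
The looping code that kept only the last URL is not shown in the question, so its cause can only be guessed. One common way to end up there is creating movie_dict inside the loop, which throws away the earlier rows on every pass. The sketch below contrasts that with creating the dictionary once before the loop, as the answer does. It makes no network calls: FAKE_PAGES, fetch_movie_info, collect_buggy and collect_fixed are hypothetical names made up for this illustration, and the sample values are just placeholders.

```python
# Hypothetical stand-in for the real request + parsing step, so the
# example runs offline. Keys and values are placeholders for illustration.
FAKE_PAGES = {
    "url-1": ("Crime", "2h 25min"),
    "url-2": ("Drama", "2h 34min"),
    "url-3": ("Adventure", "2h 40min"),
}


def fetch_movie_info(url):
    """Pretend to scrape one page and return (genre, length)."""
    return FAKE_PAGES[url]


def collect_buggy(urls):
    """Re-creates the dict on every pass, so only the last URL survives."""
    movie_dict = {}
    for url in urls:
        movie_dict = {"genre1": [], "length_of_movie": []}  # reset each time
        genre, length = fetch_movie_info(url)
        movie_dict["genre1"].append(genre)
        movie_dict["length_of_movie"].append(length)
    return movie_dict


def collect_fixed(urls):
    """Creates the dict once, before the loop, and appends one row per URL."""
    movie_dict = {"genre1": [], "length_of_movie": []}
    for url in urls:
        genre, length = fetch_movie_info(url)
        movie_dict["genre1"].append(genre)
        movie_dict["length_of_movie"].append(length)
    return movie_dict


urls = list(FAKE_PAGES)
print(collect_buggy(urls))  # one row only: the last URL
print(collect_fixed(urls))  # one row per URL
```

If your own loop resembles collect_buggy, moving the dictionary creation above the for statement should be enough to keep one row per URL.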
