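Using BeautifulSoup to open product pages from Amazon search results in separate tabs

Problem description

I'm new to Python and to web scraping. I'm currently working through Al Sweigart's book Automate the Boring Stuff with Python, and one of the suggested practice projects is basically to write a program that does this:

Take input for a product to search for on Amazon
Use requests.get() and .text to get the HTML of that search page
Use BeautifulSoup to search the HTML for the CSS selector that identifies links to product pages
Open the top five products from the search results, each in a separate tab

Here is my code: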

#! python3
# Searches amazon for the inputted product (either through command line or input) and opens 5 tabs with the top
# items for that search.

import requests,sys,bs4,webbrowser
if len(sys.argv) > 1: # if there are system arguments
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(sys.argv))
    res.raise_for_status
else: # take input
    print('what product would you like to search Amazon for?')
    product = str(input())
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(product))
    res.raise_for_status

# retrieve top search links:
soup = bs4.BeautifulSoup(res.text,'html.parser')

print(res.text) # TO CHECK HTML OF SITE,GET RID OF DURING ACTUAL PROGRAM
# open a new tab for the top 5 items,and get the css selector for links
# a list of all things on the downloaded page that are within the css selector 'a-link-normal a-text-normal'
linkElems = soup.select('a-link-normal a-text-normal')

numOpen = min(5,len(linkElems))
for i in range(numOpen):
    urlToOpen = 'https://www.amazon.com/' + linkElems[i].get('href')
    print('opening',urlToOpen)
    webbrowser.open(urlToOpen)

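I think I'm using the right CSS selector ("a-link-normal a-text-normal"), so I suspect the problem is with res.text: when I print it to see what it looks like, the HTML seems incomplete, or at least does not contain the HTML I see when I inspect the same page in Chrome. Also, none of that HTML contains anything like "a-link-normal a-text-normal".

Just as an example, this is what res.text looks like for a search for "big pencil":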

what product would you like to search Amazon for?
big pencil
<!--
        To discuss automated access to Amazon data please contact [email protected].
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no">
  <title>Sorry! Something went wrong!</title>
  <style>
  html,body {
    padding: 0;
    margin: 0
  }

  img {
    border: 0
  }

  #a {
    background: #232f3e;
    padding: 11px 11px 11px 192px
  }

  #b {
    position: absolute;
    left: 22px;
    top: 12px
  }

  #c {
    position: relative;
    max-width: 800px;
    padding: 0 40px 0 0
  }

  #e,#f {
    height: 35px;
    border: 0;
    font-size: 1em
  }

  #e {
    width: 100%;
    margin: 0;
    padding: 0 10px;
    border-radius: 4px 0 0 4px
  }

  #f {
    cursor: pointer;
    background: #febd69;
    font-weight: bold;
    border-radius: 0 4px 4px 0;
    -webkit-appearance: none;
    position: absolute;
    top: 0;
    right: 0;
    padding: 0 12px
  }

  @media (max-width: 500px) {
    #a {
      padding: 55px 10px 10px
    }

    #b {
      left: 6px
    }
  }

  #g {
    text-align: center;
    margin: 30px 0
  }

  #g img {
    max-width: 90%
  }

  #d {
    display: none
  }

  #d[src] {
    display: inline
  }
  </style>
</head>
<body>
    <a href="/ref=cs_503_logo"><img id="b" src="https://images-na.ssl-images-amazon.com/images/G/01/error/logo._TTD_.png" alt="Amazon.com"></a>
    <form id="a" accept-charset="utf-8" action="/s" method="GET" role="search">
        <div id="c">
            <input id="e" name="field-keywords" placeholder="Search">
            <input name="ref" type="hidden" value="cs_503_search">
            <input id="f" type="submit" value="Go">
        </div>
    </form>
<div id="g">
  <div><a href="/ref=cs_503_link"><img src="https://images-na.ssl-images-amazon.com/images/G/01/error/500_503.png"
                                        alt="Sorry! Something went wrong on our end. Please go back and try again or go to Amazon's home page."></a>
  </div>
  <a href="/dogsofamazon/ref=cs_503_d" target="_blank" rel="noopener noreferrer"><img id="d" alt="Dogs of Amazon"></a>
  <script>document.getElementById("d").src = "https://images-na.ssl-images-amazon.com/images/G/01/error/" + (Math.floor(Math.random() * 43) + 1) + "._TTD_.jpg";</script>
</div>
</body>
</html>

Thank you very much for your patience.

Solution

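This is a classic case: if you try to scrape the site directly with a scraper like BeautifulSoup, you will not find anything.

The way the site works is that an initial chunk of code, the same as what you posted for big pencil, is first downloaded to your browser, and the rest of the elements on the page are then loaded via JavaScript.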
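Before reaching for a browser, it can help to confirm what the server actually returned. The HTML quoted in the question is titled "Sorry! Something went wrong!" and opens with a comment about automated access, so it reads like an error page rather than a partial results page; the cs_503 references in its links hint at a 503 status, though the question does not show the status code. A minimal sketch of such a check follows. The function name and both marker constants are hypothetical, chosen for illustration from the text quoted in the question.

```python
# Marker taken from the HTML comment at the top of the response quoted in the question.
BLOCK_MARKER = 'To discuss automated access to Amazon data'
# Title of the error page quoted in the question.
ERROR_TITLE = '<title>Sorry! Something went wrong!</title>'


def looks_like_error_page(status_code, html):
    """Return True when a response resembles the error page shown in the question."""
    return status_code >= 400 or BLOCK_MARKER in html or ERROR_TITLE in html
```

With a response from requests.get(), this would be called as looks_like_error_page(res.status_code, res.text). Note also that the question's script writes res.raise_for_status without parentheses, which only names the method and never calls it; res.raise_for_status() would raise an exception for a 4xx or 5xx response.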

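You will need to use Selenium WebDriver to load the page first and then grab the code from the browser. In practical terms, this is equivalent to opening your browser's developer tools, going to the Elements tab, and looking for the classes you mentioned.

To see the difference, I suggest you view the page source and compare it with the code in the Elements tab.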
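One rough way to make that comparison in code is to count how often a class name appears in each version of the HTML. The sketch below uses a plain regular expression, which is a crude stand-in for real HTML parsing and only handles double-quoted class attributes; the function name count_class_uses is hypothetical.

```python
import re

# Matches class="..." attributes; single-quoted or unquoted values are not handled.
CLASS_ATTR = re.compile(r'class\s*=\s*"([^"]*)"', re.IGNORECASE)


def count_class_uses(html, class_name):
    """Count class attributes in html whose space-separated value includes class_name."""
    return sum(class_name in match.group(1).split()
               for match in CLASS_ATTR.finditer(html))
```

Comparing count_class_uses(res.text, 'a-link-normal') for the raw response with the same call on the HTML taken from the browser should show whether the class only exists in the rendered page.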

Here, you need to hand BS4 the data that has been loaded into the browser:

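import bs4
from selenium import webdriver

url = 'https://www.amazon.com/s?k=big+pencil'  # example: the search from the question

# "path_to_chromedriver" is a placeholder. Passing the path like this is the older Selenium style;
# recent Selenium releases can usually locate the driver themselves via webdriver.Chrome().
browser = webdriver.Chrome("path_to_chromedriver")  # opens a new browser instance; see the Selenium docs

browser.get(url)  # fetch the URL in the browser

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')  # load the rendered HTML into BS4, then extract elements as before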
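To connect this back to the original task, here is a sketch of the remaining steps: building the search URL, selecting the product links, and opening the first five in tabs. It assumes the class names from the question still appear in Amazon's markup, which may no longer be true. The selector is written with leading dots because both names are classes; the string in the question has no dots, so BeautifulSoup would read it as two tag names. All function names and constants here are hypothetical.

```python
import webbrowser
from urllib.parse import quote_plus, urljoin

BASE_URL = 'https://www.amazon.com/'
# Class names taken from the question; Amazon's markup may have changed since.
LINK_SELECTOR = 'a.a-link-normal.a-text-normal'


def build_search_url(terms):
    """Join the search words with spaces and URL-encode them into a search URL."""
    return BASE_URL + 's?k=' + quote_plus(' '.join(terms))


def absolute_url(href):
    """Turn a relative href such as '/dp/X' into a full URL without doubling the slash."""
    return urljoin(BASE_URL, href)


def open_top_results(terms, limit=5):
    """Render the search page in Chrome, then open the first `limit` product links in tabs."""
    # Third-party imports live here so the helpers above work without them installed.
    import bs4
    from selenium import webdriver

    browser = webdriver.Chrome()  # assumes Selenium can locate a Chrome driver
    try:
        browser.get(build_search_url(terms))
        soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        browser.quit()  # always close the browser, even if loading fails

    for link in soup.select(LINK_SELECTOR)[:limit]:
        href = link.get('href')
        if href:  # skip anchors without an href
            webbrowser.open(absolute_url(href))
```

Calling open_top_results(sys.argv[1:]) would mirror the command-line branch of the question's script. The slice matters because sys.argv also contains the script name, and joining with a space keeps multi-word searches intact.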

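This is very basic code for getting to know Selenium; in a production use case, however, you will probably want to use a headless browser such as PhantomJS.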
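As far as I know, PhantomJS is no longer maintained and current Selenium releases no longer ship a driver for it, so headless Chrome is the more likely choice today. The following is a minimal sketch under that assumption; the function name, the constant, and the window size are illustrative, not part of the original answer.

```python
# Arguments passed to Chrome; "--headless" asks it to run without a visible window.
HEADLESS_ARGS = ['--headless', '--window-size=1280,800']


def make_headless_chrome():
    """Return a Chrome WebDriver that renders pages without opening a window."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

The returned driver is used like the visible one: call get(url), read page_source, and call quit() when finished.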
