Unable to get web page content

Problem description

I am writing a script that needs the content of a page such as this one:

https://pcb.inc.hp.com/webapp/#/nl-nl/contents/33128146?type=I&hierarchy=F&status=L&status=O

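I am using Scrapy, which usually works, but at the moment I cannot get this page's HTML with requests, Scrapy, or any other module. Does anyone know what could be going wrong?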

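Solutions

Some websites load their data dynamically with JavaScript, so the HTML returned by a plain HTTP request does not contain the content you see in the browser.

For these cases we use ScrapySplash, which renders the page for you in a headless browser.

See the ScrapySplash documentation for setup details.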
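As a rough illustration of what the rendering step does, here is a minimal sketch that calls the Splash service's `render.html` endpoint directly with the standard library rather than going through the Scrapy integration. It assumes a Splash instance is already running at `http://localhost:8050` (the address and the `wait` value are assumptions, as are the helper names). Note that the page URL contains a `#` fragment, so it has to be percent-encoded before being passed as a query parameter.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=2.0):
    """Build a Splash render.html URL for page_url.

    splash_base and wait are assumptions; adjust them to your own setup.
    urlencode percent-encodes the '#' and '?' inside page_url so that the
    whole page URL reaches Splash as a single 'url' argument.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, **kwargs):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(splash_render_url(page_url, **kwargs), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (requires a running Splash instance):
# html = fetch_rendered_html(
#     "https://pcb.inc.hp.com/webapp/#/nl-nl/contents/33128146"
#     "?type=I&hierarchy=F&status=L&status=O"
# )
```

Inside a Scrapy project you would normally let ScrapySplash issue this request for you; the sketch only shows the underlying idea.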

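This site uses AngularJS to generate its content dynamically when the page loads, so you cannot scrape the content directly from the raw HTML. I suggest using something like Selenium with Python to scrape the data.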
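A minimal sketch of the Selenium approach might look like the following. It assumes Selenium 4 and a recent Chrome are installed; `fetch_with_selenium` and the `ready_css` parameter are hypothetical names, and you would need to inspect the page yourself to pick a CSS selector that only appears once AngularJS has finished rendering.

```python
def fetch_with_selenium(url, ready_css, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run.

    ready_css is a CSS selector (an assumption you must supply) for an element
    that exists only once the dynamic content has been rendered.
    """
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the chosen element is present, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Waiting on a specific element tends to be more reliable than a fixed sleep, because the rendering time of the page can vary.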

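Alternatively, depending on your needs, you can open the Network tab in Chrome DevTools to see the requests the page makes, and scrape the data from those URLs directly.

For example: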

Request URL: https://pcb.inc.hp.com/api/catalogs/nl-nl/nodes/0/children?status[]=O&status[]=L&hierParadigm=F

Response: {"baseProdname":"ROOT_NODE","oid":0,"level":0,"status":["O","L"],"cultureCode":"nl-nl","children":[{"baseProdname":"Solutions","oid":8176594,"level":1,"status":["L","O"],"cultureCode":"nl-nl"},{"baseProdname":"Scanners/Copiers/Faxes","oid":15179,{"baseProdname":"Software","oid":8133386,{"baseProdname":"Ink/Toner/Paper/Printer Supplies","oid":12771,{"baseProdname":"Laptops and Hybrids","oid":321957,{"baseProdname":"Printers and Multifunction","oid":18972,{"baseProdname":"Point of Sale Systems","oid":7491307,{"baseProdname":"Desktops & Workstations","oid":12454,{"baseProdname":"Monitors","oid":382087,{"baseProdname":"Services","oid":8362107,{"baseProdname":"Accessories","oid":8386448,{"baseProdname":"3D Materials and Consumables","oid":20063457,{"baseProdname":"Handhelds and Calculators","oid":215348,{"baseProdname":"Industries","oid":20008722,"status":["L"],{"baseProdname":"Tablets","oid":5169094,"status":["O"],{"baseProdname":"Projectors","oid":3338965,{"baseProdname":"Digital Cameras and Photo Studios","oid":382085,"cultureCode":"nl-nl"}]}
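The request above can be reproduced with the standard library alone. The sketch below rebuilds the same URL and pulls the `oid` and `baseProdname` of each child out of the JSON. The helper names are hypothetical, and it is an assumption that the endpoint accepts other node IDs in the same way and needs no cookies or extra headers.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://pcb.inc.hp.com/api/catalogs"


def children_url(culture, node_oid, statuses=("O", "L"), hier_paradigm="F"):
    """Build the 'children' URL for a catalog node, mirroring the DevTools request."""
    params = [("status[]", s) for s in statuses] + [("hierParadigm", hier_paradigm)]
    # safe="[]" keeps the brackets literal, as in the observed request.
    query = urlencode(params, safe="[]")
    return f"{API_BASE}/{culture}/nodes/{node_oid}/children?{query}"


def child_names(payload):
    """Return (oid, baseProdname) pairs for the direct children in a response."""
    return [(c["oid"], c["baseProdname"]) for c in payload.get("children", [])]


def fetch_children(culture, node_oid, timeout=10):
    """Download and parse the JSON for one node (performs a network request)."""
    req = Request(children_url(culture, node_oid), headers={"Accept": "application/json"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a real request):
# for oid, name in child_names(fetch_children("nl-nl", 0)):
#     print(oid, name)
```

If the assumption about node IDs holds, calling `fetch_children` again with each returned `oid` would let you walk the catalog tree level by level.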