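lxml.xpath not placing elements into a list

Problem description

So here is my problem. I am trying to use lxml to scrape a website for some information, but when I call var.xpath it cannot find the elements that hold that information. It fetches the page, but after running the xpath it finds nothing.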

import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)

    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()

OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"

its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is:  []

Process finished with exit code 0

The output next to "the 3s rank is:" should look like this

[<Element html at 0x20eb01006d0>,<Element html at 0x20eb01006d0>,<Element html at 0x20eb01006d0>]


Solution

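Because the xpath string does not match anything, page.xpath(..) returns an empty result set. It is hard to say exactly what you are looking for, but given the name "threesRank" I assume you want all the table values, i.e. the ranks and so on.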

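You can get a more accurate and easier-to-read xpath with the Chrome extension "Xpath helper". Usage: open the site and activate the extension, then hold down Shift and hover over the element you are interested in.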

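Because the HTML used by tracker.network is built dynamically with JavaScript and BootstrapVue (plus Moment/Typeahead/jQuery), there is a real risk that the dynamic rendering produces different results from one time to the next.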
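
A quick way to check whether this is what is happening is to parse the raw response and see whether the table exists in it at all. The sketch below uses a made-up HTML string, not the real site's response, and assumes the server sends only an empty mount point that JavaScript fills in later:

```python
from lxml import html

# Hypothetical stand-in for a response body before any JavaScript has run:
# an empty mount point plus a script tag carrying the data.
raw_html = """
<html>
  <body>
    <div id="app"></div>
    <script>window.__INITIAL_STATE__ = {"stats": []};</script>
  </body>
</html>
"""

page = html.fromstring(raw_html)

# The mount point is present, but nothing has been rendered into it.
app = page.xpath('//*[@id="app"]')
tables = page.xpath('//*[@id="app"]//table')

print(len(app))   # how many elements carry id="app"
print(tables)     # empty list: there is no table in the raw HTML
```

If the same check against the real response also finds no table, no xpath copied from the browser can match, however it is written.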

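Instead of scraping the rendered HTML, I suggest using the structured data that the page is rendered from. In this case it is stored as JSON in a JavaScript variable named __INITIAL_STATE__.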

import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. Data is stored as Json in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured json data
rocketleague = json.loads(json_string)

# Save structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt','w') as outfile:
    outfile.write(json.dumps(rocketleague,indent=4,sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid a KeyError when a key is missing, or an IndexError when a list index is
# out of range, use "with suppress" as in the example below: since there is no
# platform number 99, the variable "platform99" is left unassigned and no
# exception propagates.
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
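
Two parts of the approach above can be made a little more defensive. The non-greedy regex stops at the first "};" it sees, even if that sequence sits inside a JSON string, and every nested lookup needs its own guard. The following is only a sketch on a made-up sample string: the helper names extract_initial_state and dig are hypothetical, and the sample data is invented rather than taken from the site.

```python
import json
import re


def extract_initial_state(page_source, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker`, or None if it is absent."""
    match = re.search(re.escape(marker) + r"\s*=\s*", page_source)
    if match is None:
        return None
    # raw_decode parses one complete JSON value starting at the given index
    # and ignores whatever follows, so "};" inside a string cannot cut it short.
    obj, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    return obj


def dig(data, *path, default=None):
    """Follow keys and indexes through nested dicts/lists, or return default."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Invented sample: note the "};" inside the "note" string.
sample = (
    '<script>window.__INITIAL_STATE__={"note":"a};b","titles":'
    '{"currentTitle":{"platforms":[{"name":"Xbox"}]}}};</script>'
)

state = extract_initial_state(sample)
print(dig(state, "titles", "currentTitle", "platforms", 0, "name"))
print(dig(state, "titles", "currentTitle", "platforms", 99, "name"))
```

The first lookup returns the stored name, and the second returns None instead of raising, which plays the same role as the "with suppress" block without leaving a variable unassigned.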
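
lxml does not add the "tbody" element that browsers insert into tables when they build the DOM, so an xpath copied from the browser that contains "tbody" will not match if the raw HTML has none. Change your xpath to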

'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
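
The effect can be reproduced on a small, made-up table (the markup below is hypothetical and not taken from the site). It is only a sketch, and it only helps if the table is actually present in the HTML that the server returns:

```python
from lxml import html

# Hypothetical markup with no <tbody>, as a server might send it.
raw_html = (
    "<html><body><table>"
    "<tr><td>Rank</td></tr>"
    "<tr><td>Diamond</td></tr>"
    "</table></body></html>"
)
doc = html.fromstring(raw_html)

with_tbody = doc.xpath('//table/tbody/tr')   # path as copied from a browser
without_tbody = doc.xpath('//table/tr')      # path matching the raw markup
either_way = doc.xpath('//table//tr')        # descendant axis tolerates both

print(len(with_tbody), len(without_tbody), len(either_way))
```

Using "//tr" under the table rather than a fixed "tbody/tr" or "tr" step keeps the expression working whether or not the source markup contains a tbody.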