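Does JSON need to be parsed twice?

Problem description

I have a JSON string containing escaped characters that I scraped from a website with bs4:

The code I used when trying to parse it:

data.html: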

<script> 
var variable_json = JSON.parse("{\u0022id\u0022:1990,\u0022media_id\u0022:\u00225299\u0022}")
</script>

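Scraping the HTML data:

import json
from bs4 import BeautifulSoup

with open("data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
# .string is the text inside the <script> tag; strip the JavaScript wrapper around the argument
script = soup.find("script").string.strip().replace("var variable_json = JSON.parse(", "").rstrip(");")

json_dict = json.loads(script)

Output: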

{"id":1990,"media_id":"5299"}
*This does not work*

When I try to get the value of a key, it raises an error: json_dict["id"] gives TypeError: string indices must be integers

However, I recently found a temporary workaround: I have to parse it twice with json.loads:

Workaround code:

json_dict = json.loads(json.loads(script))

Output:

{'id': 1990, 'media_id': '5299'}
*This works the best*

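This actually behaves like a dictionary rather than a string object.

Now my question: is there really no better way than parsing it twice? Or is there a more Pythonic way?

I have a few hunches, namely that there is a specific function that can turn the escaped form \u0022Text\u0022 into "Text" without using json.loads(), so if there is one I would love to know about it.

Solution

No, don't load it twice.

I think there is at least one typo in the definition of collected. At the very least it needs a closing double quote. I ignored that and simply used the value of json_dict that you provided: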

{"id":1990,"media_id":"5299"}

As you can see, accessing the value of id is no problem:

$ python
Python 3.8.6 (default, Jan 27 2021, 15:42:20)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> json_dict={"id":1990,"media_id":"5299"}
>>> json_dict['id']
1990
>>> json_dict["media_id"]
'5299'
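
For contrast, the TypeError from the question is what you get when the object being indexed is still a str rather than a dict. The following is only a minimal sketch, assuming the scraped text is the quoted argument of JSON.parse with its backslashes left literal (the raw string and the variable names here are illustrative, not taken from the question's page):

```python
import json

# Assumption: this is the scraped text, with literal backslashes (hence the raw string).
script = r'"{\u0022id\u0022:1990,\u0022media_id\u0022:\u00225299\u0022}"'

parsed = json.loads(script)
print(type(parsed).__name__)  # str: the outer quotes make this a JSON string, not an object

try:
    parsed["id"]  # indexing a str with a key
except TypeError as exc:
    print("TypeError:", exc)
```

Under that assumption, json.loads has only decoded the string literal, so the result is text that still has to be parsed before keys can be looked up.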

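Please read the following carefully --> here to understand what JSON.parse means inside a script tag in HTML.

Please consider the following points:

To get rid of the Unicode escapes in the HTML source, you have to use a real parser, such as lxml, according to the documentation.
You can also use a regular expression to extract the content between {...} and load it as JSON, which removes the Unicode escapes.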

import re
import json

html = """<script> 
var variable_json = JSON.parse("{\u0022id\u0022:1990,\u0022media_id\u0022:\u00225299\u0022}")
</script>"""


# Grab the first {...} span (non-greedy) and parse it as JSON.
match = json.loads(re.search(r'(\{.*?\})', html).group(1))
print(match)

Output:

{'id': 1990, 'media_id': '5299'}

Or

import json
from bs4 import BeautifulSoup

html = """<script> 
var variable_json = JSON.parse("{\u0022id\u0022:1990,\u0022media_id\u0022:\u00225299\u0022}")
</script>"""


soup = BeautifulSoup(html,'lxml')

# Keep everything after the first double quote, then drop the trailing '")' and newline.
print(json.loads(soup.select_one('script').string.split('"', 1)[-1][:-3]))

Output:

{'id': 1990, 'media_id': '5299'}
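
One caveat worth checking against your real page: in the two snippets above, html is an ordinary (non-raw) Python string literal, so Python itself turns each \u0022 into " before the regex or BeautifulSoup ever sees it. Text read from a file or an HTTP response keeps the backslashes as literal characters. The sketch below assumes that situation and also assumes the JavaScript string literal happens to be a valid JSON string (true for this sample, not guaranteed for every page); the regex and variable names are illustrative:

```python
import json
import re

# Raw string: backslashes stay literal, as in a page read from disk or the network.
html = r'''<script>
var variable_json = JSON.parse("{\u0022id\u0022:1990,\u0022media_id\u0022:\u00225299\u0022}")
</script>'''

# Capture the quoted argument of JSON.parse(...), including its surrounding quotes.
match = re.search(r'JSON\.parse\((".*?")\)', html, re.DOTALL)
js_string_literal = match.group(1)

# Step 1: decode the string literal (resolves \u0022 into ").
json_text = json.loads(js_string_literal)
# Step 2: parse the JSON document that the string contains.
data = json.loads(json_text)

print(data)
```

In that case the two calls do different jobs (unquoting the literal, then parsing the JSON), so naming the intermediate value makes the intent clearer than a nested json.loads(json.loads(...)).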
