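Problem description
Using Python, I am trying to extract the fields named "first", "last", and "zipcode", together with their values, from JSON whose structure is not always known in advance. A sample of the JSON might look like this: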
{
    "employees": [
        {
            "first": "Alice",
            "last_name": "Alast",
            "zipcode": "12345",
            "role": "dev",
            "nbr": 1,
            "team": [
                {"first_name": "fn", "last_name": "ln"},
                {"first_name": "fn2", "last_name": "ln2"}
            ]
        },
        {"name": "Bob", "nbr": 2}
    ],
    "firm": {"last_name": "Lhans", "zipcode": "67890", "location": "CA"}
}
Beyond that, I would like to save the result in a data structure such as:
{
{
first: "firstname",last: "lastname",zipcode: "zipcode"
}
}
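I tried flattening the nested JSON, basing my function on an existing example. I can get the fields this way, but I am struggling to find the best way to save the data in the format modeled above. If one of the fields is missing, I want to fill it with NaN or an empty string rather than ignore it entirely. Here is what I have so far; it builds a list of fields and values, but if a field does not exist it skips it instead of filling it with a null value.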
def flatten_json(nested_json, fields: list):
    out = []

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], a)
        elif type(x) is list:
            for a in x:
                flatten(a)
        elif name in fields:
            # str() so that non-string values (e.g. numbers) do not raise TypeError
            out.append(name + ": " + str(x))

    flatten(nested_json)
    return out
This gives me something like:
['first: Alice','last: Jones','zipcode: 12345','first: fn1','last: ln1','first: fn2','last: ln2','last: ln3','zipcode: 67890']
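This is not ideal. I would rather have any missing field filled with NaN or an empty string than have it absent from the list.
Solution
I have modified your function to collect a list of dictionaries. Each dictionary contains only the fields named in the fields list as keys.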
import pandas as pd

def flatten_json(nested_json, fields):
    out = []
    temp = {}

    def flatten(x, name=''):
        nonlocal temp
        if type(x) is dict:
            # Start a fresh record for every JSON object encountered
            temp = {}
            for a in x:
                flatten(x[a], a)
        elif type(x) is list:
            for a in x:
                flatten(a)
        elif name in fields:
            temp[name] = x
            # The same dict is appended once per matched field,
            # so the result contains duplicate entries
            out.append(temp)

    flatten(nested_json)
    return out

json1 = {
    "employees": [
        {"first": "Alice", "last_name": "Alast", "zipcode": "12345", "role": "dev", "nbr": 1,
         "team": [{"first_name": "fn", "last_name": "ln"},
                  {"first_name": "fn2", "last_name": "ln2"}]},
        {"name": "Bob", "nbr": 2}
    ],
    "firm": {"last_name": "Lhans", "zipcode": "67890", "location": "CA"}
}
fields = ['first_name', 'last_name', 'zipcode']
result = flatten_json(json1, fields)
The output of the function above can then be loaded into a pandas DataFrame:
df = pd.DataFrame(result)
df.drop_duplicates(inplace=True)
print(df)
This gives the following output:
  last_name zipcode first_name
0     Alast   12345        NaN
2        ln     NaN         fn
4       ln2     NaN        fn2
6     Lhans   67890        NaN
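The question also mentions empty strings as an acceptable placeholder. A minimal sketch of that variant follows, assuming the records look like the ones produced above; the `records` list and the name `df_filled` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical records shaped like the output of flatten_json
records = [
    {"last_name": "Alast", "zipcode": "12345"},
    {"first_name": "fn", "last_name": "ln"},
]
fields = ["first_name", "last_name", "zipcode"]

# reindex makes sure every requested column exists, even if no record had it;
# fillna then replaces the NaN placeholders with empty strings
df_filled = pd.DataFrame(records).reindex(columns=fields).fillna("")
print(df_filled)
```

Using reindex with the full field list also fixes the column order, which otherwise depends on the order in which keys first appear in the records.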
Now, to get the data in a JSON-like format, you can use the to_dict() function to convert the DataFrame back into a list of dicts:
print(df.to_dict(orient='records'))
Output:
[{'first_name': nan,'last_name': 'Alast','zipcode': '12345'},{'first_name': 'fn','last_name': 'ln','zipcode': nan},{'first_name': 'fn2','last_name': 'ln2','zipcode': nan},{'first_name': nan,'last_name': 'Lhans','zipcode': '67890'}]
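If pandas is not otherwise needed, roughly the same result can be sketched in plain Python. This is a variant rather than the function above: it emits one record per JSON object that directly holds at least one requested field, and fills the missing ones with a default. The name `collect_records`, its `default` parameter, and the sample `data` are assumptions made for this illustration.

```python
def collect_records(node, fields, default=""):
    """Return one dict per JSON object that directly holds a requested field."""
    records = []

    def walk(x):
        if isinstance(x, dict):
            # Only scalar values directly inside this object count as its fields
            found = {k: v for k, v in x.items()
                     if k in fields and not isinstance(v, (dict, list))}
            if found:
                # Fill every requested field, using the default where absent
                records.append({f: found.get(f, default) for f in fields})
            for v in x.values():
                walk(v)
        elif isinstance(x, list):
            for item in x:
                walk(item)

    walk(node)
    return records


# Hypothetical sample, trimmed down from the JSON in the question
data = {
    "employees": [
        {"first_name": "Alice", "zipcode": "12345",
         "team": [{"first_name": "fn", "last_name": "ln"}]},
        {"name": "Bob", "nbr": 2},
    ],
    "firm": {"last_name": "Lhans", "zipcode": "67890"},
}

result = collect_records(data, ["first_name", "last_name", "zipcode"])
for record in result:
    print(record)
```

Because each object produces at most one record, no de-duplication step is needed, and objects without any requested field (such as the "Bob" entry) are skipped. Passing default=float("nan") would give NaN placeholders instead of empty strings.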