How to extract fields from nested JSON and save them in a data structure

Problem description

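Using Python, I am trying to extract the fields named "first", "last", and "zipcode", together with their values, from JSON whose structure is not always known in advance. A sample of the JSON might look like this: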

{
"employees": [
    {
        "first": "Alice","last_name": "Alast","zipcode": "12345","role": "dev","nbr": 1,"team": [
            {
                "first_name": "fn","last_name": "ln"
            },{
                "first_name": "fn2","last_name": "ln2"
            }
        ]
    },{
        "name": "Bob","nbr": 2
    }
],"firm": {
    "last_name": "Lhans","zipcode": "67890","location": "CA"
}}

In addition, I would like to save the result in a data structure such as:

{ 
  {
    first: "firstname",last: "lastname",zipcode: "zipcode"
  }
}

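I tried flattening the nested JSON, basing my function on an existing flattening example. I can get the fields this way, but I am struggling to find the best way to save the data in the format modeled above. If one of the fields is empty, I want to fill it with NaN or an empty string rather than ignore it entirely. This is what I have so far. It builds a list of fields and values, but if a field does not exist, it skips the field instead of filling it with an empty value.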

def flatten_json(nested_json,fields: list):
    out = []

    def flatten(x,name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a],a)
        elif type(x) is list:
            for a in x:
                flatten(a)
        elif name in fields:
            out.append(name+": "+x)
    flatten(nested_json)
    return out

This gives me something like:

['first: Alice','last: Jones','zipcode: 12345','first: fn1','last: ln1','first: fn2','last: ln2','last: ln3','zipcode: 67890']

This is not ideal. I would rather have any missing field filled with NaN or an empty string than have it absent from the list.

Solution

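I have modified your function to collect a list of dictionaries. Each dictionary contains only the fields specified in the fields list as keys.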


import pandas as pd


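def flatten_json(nested_json, fields):
    out = []
    temp = {}  # record for the dict currently being visited

    def flatten(x, name=''):
        nonlocal temp
        if type(x) is dict:
            # Start a fresh record for every dict encountered.
            temp = {}
            for a in x:
                flatten(x[a], a)
        elif type(x) is list:
            for a in x:
                flatten(a)
        elif name in fields:
            # The same dict object is appended once per matching field,
            # so out contains repeats; they are removed later with
            # drop_duplicates().
            temp[name] = x
            out.append(temp)
    flatten(nested_json)
    return out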


json1 = {"employees": [{"first": "Alice","last_name": "Alast","zipcode": "12345","role": "dev","nbr": 1,"team": [{"first_name": "fn","last_name": "ln"},{
    "first_name": "fn2","last_name": "ln2"}]},{"name": "Bob","nbr": 2}],"firm": {"last_name": "Lhans","zipcode": "67890","location": "CA"}}

fields = ['first_name','last_name','zipcode']
result = (flatten_json(json1,fields))

The output of flatten_json can then be loaded into a pandas DataFrame:

df = pd.DataFrame(result)
df.drop_duplicates(inplace=True)
print(df)

This gives the following output:

  last_name zipcode first_name
0     Alast   12345        NaN
2        ln     NaN         fn
4       ln2     NaN        fn2
6     Lhans   67890        NaN
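
The table above marks missing fields with NaN. If empty strings are preferred, the fill can also be done in plain Python before any DataFrame is built. The following is only a minimal sketch: the helper name fill_missing and the sample records are assumptions made for illustration, and it assumes every value is a hashable scalar (string, number, boolean, or None).

```python
def fill_missing(records, fields, fill=""):
    """Return unique records in which every requested field is present."""
    seen = set()
    filled = []
    for rec in records:
        # Build the row in a fixed field order, substituting the fill value.
        row = tuple(rec.get(f, fill) for f in fields)
        if row in seen:
            continue  # skip repeats of an already emitted record
        seen.add(row)
        filled.append(dict(zip(fields, row)))
    return filled


# Hypothetical sample shaped like the function's output, including a repeat.
records = [
    {"last_name": "Alast", "zipcode": "12345"},
    {"last_name": "Alast", "zipcode": "12345"},
    {"first_name": "fn", "last_name": "ln"},
]
print(fill_missing(records, ["first_name", "last_name", "zipcode"]))
```

If the data is already in a DataFrame, calling df.fillna('') should have the same effect on the table.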

Now, to get the data back in a JSON-like form, you can use the to_dict() method to convert the DataFrame into a list of dicts:

print(df.to_dict(orient='records'))

Output:

[{'first_name': nan,'last_name': 'Alast','zipcode': '12345'},{'first_name': 'fn','last_name': 'ln','zipcode': nan},{'first_name': 'fn2','last_name': 'ln2','zipcode': nan},{'first_name': nan,'last_name': 'Lhans','zipcode': '67890'}]
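
The list above holds Python dicts with nan values, which is not yet strict JSON. As an alternative that does not need pandas, the sketch below builds one record per dict and fills missing fields with None, which json.dumps writes as null. This is not the function from the answer: the name extract_records, the fill parameter, and the trimmed sample data are assumptions made for illustration.

```python
import json


def extract_records(node, fields, fill=None):
    """Collect one record per dict that directly holds a requested field."""
    records = []

    def walk(x):
        if isinstance(x, dict):
            # Scalar values of this dict whose key is a requested field.
            found = {k: v for k, v in x.items()
                     if k in fields and not isinstance(v, (dict, list))}
            if found:
                # Missing fields get the fill value instead of being skipped.
                records.append({f: found.get(f, fill) for f in fields})
            for v in x.values():
                walk(v)
        elif isinstance(x, list):
            for item in x:
                walk(item)

    walk(node)
    return records


# Hypothetical sample, trimmed from the structure in the question.
data = {
    "employees": [
        {"first_name": "Alice", "zipcode": "12345",
         "team": [{"first_name": "fn", "last_name": "ln"}]},
        {"name": "Bob", "nbr": 2},
    ],
    "firm": {"last_name": "Lhans", "zipcode": "67890"},
}

records = extract_records(data, ["first_name", "last_name", "zipcode"])
print(json.dumps(records, indent=2))  # None is serialized as JSON null
```

Because each record is assembled from a single dict before its children are visited, the result does not depend on where nested lists appear among the keys, and no duplicate removal is needed.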