How to further parse nested dictionaries in a JSON file into a DataFrame in Python

Problem description

I have a very large JSON file that I want to convert into a DataFrame with a specific structure, which I describe later in this question.

A few records from the sample JSON look like this:

JsonRecords = {
    'rec1': {
        'words': [['A','B','C','.'], ['D','E','F','.']],
        'Ids': [[0,1], [2,3]],
        'unique': [1,1],
        'ments': {
            "(0,1)": {
                "A1": [0], "A2": [0,1], "A3": [1], "A4": [1,0], "A5": [0]
            },
            "(2,3)": {
                "A1": [0], "A2": [0], "A5": [0]
            }
        }
    },
    'rec2': {
        'words': [['We','us','them','.'], ['is','it','us','.']],
        'Ids': [[4,5], [6,7]],
        'unique': [0,0],
        'ments': {
            "(4,5)": {
                "A1": [0], "A3": [0], "A4": [0]
            },
            "(6,7)": {
                "A1": [0], "A4": [0,0], "A6": [0,1]
            }
        }
    },
    # 'rec3': ..... more records
}

I parsed the sample JSON with the following code:

  import pandas as pd
  #import json

  all_data = []
  for k, v in JsonRecords.items():
      words, Ids, unique, ments = v['words'], v['Ids'], v['unique'], v['ments']
      # Pair each word list with its Ids and its (key, dict) entry from 'ments'
      for t, val, m in zip(words, Ids, ments.items()):
          all_data.append({
              'records': k, 'words': ' '.join(t), 'Ids': val, 'unique': unique, 'ments': m
          })
  #print(all_data)
  df = pd.DataFrame(all_data)
  df.to_csv('myData.csv', encoding='utf-8')
  print(df.head())

When I run the code, I get the following DataFrame structure:

 records   words         Ids     unique   ments
  rec1     A,B,C.        [0,1]   [1,1]    ('(0,1)', {'A1': [0], 'A2': [0,1], 'A3': [1], 'A4': [1,0], 'A5': [0]})
  rec1     D,E,F.        [2,3]   [1,1]    ('(2,3)', {'A1': [0], 'A2': [0], 'A5': [0]})
  rec2     We,us,them.   [4,5]   [0,0]    ('(4,5)', {'A1': [0], 'A3': [0], 'A4': [0]})
  rec2     is,it,us.     [6,7]   [0,0]    ('(6,7)', {'A1': [0], 'A4': [0,0], 'A6': [0,1]})
  rec3

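As shown above, I could not parse the 'ments' dictionary any further. I want its nested values expanded into their own columns, with the 'Ids' and 'words' values repeated on every row that this expansion produces.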

The DataFrame structure I want for this nested JSON is shown below.

Records   words         Ids     unique   ments    A1  A2  A3  A4  A5  A6
  rec1    A,B,C.        [0,1]   [1,1]    [0,1]    0   0   1   1   0
  rec1    A,B,C.        [0,1]                         1       0
  rec1    D,E,F.        [2,3]   [1,1]    [2,3]    0   0           0
  rec1    D,E,F.        [2,3]
  rec2    We,us,them.   [4,5]   [0,0]    [4,5]    0       0   0
  rec2    We,us,them.   [4,5]
  rec2    is,it,us.     [6,7]   [0,0]    [6,7]    0           0       0
  rec2    is,it,us.     [6,7]                                 0       1
  rec3
  ....... more records

I would appreciate some help.

Solution

Use apply and json_normalize:
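For context, here is a minimal sketch of what `pd.json_normalize` does with a list of flat dictionaries like the inner 'ments' entries. The `records` list is a made-up sample in the shape of the question's data, not the real file. Each key becomes a column, list values are kept as lists, and keys missing from a record come out as NaN.

```python
import pandas as pd

# Hypothetical sample shaped like two inner 'ments' dictionaries
records = [
    {"A1": [0], "A2": [0, 1], "A3": [1]},
    {"A1": [0], "A5": [0]},
]

# One column per key; list values stay as lists, absent keys become NaN
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

Because the cells still hold lists after this step, a second pass is needed to spread each list over several rows.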

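import pandas as pd

def getMents(value):
    # First item of the (key, dict) tuple: the 'ments' key, e.g. '(0,1)'
    return value[0]

def getJson(value):
    # Second item of the tuple: the nested dict, e.g. {'A1': [0], ...}
    return value[1]

df = pd.DataFrame(all_data)
df['json'] = df['ments'].apply(getJson)
# One column per nested key (A1, A2, ...); keys absent from a row become NaN
jsonData = pd.json_normalize(df['json'].tolist())
df['ments'] = df['ments'].apply(getMents)
for col in jsonData.columns.values:
    df[col] = jsonData[col]

# Empty frames with the same columns as df
new_df = df[0:0].copy()
results = df[0:0].copy()
for index, row in df.iterrows():
    # The longest list in this row sets how many output rows it expands into
    maxCount = 0
    for col in jsonData.columns.values:
        if isinstance(row[col], list):
            maxCount = max(maxCount, len(row[col]))
    for i in range(0, maxCount):
        # new_df is reset for every source row, so the next label equals i
        count = len(new_df)
        new_df.loc[count] = row

        for col in jsonData.columns.values:
            if isinstance(new_df[col][i], list):
                try:
                    # Keep only the i-th element of the list in row i
                    new_df.loc[i, col] = new_df[col][i][i]
                except IndexError:
                    # This list is shorter than the longest one in the row
                    new_df.loc[i, col] = None
    results = pd.concat([results, new_df])
    new_df = df[0:0].copy()

results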
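
As an alternative, the rows can be built directly from the nested dictionaries before any DataFrame exists, which avoids growing a frame row by row. The following is only a sketch under assumptions: `expand_records` is a hypothetical helper name, the sample holds a single record with illustrative values, and shorter lists are padded with `itertools.zip_longest` rather than by catching `IndexError`.

```python
from itertools import zip_longest

import pandas as pd

# Small sample in the shape of the question's data (values are illustrative)
JsonRecords = {
    "rec1": {
        "words": [["A", "B", "C", "."], ["D", "E", "F", "."]],
        "Ids": [[0, 1], [2, 3]],
        "unique": [1, 1],
        "ments": {
            "(0,1)": {"A1": [0], "A2": [0, 1], "A3": [1], "A4": [1, 0], "A5": [0]},
            "(2,3)": {"A1": [0], "A2": [0], "A5": [0]},
        },
    },
}


def expand_records(records):
    """Return a DataFrame with one row per list position of each 'ments' entry."""
    rows = []
    for rec, v in records.items():
        entries = zip(v["words"], v["Ids"], v["ments"].items())
        for words, ids, (ment_key, ment) in entries:
            cols = list(ment)
            # zip_longest walks all lists in parallel and pads short ones with None
            for values in zip_longest(*ment.values()):
                row = {
                    "records": rec,
                    "words": " ".join(words),
                    "Ids": ids,
                    "unique": v["unique"],
                    "ments": ment_key,
                }
                row.update(zip(cols, values))
                rows.append(row)
    return pd.DataFrame(rows)


result = expand_records(JsonRecords)
print(result)
```

Unlike the desired layout in the question, this sketch repeats 'unique' and 'ments' on every expanded row and emits no empty row for entries whose lists all have length one. Both could be adjusted afterwards if the exact layout matters.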