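Problem description
I have a very large JSON file that I want to convert into a data frame with a specific structure, which I explain later in this question.
A few records from a sample of the JSON look like this: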
JsonRecords = {
    'rec1': {
        'words': [['A','B','C','.'], ['D','E','F','.']],
        'Ids': [[0,1], [2,3]],
        'unique': [1,1],
        'ments': {
            "(0,1)": {
                "A1": [0], "A2": [0,1], "A3": [1], "A4": [1,0], "A5": [0]
            },
            "(2,3)": {
                "A1": [0], "A2": [0], "A5": [0]
            }
        }
    },
    'rec2': {
        'words': [['We','us','them','.'], ['is','it','us','.']],
        'Ids': [[4,5], [6,7]],
        'unique': [0,0],
        "ments": {
            "(4,5)": {
                "A1": [0], "A3": [0], "A4": [0]
            },
            "(6,7)": {
                "A1": [0], "A4": [0,0], "A6": [0,1]
            }
        }
    },
    # 'rec3': ..... more records
}
I parsed the sample JSON with the following code:
import pandas as pd
#import json

all_data = []
for k, v in JsonRecords.items():
    words, Ids, unique, ments = v['words'], v['Ids'], v['unique'], v['ments']
    # Walk the sentences, their Ids and the (key, dict) items of 'ments' in parallel.
    for t, val, m in zip(words, Ids, ments.items()):
        all_data.append({
            'records': k, 'words': ' '.join(t), 'Ids': val, 'unique': unique, 'ments': m
        })
#print(all_data)
df = pd.DataFrame(all_data)
df.to_csv('myData.csv', encoding='utf-8')
print(df.head())
When I run the code, I get the following data frame structure:
records  words        Ids    unique  ments
rec1     A,B,C.       [0,1]  [1,1]   ('(0,1)', {'A1': [0], 'A2': [0,1], 'A3': [1], 'A4': [1,0], 'A5': [0]})
rec1     D,E,F.       [2,3]  [1,1]   ('(2,3)', {'A1': [0], 'A2': [0], 'A5': [0]})
rec2     We,us,them.  [4,5]  [0,0]   ('(4,5)', {'A1': [0], 'A3': [0], 'A4': [0]})
rec2     is,it,us.    [6,7]  [0,0]   ('(6,7)', {'A1': [0], 'A4': [0,0], 'A6': [0,1]})
rec3
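As shown above, I could not parse the 'ments' dictionary any further against the 'Ids' and 'words' columns, whose values should be repeated for every row produced by expanding 'ments' and its nested values.
The data frame structure I want for this nested JSON is shown below.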
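Records  words        Ids    unique  ments  A1 A2 A3 A4 A5 A6
rec1     A,B,C.       [0,1]  [1,1]   [0,1]  0 0 1 1 0
rec1     A,B,C.       [0,1]  [1,1]   [0,1]  1 0
rec1     D,E,F.       [2,3]  [1,1]   [2,3]  0 0 1 0
rec1     D,E,F.       [2,3]  [1,1]   [2,3]
rec2     We,us,them.  [4,5]  [0,0]   [4,5]  0 0 0 0
rec2     We,us,them.  [4,5]  [0,0]   [4,5]
rec2     is,it,us.    [6,7]  [0,0]   [6,7]  0 0 0 0
rec2     is,it,us.    [6,7]  [0,0]   [6,7]  0 1
rec3
....... more records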
I would appreciate some help.
Solution
Use apply and json_normalize:
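As a quick illustration of the second ingredient, here is a minimal sketch of what pd.json_normalize does with a list of dictionaries shaped like the values inside 'ments'. The two sample dictionaries are made up for this illustration. Keys missing from a dictionary come out as NaN, and list values stay lists, which is why a further expansion step is still needed afterwards.

```python
import pandas as pd

# Hypothetical sample: two dictionaries shaped like the values inside 'ments'.
ments_values = [
    {"A1": [0], "A2": [0, 1]},
    {"A1": [0], "A5": [0]},
]

# Each distinct key becomes a column; the lists are kept as cell values.
flat = pd.json_normalize(ments_values)
print(flat)
```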
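# Assumes pandas is imported as pd and all_data was built as in the question.
def getMents(value):
    # First element of the (key, dict) tuple: the 'ments' key, e.g. '(0,1)'.
    return value[0]

def getJson(value):
    # Second element of the tuple: the nested dictionary of lists.
    return value[1]

df = pd.DataFrame(all_data)
df['json'] = df['ments'].apply(getJson)
jsonData = pd.json_normalize(df['json'])   # one column per key (A1, A2, ...)
df['ments'] = df['ments'].apply(getMents)
for col in jsonData.columns.values:
    df[col] = jsonData[col]

# Empty frames with the same columns; copies avoid chained-assignment warnings.
new_df = df[0:0].copy()
results = df[0:0].copy()
for index, row in df.iterrows():
    # The longest list in this row decides how many output rows it needs.
    maxCount = 0
    for col in jsonData.columns.values:
        if isinstance(row[col], list):
            maxCount = max(maxCount, len(row[col]))
    for i in range(0, maxCount):
        count = len(new_df)
        new_df.loc[count] = row
        # Replace each list by its i-th element, or None if the list is shorter.
        for col in jsonData.columns.values:
            if isinstance(new_df[col][i], list):
                try:
                    new_df.loc[i, col] = new_df[col][i][i]
                except IndexError:
                    new_df.loc[i, col] = None
    results = pd.concat([results, new_df])
    new_df = df[0:0].copy()
print(results)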
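If the row-by-row loop above feels heavy, one alternative is to build the flat rows in plain Python first and only then hand them to pandas. The sketch below is not the method from the answer above. The function name expand_records and the trimmed-down sample are hypothetical, introduced only for illustration, and the code assumes every 'ments' value is a dictionary of lists, as in the question.

```python
def expand_records(records):
    """Return one flat dict per list position of every 'ments' entry."""
    rows = []
    for rec_name, rec in records.items():
        # Pair each sentence with its Ids entry and its (key, dict) item in 'ments'.
        pairs = zip(rec["words"], rec["Ids"], rec["ments"].items())
        for words, ids, (key, attrs) in pairs:
            # The longest list decides how many rows this entry produces.
            depth = max((len(values) for values in attrs.values()), default=0)
            for pos in range(depth):
                row = {
                    "records": rec_name,
                    "words": " ".join(words),
                    "Ids": ids,
                    "unique": rec["unique"],
                    "ments": key,
                }
                # Take the element at this position, or None if the list is shorter.
                for name, values in attrs.items():
                    row[name] = values[pos] if pos < len(values) else None
                rows.append(row)
    return rows


# Hypothetical, trimmed-down sample in the same shape as the question's data.
sample = {
    "rec1": {
        "words": [["A", "B", "C", "."], ["D", "E", "F", "."]],
        "Ids": [[0, 1], [2, 3]],
        "unique": [1, 1],
        "ments": {
            "(0,1)": {"A1": [0], "A2": [0, 1], "A4": [1, 0]},
            "(2,3)": {"A1": [0], "A5": [0]},
        },
    }
}

rows = expand_records(sample)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should then give one column per key (A1, A2, ...), with empty cells for keys that a given row does not have. Like the loop above, this sketch only emits as many rows per entry as its longest list, so an entry whose lists all have one element yields a single row.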