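Problem description
I am trying to convert a JSON file into a DataFrame using the json_normalize function.
Source JSON
The JSON is a list of dictionaries, each of which looks like this: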
{ "sport_key": "basketball_ncaab", "sport_nice": "NCAAB", "teams": [ "Bryant Bulldogs", "Wagner Seahawks" ], "commence_time": 1608152400, "home_team": "Bryant Bulldogs", "sites": [ { "site_key": "marathonbet", "site_nice": "Marathon Bet", "last_update": 1608156452, "odds": { "h2h": [ 1.28, 3.54 ] } }, { "site_key": "sport888", "site_nice": "888sport", "last_update": 1608156452, "odds": { "h2h": [ 1.13, 5.8 ] } }, { "site_key": "unibet", "site_nice": "Unibet", "last_update": 1608156434, "odds": { "h2h": [ 1.13, 5.8 ] } } ], "sites_count": 3 }
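The problem is that one of the resulting columns should contain a list (which is the intended outcome), but including that column in the meta argument of json_normalize raises the following error:
ValueError: operands could not be broadcast together with shape (22,) (11,)
The call that triggers it:
pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])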
Solution
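One plausible reading of the error, sketched with NumPy alone: the shapes (22,) and (11,) would fit 11 records with 2 teams each, if the list-valued meta entries are stacked into an (11, 2) array and then repeated once per site. Whether pandas does exactly this internally depends on the version (newer releases may accept list-valued meta directly), so treat this as an assumption. The names teams_per_record and sites_per_record are made up for the illustration.

```python
import numpy as np

# Hypothetical stand-ins: 11 records, each with 2 teams and 3 sites.
teams_per_record = [["Bryant Bulldogs", "Wagner Seahawks"]] * 11
sites_per_record = [3] * 11

# Stacking equal-length lists yields a 2-D array, not a 1-D array of lists.
stacked = np.array(teams_per_record, dtype=object)

# repeat() without an axis flattens the 22 cells, which cannot be
# matched against the 11 repeat counts.
error_message = None
try:
    stacked.repeat(sites_per_record)
except ValueError as exc:
    error_message = str(exc)

# Filling a 1-D object array element by element keeps each list intact,
# so one list is repeated per site row.
one_dim = np.empty(len(teams_per_record), dtype=object)
for i, teams in enumerate(teams_per_record):
    one_dim[i] = teams
repeated = one_dim.repeat(sites_per_record)
```

Both workarounds below sidestep the problem by never passing a list-valued field through meta.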
Assuming data is a list of dictionaries, you can still use json_normalize, but you have to assign the teams column separately for each corresponding dictionary in data:
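import pandas as pd

def normalize(d):
    # Normalize one record without 'teams', then attach the same list to every site row.
    return pd.json_normalize(d, 'sites', ['sport_key', 'sport_nice', 'home_team']) \
        .assign(teams=[d['teams']] * len(d['sites']))
df = pd.concat([normalize(d) for d in data], ignore_index=True)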
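Alternatively, you can try:
# Join each teams list into one string so it can be passed through meta, then split it back.
data = [{**d, 'teams': ','.join(d['teams'])} for d in data]
df = pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])
df['teams'] = df['teams'].str.split(',')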
Result:
      site_key     site_nice  last_update      odds.h2h         sport_key sport_nice        home_team                               teams
0  marathonbet  Marathon Bet   1608156452  [1.28, 3.54]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
1     sport888      888sport   1608156452   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
2       unibet        Unibet   1608156434   [1.13, 5.8]  basketball_ncaab      NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
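To check the two workarounds against each other, here is a self-contained sketch that runs both on a single record trimmed from the sample JSON (two sites, fewer fields). The variable names meta, joined, df_assign and df_split are introduced only for this illustration.

```python
import pandas as pd

# One record shaped like the sample above, trimmed for brevity.
data = [
    {
        "sport_key": "basketball_ncaab",
        "sport_nice": "NCAAB",
        "teams": ["Bryant Bulldogs", "Wagner Seahawks"],
        "home_team": "Bryant Bulldogs",
        "sites": [
            {"site_key": "marathonbet", "odds": {"h2h": [1.28, 3.54]}},
            {"site_key": "sport888", "odds": {"h2h": [1.13, 5.8]}},
        ],
    }
]

meta = ["sport_key", "sport_nice", "home_team"]


def normalize(d):
    # Workaround 1: leave 'teams' out of meta and attach it per record.
    df = pd.json_normalize(d, "sites", meta)
    return df.assign(teams=[d["teams"]] * len(d["sites"]))


df_assign = pd.concat([normalize(d) for d in data], ignore_index=True)

# Workaround 2: pass 'teams' through meta as a string, then split it back.
joined = [{**d, "teams": ",".join(d["teams"])} for d in data]
df_split = pd.json_normalize(joined, "sites", meta + ["teams"])
df_split["teams"] = df_split["teams"].str.split(",")

print(df_assign)
```

The string round trip in the second workaround assumes no team name contains a comma; if that cannot be guaranteed, the per-record assign is the safer of the two.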