Data preprocessing of JSON data using Python in a Jupyter notebook

Problem description

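I am trying to apply some preprocessing commands to a JSON dataset. This is easy with .csv files, but I can't figure out how to use preprocessing steps such as isnull(), fillna(), dropna(), and the imputer class on JSON data.

Below are the commands I have run so far. I could not perform the operations above because I can't work out how to handle a dataset stored in a JSON file.

Dataset link: https://drive.google.com/file/d/1puNNrRaV-Jt_kt709fuYGCvDW9-EuwoB/view?usp=sharing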

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

dataset = pd.read_json('moviereviews.json',orient='columns')
print(dataset)

movies = pd.read_json( ( dataset).to_json(),orient='index')
print(movies)
print(type(movies))

movie = pd.read_json( ( dataset['12 Strong']).to_json(),orient='index')
print(movie)

movie_name = [
    "12 Strong","A Ciambra","All The Money In The World","Along With The Gods: The Two Worlds","Bilal: A New Breed Of Hero","Call Me By Your Name","Condorito: La Película","Darkest Hour","Den Of Thieves","Downsizing","Father figures","Film Stars Don'T Die In Liverpool","Forever My Girl","Happy End","Hostiles","I,Tonya","In The Fade (Aus Dem Nichts)","Insidious: The Last Key","Jumanji: Welcome To The Jungle","Mary And The Witch'S Flower","Maze Runner: The Death Cure","Molly'S Game","Paddington 2","Padmaavat","Phantom Thread","Pitch Perfect 3","Proud Mary","Star Wars: Episode Viii - The Last Jedi","Star Wars: The Last Jedi","The Cage fighter","The Commuter","The Final Year","The Greatest Showman","The Insult (L'Insulte)","The Post","The Shape Of Water","Una Mujer Fantástica","Winchester"
]
print(movie_name)

data = []
for moviename in movie_name:
    movie = pd.read_json( ( dataset[moviename]).to_json(),orient='index')
    data.append(movie)
   
print(data)

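Solution

One of the challenges with this dataset is that it uses different key names for the same data, for example 'Tomato Score' and 'tomatoscore'. The solution below is not the best and could be optimized a lot, but I wrote it this way so that you can easily see the steps taken to make the data consistent: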

import json  # needed for json.load below
import pandas as pd

with open('moviereviews.json',"r") as read_file:
    dataset = json.load(read_file)

data = []

for index in range(len(dataset)):
    for key in dataset[index]:
        movie_name = key
        
        if 'Genre' in dataset[index][key]:
            genre = dataset[index][key]['Genre']
        else:
            genre = None
            
        if 'Gross' in dataset[index][key]:
            gross = dataset[index][key]['Gross']
        else:
            gross = None
            
        if 'IMDB Metascore' in  dataset[index][key]:
            imdb = dataset[index][key]['IMDB Metascore']            
        else:
            imdb = None
            
        if 'Popcorn Score' in dataset[index][key]:
            popcorn = dataset[index][key]['Popcorn Score']            
        elif 'popcornscore' in  dataset[index][key]:
            popcorn = dataset[index][key]['popcornscore']
        else:
            popcorn = None                                              
                                                      
        if 'Rating' in dataset[index][key]:
            rating = dataset[index][key]['Rating']                                     
        elif 'rating' in dataset[index][key]:
            rating = dataset[index][key]['rating']
        else:
            rating = None
            
        if 'Tomato Score' in dataset[index][key]:                                         
            tomato = dataset[index][key]['Tomato Score']                                       
        elif 'tomatoscore' in dataset[index][key]:
            tomato = dataset[index][key]['tomatoscore']                                              
        else:
            tomato = None
                
        data.append({'Movie Name': movie_name,'Genre': genre,'Gross': gross,'IMDB Metascore': imdb,'Popcorn Score': popcorn,'Rating': rating,'Tomato Score': tomato})
    
df = pd.DataFrame(data)

df
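
The same normalization can be sketched more compactly with a lookup table of key aliases. This is only an illustration, not a replacement for the step-by-step version: `KEY_ALIASES`, `COLUMNS`, `normalize_records`, and the two-movie `sample` are hypothetical names and data, and the sketch assumes the file has the same list-of-dicts shape that the loop above iterates over.

```python
import pandas as pd

# Map every key variant seen in the data to one canonical column name.
KEY_ALIASES = {
    "popcornscore": "Popcorn Score",
    "rating": "Rating",
    "tomatoscore": "Tomato Score",
}
COLUMNS = ["Genre", "Gross", "IMDB Metascore", "Popcorn Score", "Rating", "Tomato Score"]


def normalize_records(dataset):
    """Flatten a list of {movie name: {field: value}} dicts into uniform rows."""
    rows = []
    for block in dataset:
        for movie_name, fields in block.items():
            # Rename variant keys; keys without an alias keep their name.
            renamed = {KEY_ALIASES.get(k, k): v for k, v in fields.items()}
            row = {"Movie Name": movie_name}
            # Fields that are absent become None, as in the longer version.
            row.update({col: renamed.get(col) for col in COLUMNS})
            rows.append(row)
    return pd.DataFrame(rows)


# Tiny hypothetical sample in the assumed shape of moviereviews.json.
sample = [{
    "12 Strong": {"Genre": "Action", "Gross": "$1,465,000", "IMDB Metascore": "54",
                  "Popcorn Score": 72, "Rating": "R", "Tomato Score": 54},
    "All The Money In The World": {"popcornscore": 72, "rating": "R", "tomatoscore": 76},
}]
df_sample = normalize_records(sample)
print(df_sample)
```

With the real file, `normalize_records(dataset)` would be called on the object returned by `json.load`. Adding another key variant then only requires one more entry in `KEY_ALIASES`.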
        


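Alternatively, you can split the dictionary into its keys and values, read them separately, and replace every NaN with the string 'None' in one step.

If your loaded JSON is called data, then: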

# Wrap the dict views in list() so pandas receives plain sequences.
df = pd.DataFrame(list(data[0].values())).fillna('None')
df['Movie Name'] = list(data[0].keys())
df.set_index('Movie Name',inplace=True)

df.head()

                                         Genre       Gross IMDB Metascore Popcorn Score   Rating Tomato Score popcornscore rating tomatoscore
Movie Name
12 Strong                               Action  $1,465,000             54            72        R           54         None   None        None
A Ciambra                                Drama     unknown             70       unknown  unrated       unkown         None   None        None
All The Money In The World                None        None           None          None     None         None         72.0      R        76.0
Along With The Gods: The Two Worlds       None        None           None          None     None         None         90.0     NR        50.0
Bilal: A New Breed Of Hero           Animation     unknown             52       unknown  unrated       unkown         None   None        None
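
Once the data is in a DataFrame, the commands from the question (isnull(), fillna(), dropna()) work as they do for a .csv file. The sketch below is an assumption-laden illustration rather than the answerers' method: `df_demo` is a hypothetical three-row frame that mimics the output above, the placeholder strings 'None', 'unknown', and 'unkown' are treated as missing, and the median is just one possible fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the frame printed above (values are illustrative).
df_demo = pd.DataFrame({
    "Movie Name": ["12 Strong", "A Ciambra", "All The Money In The World"],
    "Genre": ["Action", "Drama", "None"],
    "Gross": ["$1,465,000", "unknown", "None"],
    "Tomato Score": [54, "unkown", "None"],
    "tomatoscore": ["None", "None", 76.0],
}).set_index("Movie Name")

# Turn placeholder strings (including the misspelled "unkown") into real NaN.
df_demo = df_demo.replace(["None", "unknown", "unkown"], np.nan)

# Merge the two spellings of the same column, then drop the variant.
df_demo["Tomato Score"] = df_demo["Tomato Score"].combine_first(df_demo["tomatoscore"])
df_demo = df_demo.drop(columns="tomatoscore")

# Strip "$" and "," so Gross parses as a number; unparseable values become NaN.
df_demo["Gross"] = pd.to_numeric(
    df_demo["Gross"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
df_demo["Tomato Score"] = pd.to_numeric(df_demo["Tomato Score"], errors="coerce")

# isnull(): count the missing values in each column.
missing_per_column = df_demo.isnull().sum()

# fillna(): fill per column, here a label for Genre and the median for the score.
filled = df_demo.fillna({
    "Genre": "Unknown",
    "Tomato Score": df_demo["Tomato Score"].median(),
})

# dropna(): keep only the rows that have no missing values at all.
complete_rows = df_demo.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```

Filling with the median is a simple form of imputation. scikit-learn's SimpleImputer could be applied to the numeric columns in a similar way, which is not shown here.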
