How do I read AWS CloudTrail JSON logs into a pandas DataFrame?

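Problem description

I ran into a problem: I was loading data into pandas in JupyterLab (running under Anaconda3) when my VM suddenly crashed. After restarting it, I found that my code no longer works for some reason. This is my code: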

awsc = pd.DataFrame()
json_pattern = os.path.join('logs_old/AWSCloudtrailLog/','*')
file_list = glob.glob(json_pattern)
for file in file_list:
    data = pd.read_json(file,lines=True)
    awsc = awsc.append(data,ignore_index = True)
awsc = pd.concat([awsc,pd.json_normalize(awsc['userIdentity'])],axis=1).drop('userIdentity',1)
awsc.rename(columns={'type':'userIdentity_type','principalId':'userIdentity_principalId','arn':'userIdentity_arn','accountId':'userIdentity_accountId','accessKeyId':'userIdentity_accessKeyId','userName':'userIdentity_userName',},inplace=True)

When I run the code, it gives me a KeyError like this:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/environment/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self,key,method,tolerance)
   2888             try:
-> 2889                 return self._engine.get_loc(casted_key)
   2890             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'userIdentity'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-9-efd1d2e600a5> in <module>
      1 # unpack nested json
      2 
----> 3 awsc = pd.concat([awsc,pd.json_normalize(awsc['userIdentity'])],axis=1).drop('userIdentity',1)
      4 awsc.rename(columns={'type':'userIdentity_type',
      5                      'principalId':'userIdentity_principalId',

~/anaconda3/envs/environment/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self,key)
   2900             if self.columns.nlevels > 1:
   2901                 return self._getitem_multilevel(key)
-> 2902             indexer = self.columns.get_loc(key)
   2903             if is_integer(indexer):
   2904                 indexer = [indexer]

~/anaconda3/envs/environment/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self,key,method,tolerance)
   2889                 return self._engine.get_loc(casted_key)
   2890             except KeyError as err:
-> 2891                 raise KeyError(key) from err
   2892 
   2893         if tolerance is not None:

KeyError: 'userIdentity'

When I run print(awsc.info()), the output for the dataframe awsc is:

 <class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrameNone

Is there a way to fix this? Does the problem come from pandas or from Anaconda?

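Solution

Using the code from the OP

  • The way the dataframe is created is not correct, which is why awsc is empty.
  • Without seeing the files, there is no way to know whether pd.read_json(file,lines=True) is the right way to read them.
  • pd.json_normalize(awsc['userIdentity']) works on a column of dicts. However, it is quite possible that the column holds strings.
    • If the dicts are of str type, use ast.literal_eval to convert them to dict type.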
import glob
import os
from ast import literal_eval

import pandas as pd

# collect the log files (same pattern as in the question)
file_list = glob.glob(os.path.join('logs_old/AWSCloudtrailLog/', '*'))

# create a list to add dataframes to
awsc_list = list()

# iterate through the list of files and append each dataframe to awsc_list
for file in file_list:
    awsc_list.append(pd.read_json(file, lines=True))

# concat the files into a single dataframe
awsc = pd.concat(awsc_list).reset_index(drop=True)

# convert the userIdentity column to dict type, but only where it contains str type
awsc['userIdentity'] = awsc['userIdentity'].apply(lambda v: literal_eval(v) if isinstance(v, str) else v)

# normalize userIdentity
normalized = pd.json_normalize(awsc['userIdentity'], sep='_')

# join awsc with normalized and drop the userIdentity column
# (the positional axis argument, drop('userIdentity', 1), was removed in pandas 2.0)
awsc = awsc.join(normalized).drop(columns='userIdentity')

# rename the columns
awsc.rename(columns={
    'type': 'userIdentity_type',
    'principalId': 'userIdentity_principalId',
    'arn': 'userIdentity_arn',
    'accountId': 'userIdentity_accountId',
    'accessKeyId': 'userIdentity_accessKeyId',
    'userName': 'userIdentity_userName',
}, inplace=True)
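
The info() output in the question shows a dataframe with 0 entries, so the KeyError on 'userIdentity' probably means nothing was loaded at all. One possible cause, which is only an assumption because the files are not shown, is that the glob pattern matched no files (for example, a relative path that no longer resolves after the restart). A minimal sketch of a loader that fails early in that case; load_json_lines is a hypothetical helper name, not part of pandas:

```python
import glob
import os

import pandas as pd


def load_json_lines(directory, pattern="*"):
    """Read every matching JSON-lines file into one DataFrame.

    Hypothetical helper for illustration; it assumes each file holds
    one JSON object per line, as pd.read_json(..., lines=True) expects.
    """
    full_pattern = os.path.join(directory, pattern)
    file_list = sorted(glob.glob(full_pattern))

    # Fail loudly instead of silently producing an empty DataFrame.
    if not file_list:
        raise FileNotFoundError(f"no files match {full_pattern!r}")

    frames = [pd.read_json(path, lines=True) for path in file_list]
    return pd.concat(frames, ignore_index=True)
```

Calling load_json_lines('logs_old/AWSCloudtrailLog/') would then either return the combined dataframe or raise immediately, which makes a wrong working directory easy to spot.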

New code with sample data

import json
import pandas as pd

# create a list to add dataframes to
awsc_list = list()

# list of files
files_list = ['test.json', 'test2.json']

# read the files
for file in files_list:
    with open(file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # normalize the file and append it to the list of dataframes
    awsc_list.append(pd.json_normalize(data, 'Records', sep='_'))

# concat the files into a single dataframe
awsc = pd.concat(awsc_list).reset_index(drop=True)

# display(awsc)
  eventVersion             eventTime        eventSource       eventName  awsRegion  sourceIPAddress                                                                                 userAgent userIdentity_type userIdentity_principalId                      userIdentity_arn userIdentity_accessKeyId userIdentity_accountId userIdentity_userName requestParameters_instancesSet_items                                                                                                 responseElements_instancesSet_items requestParameters_force userIdentity_sessionContext_attributes_mfaAuthenticated userIdentity_sessionContext_attributes_creationDate requestParameters_keyName responseElements_keyName                              responseElements_keyFingerprint responseElements_keyMaterial
0          1.0  2014-03-06T21:22:54Z  ec2.amazonaws.com  StartInstances  us-east-2  205.251.233.176                                                                    ec2-api-tools 1.6.12.2           IAMUser          EX_PRINCIPAL_ID  arn:aws:iam::123456789012:user/Alice           EXAMPLE_KEY_ID           123456789012                 Alice       [{'instanceId': 'i-ebeaf9e2'}]    [{'instanceId': 'i-ebeaf9e2','currentState': {'code': 0,'name': 'pending'},'previousState': {'code': 80,'name': 'stopped'}}]                     NaN                                                     NaN                                                 NaN                       NaN                      NaN                                                          NaN                          NaN
1          1.0  2014-03-06T21:01:59Z  ec2.amazonaws.com   StopInstances  us-east-2  205.251.233.176                                                                    ec2-api-tools 1.6.12.2           IAMUser          EX_PRINCIPAL_ID  arn:aws:iam::123456789012:user/Alice           EXAMPLE_KEY_ID           123456789012                 Alice       [{'instanceId': 'i-ebeaf9e2'}]  [{'instanceId': 'i-ebeaf9e2','currentState': {'code': 64,'name': 'stopping'},'previousState': {'code': 16,'name': 'running'}}]                   False                                                     NaN                                                 NaN                       NaN                      NaN                                                          NaN                          NaN
2          1.0  2014-03-06T17:10:34Z  ec2.amazonaws.com   CreateKeyPair  us-east-2     72.21.198.64  EC2ConsoleBackend,aws-sdk-java/Linux/x.xx.fleetxen Java_HotSpot(TM)_64-Bit_Server_VM/xx           IAMUser          EX_PRINCIPAL_ID  arn:aws:iam::123456789012:user/Alice           EXAMPLE_KEY_ID           123456789012                 Alice                                  NaN                                                                                                                                 NaN                     NaN                                                   false                                2014-03-06T15:15:06Z                 mykeypair                mykeypair  30:1d:46:d0:5b:ad:7e:1b:b6:70:62:8b:ff:38:b5:e9:ab:5d:b8:21       <sensitiveDataRemoved>
3          1.0  2014-03-06T21:22:54Z  ec2.amazonaws.com  StartInstances  us-east-2  205.251.233.176                                                                    ec2-api-tools 1.6.12.2           IAMUser          EX_PRINCIPAL_ID  arn:aws:iam::123456789012:user/Alice           EXAMPLE_KEY_ID           123456789012                 Alice       [{'instanceId': 'i-ebeaf9e2'}]    [{'instanceId': 'i-ebeaf9e2','currentState': {'code': 0,'name': 'pending'},'previousState': {'code': 80,'name': 'stopped'}}]                     NaN                                                     NaN                                                 NaN                       NaN                      NaN                                                          NaN                          NaN
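
The same pd.json_normalize call works for both file shapes used here: a single object with a "Records" key, and a list of such objects. A small in-memory sketch of that behaviour; the records are trimmed-down, hypothetical stand-ins for the sample files, and the variable names are illustrative only:

```python
import pandas as pd

# A single object with a "Records" key (the shape of test2.json).
single = {
    "Records": [
        {"eventName": "StartInstances",
         "userIdentity": {"type": "IAMUser", "userName": "Alice"}}
    ]
}

# A list of such objects (the shape of test.json).
wrapped = [
    single,
    {
        "Records": [
            {"eventName": "StopInstances",
             "userIdentity": {"type": "IAMUser", "userName": "Alice"},
             "requestParameters": {"force": False}}
        ]
    },
]

# record_path="Records" pulls the events out of each wrapper, and
# sep="_" joins nested keys, e.g. userIdentity_userName.
flat_single = pd.json_normalize(single, record_path="Records", sep="_")
flat_list = pd.json_normalize(wrapped, record_path="Records", sep="_")

print(flat_single.columns.tolist())
print(flat_list[["eventName", "userIdentity_userName"]])
```

Keys that appear in only some events, such as requestParameters_force here, end up as NaN in the rows that lack them, which is why the output above contains so many NaN cells.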

Sample data

test.json

  • A list of JSON objects
[{
        "Records": [{
                "eventVersion": "1.0",
                "userIdentity": {
                    "type": "IAMUser",
                    "principalId": "EX_PRINCIPAL_ID",
                    "arn": "arn:aws:iam::123456789012:user/Alice",
                    "accessKeyId": "EXAMPLE_KEY_ID",
                    "accountId": "123456789012",
                    "userName": "Alice"
                },
                "eventTime": "2014-03-06T21:22:54Z",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "StartInstances",
                "awsRegion": "us-east-2",
                "sourceIPAddress": "205.251.233.176",
                "userAgent": "ec2-api-tools 1.6.12.2",
                "requestParameters": {
                    "instancesSet": {
                        "items": [{"instanceId": "i-ebeaf9e2"}]
                    }
                },
                "responseElements": {
                    "instancesSet": {
                        "items": [{
                                "instanceId": "i-ebeaf9e2",
                                "currentState": {"code": 0, "name": "pending"},
                                "previousState": {"code": 80, "name": "stopped"}
                            }
                        ]
                    }
                }
            }
        ]
    }, {
        "Records": [{
                "eventVersion": "1.0",
                "userIdentity": {
                    "type": "IAMUser",
                    "principalId": "EX_PRINCIPAL_ID",
                    "arn": "arn:aws:iam::123456789012:user/Alice",
                    "accessKeyId": "EXAMPLE_KEY_ID",
                    "accountId": "123456789012",
                    "userName": "Alice"
                },
                "eventTime": "2014-03-06T21:01:59Z",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "StopInstances",
                "awsRegion": "us-east-2",
                "sourceIPAddress": "205.251.233.176",
                "userAgent": "ec2-api-tools 1.6.12.2",
                "requestParameters": {
                    "instancesSet": {
                        "items": [{"instanceId": "i-ebeaf9e2"}]
                    },
                    "force": false
                },
                "responseElements": {
                    "instancesSet": {
                        "items": [{
                                "instanceId": "i-ebeaf9e2",
                                "currentState": {"code": 64, "name": "stopping"},
                                "previousState": {"code": 16, "name": "running"}
                            }
                        ]
                    }
                }
            }
        ]
    }, {
        "Records": [{
                "eventVersion": "1.0",
                "userIdentity": {
                    "type": "IAMUser",
                    "principalId": "EX_PRINCIPAL_ID",
                    "arn": "arn:aws:iam::123456789012:user/Alice",
                    "accessKeyId": "EXAMPLE_KEY_ID",
                    "accountId": "123456789012",
                    "userName": "Alice",
                    "sessionContext": {
                        "attributes": {
                            "mfaAuthenticated": "false",
                            "creationDate": "2014-03-06T15:15:06Z"
                        }
                    }
                },
                "eventTime": "2014-03-06T17:10:34Z",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "CreateKeyPair",
                "awsRegion": "us-east-2",
                "sourceIPAddress": "72.21.198.64",
                "userAgent": "EC2ConsoleBackend,aws-sdk-java/Linux/x.xx.fleetxen Java_HotSpot(TM)_64-Bit_Server_VM/xx",
                "requestParameters": {
                    "keyName": "mykeypair"
                },
                "responseElements": {
                    "keyName": "mykeypair",
                    "keyFingerprint": "30:1d:46:d0:5b:ad:7e:1b:b6:70:62:8b:ff:38:b5:e9:ab:5d:b8:21",
                    "keyMaterial": "\u003csensitiveDataRemoved\u003e"
                }
            }
        ]
    }
]

test2.json

  • A single JSON object
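{
    "Records": [{
            "eventVersion": "1.0",
            "userIdentity": {
                "type": "IAMUser",
                "principalId": "EX_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:user/Alice",
                "accessKeyId": "EXAMPLE_KEY_ID",
                "accountId": "123456789012",
                "userName": "Alice"
            },
            "eventTime": "2014-03-06T21:22:54Z",
            "eventSource": "ec2.amazonaws.com",
            "eventName": "StartInstances",
            "awsRegion": "us-east-2",
            "sourceIPAddress": "205.251.233.176",
            "userAgent": "ec2-api-tools 1.6.12.2",
            "requestParameters": {
                "instancesSet": {
                    "items": [{"instanceId": "i-ebeaf9e2"}]
                }
            },
            "responseElements": {
                "instancesSet": {
                    "items": [{
                            "instanceId": "i-ebeaf9e2",
                            "currentState": {"code": 0, "name": "pending"},
                            "previousState": {"code": 80, "name": "stopped"}
                        }
                    ]
                }
            }
        }
    ]
}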
