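How can I save tweet data in a JSON file?

Problem description

I am using Twython to scrape data from Twitter, and that part works. To process the data further, however, I need to save the tweets as JSON or in any other format that can be opened with Pandas.

I want to include every column of the search result, including language, location, retweets and so on. I know how to do this for a few columns, but I cannot find information on how to include all of them.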

import json
credentials = {}
credentials['CONSUMER_KEY'] = '...'
credentials['CONSUMER_SECRET'] = '...'
credentials['ACCESS_TOKEN'] = '...'
credentials['ACCESS_SECRET'] = '...'

# Save the credentials object to file
with open("twitter_credentials.json","w") as file:
    json.dump(credentials,file)

# Import the Twython class
from twython import Twython
import json

# Load credentials from json file
with open("twitter_credentials.json","r") as file:
    creds = json.load(file)

# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'],creds['CONSUMER_SECRET'])

python_tweets.search(q='#python',result_type='popular',count=5)

OUTPUT:
{'statuses': [{'created_at': 'Mon Dec 14 04:05:03 +0000 2020','id': 1338334158205169664,'id_str': '1338334158205169664','text': '?  Hmmm...this looks right,doesn’t it? We’ll give you a hint - the result is meant to be 36!\n\nCan you find the err… ','truncated': True,'entities': {'hashtags': [],'symbols': [],'user_mentions': [],'urls': [{'url': '','expanded_url': '','display_url': 'twitter.com/i/web/status/1…','indices': [117,140]}]},'metadata': {'result_type': 'popular','iso_language_code': 'en'},'source': '<a href=">','in_reply_to_status_id': None,'in_reply_to_status_id_str': None,'in_reply_to_user_id': None,'in_reply_to_user_id_str': None,and so on

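My question is: how can I save the data I get from Twitter in JSON format so that I can open it with Pandas later? Basically, I just want some way to open it with Pandas.

I tried the following code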

data= {}
data[python_tweets.search(q='#python',count=5)]
with open("twitter_new.json","w") as file:
    json.dump(data,file)

TypeError: unhashable type: 'dict'


data=python_tweets.search(q='#python',count=5)
df = pd.DataFrame(data)

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

Solution

To save the result of search(), simply assign it to a variable (data = ...) and then dump that variable to a file

data = client.search(q='#python',result_type='popular',count=5)

with open('tweets_python.json','w') as fh:
    json.dump(data,fh)
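
As a minimal sketch of why the attempt in the question failed and why plain assignment works: data[...] with a dictionary inside the brackets is a key lookup, not an assignment, and a dictionary cannot be used as a key. The example below uses a small hand-made dictionary in place of a real search() response; the sample values and the file name tweets_sample.json are assumptions made only for illustration.

```python
import json
import os
import tempfile

# Hand-made stand-in for the dict returned by search(); values are illustrative only.
sample = {
    "statuses": [
        {"id": 1, "text": "first tweet", "lang": "en"},
        {"id": 2, "text": "second tweet", "lang": "en"},
    ],
    "search_metadata": {"count": 2, "query": "%23python"},
}

# data[sample] tries to use the whole dict as a key, which raises TypeError.
data = {}
try:
    data[sample]
except TypeError as exc:
    error_message = str(exc)
    print("lookup failed:", error_message)

# Plain assignment keeps the whole response, and json.dump() can write it.
data = sample
path = os.path.join(tempfile.gettempdir(), "tweets_sample.json")
with open(path, "w") as fh:
    json.dump(data, fh)

# Reading the file back gives a dictionary equal to the one that was saved.
with open(path, "r") as fh:
    loaded = json.load(fh)

print(loaded == sample)
```

The file is written to the temporary directory only so that the sketch does not leave files next to your script; in real code any path works.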

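But this JSON has a complex structure: it contains two different sub-dictionaries, data['statuses'] and data['search_metadata'], which cannot be converted together into a single DataFrame. That is why pd.DataFrame(data) raises the ValueError. You probably need only the values from data['statuses'], and you can pass them to Pandas directly, even without saving them to a file first

df = pd.DataFrame(data['statuses'])
print(df)

Result: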

                       created_at                   id               id_str  ... retweeted  possibly_sensitive lang
0  Sun Dec 20 15:14:21 +0000 2020  1340676922230136833  1340676922230136833  ...     False               False   en
1  Sun Dec 20 04:12:58 +0000 2020  1340510479861616643  1340510479861616643  ...     False               False   en
2  Sun Dec 20 15:06:34 +0000 2020  1340674963452391426  1340674963452391426  ...     False               False   en
3  Mon Dec 14 04:05:03 +0000 2020  1338334158205169664  1338334158205169664  ...     False               False   en
4  Mon Dec 14 21:38:14 +0000 2020  1338599202125803521  1338599202125803521  ...     False               False   en
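
Some of the columns in that DataFrame (for example user, entities or metadata) still hold nested dictionaries. If you want fields such as the user's location as ordinary columns, one option is pandas.json_normalize, sketched below. This is an addition to the answer, not part of the tested code; it assumes pandas 1.0 or later, and the sample tweets are hand-made stand-ins with far fewer fields than real ones.

```python
import pandas as pd

# Hand-made stand-in for data['statuses']; real tweets have many more fields.
statuses = [
    {
        "id": 1,
        "text": "first tweet",
        "lang": "en",
        "retweet_count": 3,
        "user": {"screen_name": "alice", "location": "Berlin"},
        "metadata": {"result_type": "popular", "iso_language_code": "en"},
    },
    {
        "id": 2,
        "text": "second tweet",
        "lang": "de",
        "retweet_count": 0,
        "user": {"screen_name": "bob", "location": "Vienna"},
        "metadata": {"result_type": "recent", "iso_language_code": "de"},
    },
]

# json_normalize() expands nested dicts into dotted column names
# such as 'user.location' and 'metadata.iso_language_code'.
flat = pd.json_normalize(statuses)

print(flat.columns.tolist())
print(flat[["id", "lang", "retweet_count", "user.location"]])
```

Values that are lists (such as the hashtags inside entities) are left as lists in a single cell, so they may need separate handling.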

Minimal working code that I used to test it

from twython import Twython
import json
import pandas as pd

# --- credentials ---

with open("twitter_credentials.json","r") as file:
    creds = json.load(file)
CONSUMER_KEY    = creds['CONSUMER_KEY']
CONSUMER_SECRET = creds['CONSUMER_SECRET']

#import os
#CONSUMER_KEY    = os.getenv('TWITTER_KEY')
#CONSUMER_SECRET = os.getenv('TWITTER_SECRET')

# --- main ---

client = Twython(CONSUMER_KEY,CONSUMER_SECRET)

data = client.search(q='#python',count=5)

#print(data.keys())  # 'statuses','search_metadata'

# save in JSON

with open('tweets_python.json','w') as fh:
    json.dump(data,fh)

# use directly with pandas

df = pd.DataFrame(data['statuses'])

print(df)
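
To open the saved file with Pandas later, for example in a separate script, you can read it back with json.load and again pass only the 'statuses' list to DataFrame. The sketch below first writes a small hand-made file so that it runs on its own; the file name and sample values are assumptions. It also converts created_at to real timestamps, assuming the column uses the layout shown in the output above.

```python
import json
import os
import tempfile

import pandas as pd

# Hand-made stand-in for a saved search() response; values are illustrative only.
sample = {
    "statuses": [
        {"created_at": "Mon Dec 14 04:05:03 +0000 2020", "id": 1,
         "text": "first tweet", "lang": "en"},
        {"created_at": "Sun Dec 20 15:14:21 +0000 2020", "id": 2,
         "text": "second tweet", "lang": "en"},
    ],
    "search_metadata": {"count": 2},
}

path = os.path.join(tempfile.gettempdir(), "tweets_sample_reload.json")
with open(path, "w") as fh:
    json.dump(sample, fh)

# Later, possibly in another script: read the file and keep only the tweets.
with open(path, "r") as fh:
    loaded = json.load(fh)

df = pd.DataFrame(loaded["statuses"])

# Parse strings such as 'Mon Dec 14 04:05:03 +0000 2020' into timestamps.
df["created_at"] = pd.to_datetime(df["created_at"], format="%a %b %d %H:%M:%S %z %Y")

print(df.dtypes)
print(df)
```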

By the way:

Your dictionary data = {} can be useful if you want to keep the results of several searches

data = {}

data['python'] = client.search(q='#python',...)
data['php']    = client.search(q='#php',...)
data['java']   = client.search(q='#java',...)

and then save each result in a separate JSON file

for key,value in data.items():
    filename = f'tweets_{key}.json'
    with open(filename,'w') as fh:
        json.dump(value,fh)

or load them into separate DataFrames

all_dfs = {}

for key,value in data.items():
    all_dfs[key] = pd.DataFrame(value['statuses'])

for key,df in all_dfs.items():
    print('dataframe for:',key)
    print(df)
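
If you would rather work with one table than with several, the separate DataFrames can also be stacked with pandas.concat, which accepts a dictionary and uses its keys as an outer index level. This is a sketch added for illustration, not part of the tested code; the sample responses and the index level names "query" and "row" are assumptions.

```python
import pandas as pd

# Hand-made stand-ins for several search() responses; values are illustrative only.
data = {
    "python": {
        "statuses": [{"id": 1, "text": "a", "lang": "en"}],
        "search_metadata": {},
    },
    "php": {
        "statuses": [
            {"id": 2, "text": "b", "lang": "en"},
            {"id": 3, "text": "c", "lang": "de"},
        ],
        "search_metadata": {},
    },
}

all_dfs = {key: pd.DataFrame(value["statuses"]) for key, value in data.items()}

# The dict keys become the outer index level; names= labels both index levels.
combined = pd.concat(all_dfs, names=["query", "row"])

print(combined)

# Select the rows that came from one search term.
print(combined.loc["php"])
```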

Minimal working code that I used to test it

from twython import Twython
import json
import pandas as pd

# --- credentials ---

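with open("twitter_credentials.json","r") as file:
    creds = json.load(file)
CONSUMER_KEY    = creds['CONSUMER_KEY']
CONSUMER_SECRET = creds['CONSUMER_SECRET']

# --- main ---

client = Twython(CONSUMER_KEY,CONSUMER_SECRET)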

data = {}

data['python'] = client.search(q='#python',count=5)
data['php']    = client.search(q='#php',count=5)
data['java']   = client.search(q='#java',count=5)

# save in JSON

for key,value in data.items():
    filename = f'tweets_{key}.json'
    print('saving',filename)

    with open(filename,'w') as fh:
        json.dump(value,fh)

# use directly with pandas

all_dfs = {}

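for key,value in data.items():
    all_dfs[key] = pd.DataFrame(value['statuses'])

for key,df in all_dfs.items():
    print('dataframe for:',key)
    print(df)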