Problem description
I'm using Twython to scrape data from Twitter. That part works fine. However, to process the data further I need to save the tweet data as JSON or any other format that can be opened with Pandas.
I want to include every column from the scraped results, including language, location, retweets, etc. I know how to do this for a few columns, but I can't find information on how to include all of them.
import json
credentials = {}
credentials['CONSUMER_KEY'] = '...'
credentials['CONSUMER_SECRET'] = '...'
credentials['ACCESS_TOKEN'] = '...'
credentials['ACCESS_SECRET'] = '...'
# Save the credentials object to file
with open("twitter_credentials.json", "w") as file:
    json.dump(credentials, file)
# Import the Twython class
from twython import Twython
import json
# Load credentials from json file
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)
# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'],creds['CONSUMER_SECRET'])
python_tweets.search(q='#python',result_type='popular',count=5)
OUTPUT:
{'statuses': [{'created_at': 'Mon Dec 14 04:05:03 +0000 2020','id': 1338334158205169664,'id_str': '1338334158205169664','text': '? Hmmm...this looks right,doesn’t it? We’ll give you a hint - the result is meant to be 36!\n\nCan you find the err… ','truncated': True,'entities': {'hashtags': [],'symbols': [],'user_mentions': [],'urls': [{'url': '','expanded_url': '','display_url': 'twitter.com/i/web/status/1…','indices': [117,140]}]},'Metadata': {'result_type': 'popular','iso_language_code': 'en'},'source': '<a href=">','in_reply_to_status_id': None,'in_reply_to_status_id_str': None,'in_reply_to_user_id': None,'in_reply_to_user_id_str': None,and so on
My question is: how can I save the data I get from Twitter in JSON format so that I can later open it with Pandas? I basically just want to open it with Pandas somehow.
I tried the following code:
data = {}
data[python_tweets.search(q='#python',count=5)]
with open("twitter_new.json","w") as file:
    json.dump(data,file)

TypeError: unhashable type: 'dict'
data=python_tweets.search(q='#python',count=5)
df = pd.DataFrame(data)
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
Solution
To save the result of search(), you should simply assign it to a variable, data = ..., and save that:
data = client.search(q='#python',result_type='popular',count=5)
with open('tweets_python.json', 'w') as fh:
    json.dump(data, fh)
But this JSON has a complex structure - it has two different sub-dictionaries, data['statuses'] and data['search_metadata'], which cannot be converted together into a single DataFrame. But you probably only need the values from data['statuses'] (even without saving them to a file):
df = pd.DataFrame(data['statuses'])
print(df)
Result:
created_at id id_str ... retweeted possibly_sensitive lang
0 Sun Dec 20 15:14:21 +0000 2020 1340676922230136833 1340676922230136833 ... False False en
1 Sun Dec 20 04:12:58 +0000 2020 1340510479861616643 1340510479861616643 ... False False en
2 Sun Dec 20 15:06:34 +0000 2020 1340674963452391426 1340674963452391426 ... False False en
3 Mon Dec 14 04:05:03 +0000 2020 1338334158205169664 1338334158205169664 ... False False en
4 Mon Dec 14 21:38:14 +0000 2020 1338599202125803521 1338599202125803521 ... False False en
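If you do save the file, you can reload it later in another script and rebuild the DataFrame the same way. A minimal sketch (the small sample dict below only mimics the structure that search() returns, so it runs on its own):

```python
import json
import pandas as pd

# stand-in for what client.search() returns (structure only, not real tweets)
data = {'statuses': [{'id': 1, 'text': 'hello', 'lang': 'en'}],
        'search_metadata': {'count': 1}}

# save once ...
with open('tweets_python.json', 'w') as fh:
    json.dump(data, fh)

# ... and later, e.g. in another script, load it back
with open('tweets_python.json') as fh:
    loaded = json.load(fh)

df = pd.DataFrame(loaded['statuses'])
print(df)
```

json.dump/json.load round-trips the dict unchanged, so pd.DataFrame(loaded['statuses']) gives the same DataFrame as building it directly from the search result.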
Minimal working code I used to test it:
from twython import Twython
import json
import pandas as pd
# --- credentials ---
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)
CONSUMER_KEY = creds['CONSUMER_KEY']
CONSUMER_SECRET = creds['CONSUMER_SECRET']
#import os
#CONSUMER_KEY = os.getenv('TWITTER_KEY')
#CONSUMER_SECRET = os.getenv('TWITTER_SECRET')
# --- main ---
client = Twython(CONSUMER_KEY,CONSUMER_SECRET)
data = client.search(q='#python',count=5)
#print(data.keys()) # 'statuses','search_metadata'
# save in JSON
with open('tweets_python.json', 'w') as fh:
    json.dump(data, fh)
# use directly with pandas
df = pd.DataFrame(data['statuses'])
print(df)
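One more note on getting "all columns": pd.DataFrame(data['statuses']) keeps nested fields such as user or entities as whole dicts in a single column. If you want them flattened into their own columns, pandas.json_normalize can do that. A sketch (the sample list below only imitates the tweet structure):

```python
import pandas as pd

# stand-in for data['statuses'] with a nested 'user' dict (structure only)
statuses = [
    {'id': 1, 'text': 'hello', 'user': {'screen_name': 'alice', 'location': 'PL'}},
    {'id': 2, 'text': 'world', 'user': {'screen_name': 'bob', 'location': 'DE'}},
]

# nested dicts become dotted column names like user.screen_name
df = pd.json_normalize(statuses)
print(df.columns.tolist())
```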
By the way: if you want to keep results for many searches, your dictionary data = {} can be useful:
data = {}
data['python'] = client.search(q='#python',...)
data['php'] = client.search(q='#php',...)
data['java'] = client.search(q='#java',...)
and save them in separate JSON files:
for key, value in data.items():
    filename = f'tweets_{key}.json'
    with open(filename, 'w') as fh:
        json.dump(value, fh)
or convert them into separate DataFrames:
all_dfs = {}
for key, value in data.items():
    all_dfs[key] = pd.DataFrame(value['statuses'])

for key, df in all_dfs.items():
    print('dataframe for:', key)
    print(df)
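If you would rather work with one table instead of several, pd.concat can also merge such a dict of DataFrames, using the dict keys as an extra index level. A sketch (the sample dict below only mimics the search-result structure):

```python
import pandas as pd

# stand-in search results per hashtag (structure only, not real tweets)
data = {
    'python': {'statuses': [{'id': 1, 'text': 'a'}]},
    'php':    {'statuses': [{'id': 2, 'text': 'b'}]},
}

all_dfs = {key: pd.DataFrame(value['statuses']) for key, value in data.items()}

# dict keys become the outer level of a MultiIndex
combined = pd.concat(all_dfs)
print(combined)
```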
Minimal working code I used to test it:
from twython import Twython
import json
import pandas as pd
# --- credentials ---
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)
CONSUMER_KEY = creds['CONSUMER_KEY']
CONSUMER_SECRET = creds['CONSUMER_SECRET']

# --- main ---
client = Twython(CONSUMER_KEY, CONSUMER_SECRET)
data = {}
data['python'] = client.search(q='#python',count=5)
data['php'] = client.search(q='#php',count=5)
data['java'] = client.search(q='#java',count=5)
# save in JSON
for key, value in data.items():
    filename = f'tweets_{key}.json'
    print('saving', filename)
    with open(filename, 'w') as fh:
        json.dump(value, fh)
# use directly with pandas
all_dfs = {}
for key, value in data.items():
    all_dfs[key] = pd.DataFrame(value['statuses'])

for key, df in all_dfs.items():
    print('dataframe for:', key)
    print(df)