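Problem description
For space and security reasons, I am trying to switch from JSON files with the structure below to some binary format for storing the data: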
{
    "Name.Of.Var.1": [
        {
            "data_time": "time_stamp",
            "value": #,
            "packet_time": "time_stamp"
        },
        {
            "data_time": "time_stamp",
            .
            .
            .
    ],
    .
    .
    .
}
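To make the shape concrete, here is a minimal sketch that flattens a small sample with the structure above into one row per sample, with the timestamps parsed into real datetime columns. The sample values, the ISO timestamp format, and the `variable` column name are assumptions for illustration only; the real files may differ.

```python
import json

import pandas as pd

# Made-up sample in the same shape as the structure above. The values and
# the ISO timestamp format are assumptions, not taken from the real files.
raw = json.loads("""
{
  "Name.Of.Var.1": [
    {"data_time": "2020-10-15T23:47:29.815", "value": 7.1, "packet_time": "2020-10-15T23:47:30.001"},
    {"data_time": "2020-10-15T23:47:30.815", "value": 7.3, "packet_time": "2020-10-15T23:47:31.002"}
  ],
  "Name.Of.Var.2": [
    {"data_time": "2020-10-15T23:47:29.900", "value": 1500.0, "packet_time": "2020-10-15T23:47:30.001"}
  ]
}
""")

# One row per sample; the variable name becomes an ordinary column
# ("variable" is a hypothetical column name chosen for this sketch).
records = [
    {"variable": name, **sample}
    for name, samples in raw.items()
    for sample in samples
]
flat = pd.DataFrame.from_records(records)

# Parse the timestamp strings into datetime columns.
for col in ("data_time", "packet_time"):
    flat[col] = pd.to_datetime(flat[col])

print(flat.dtypes)
print(flat)
```

In this layout every cell holds a scalar rather than a dict, so the timestamps can be sorted and compared directly.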
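I used pandas to create a DataFrame and converted it to 3 different file formats: hdf, feather, and parquet. After looking at some benchmarks, I am leaning toward parquet, because I need to keep the data for a long time.
My current goal is to plot the data I have. I am running into a timestamp problem in data_time with parquet. The data spans about 14 hours, which is what both the feather and hdf5 files show, but for some reason the parquet time span is only about 40 minutes.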
# import modin.pandas as pd
import pandas as pd
import matplotlib.pyplot as plt
import dateutil.parser as dtp
# df = pd.read_json('/home/$USER/git-projects/ap_renewables/vestas-simulator-setup/network_sniffer/json_data/20210115/15-01-2021_13-59-42-343227_1.json',orient='index').transpose()
# df.to_hdf('/home/$USER/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.hdf',key='df',mode='w')
# df.to_parquet('/home/$USER/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.parquet')
# df.to_feather('/home/$USER/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.feather')
# print(df)
def checkIfDuplicates(listofElems):
    ''' Check if given list contains any duplicates '''
    i = 0
    for elem in listofElems:
        i += 1
        if listofElems.count(elem) > 1:
            print(f'Dupl. Elemets: {elem} at i {i}')
            return True
    return False
# power = [val['value'] for val in df['turbine.Grid.Production.Power.Actual'].dropna()]
print('============ PARQUET ============')
df1 = pd.read_parquet('/home/$USER/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.parquet')
time1 = [dtp.parse(val['data_time']) for val in df1['turbine.Ambient.WindSpeed.Actual'].dropna()]
time1_sorted = sorted(time1)
# with open('t1_sorted.txt','w') as f:
#     for item in sorted(time1):
#         f.write(item.strftime('%Y-%m-%dT%H:%M:%S.%f') + '\n')
print(f'Length: {len(time1_sorted)}')
print(f'Start: {time1_sorted[0]}')
print(f'End: {time1_sorted[-1]}')
print(f'Max: {max(time1_sorted)}')
print(f'Time Span: {time1_sorted[-1] - time1_sorted[0]}')
print(f'Duplicates: {checkIfDuplicates(time1_sorted)}')
print(time1_sorted[0],time1_sorted[1])
print('============ FEATHER ============')
df2 = pd.read_feather('/home/$USER/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.feather')
time2 = [dtp.parse(val['data_time']) for val in df2['turbine.Ambient.WindSpeed.Actual'].dropna()]
time2_sorted = sorted(time2)
# with open('t2_sorted.txt','w') as f:
#     for item in sorted(time2):
#         f.write(item.strftime('%Y-%m-%dT%H:%M:%S.%f') + '\n')
print(f'Length: {len(time2_sorted)}')
print(f'Start: {time2_sorted[0]}')
print(f'End: {time2_sorted[-1]}')
print(f'Time Span: {time2_sorted[-1] - time2_sorted[0]}')
print(f'Duplicates: {checkIfDuplicates(time2_sorted)}')
print(time2_sorted[7623],time2_sorted[7624])
# wind = [val['value'] for val in df['turbine.Ambient.WindSpeed.Actual'].dropna()]
# print(type(time))
# print(type(wind))
# sorting check data
# parquet fails it,16 values out of place
# need to sort by `data_time`
# counter = 0
# checker = 0
# while counter < len(time)-1:
#     if time[counter+1] < time[counter]:
#         checker += 1
#     counter += 1
# print(checker)
# print(len(time))
# print(len(wind))
# # print (f'Power: {len(power)}')
# print(f'Wind: {len(wind)}')
# plt.plot(time,wind,'.',markersize=2)
# plt.xlabel('Time')
# plt.ylabel('Wind Speed (m/s)')
# plt.title('HDF FILE')
# plt.show()
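The commented-out ordering check at the end of the script can be written more compactly. This is a minimal sketch on made-up values; `count_out_of_order` is a hypothetical helper name, not part of the script above.

```python
def count_out_of_order(values):
    """Count adjacent pairs where the later item is smaller than the earlier one."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


# Made-up values: 2 follows 3 and 4 follows 5, so two pairs are out of place.
sample = [1, 3, 2, 5, 4]
print(count_out_of_order(sample))
```

The same function works on a list of datetime objects, such as the unsorted time1 list, since it only relies on the `<` comparison.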
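After some investigation comparing feather and parquet, it appears that the timestamps in parquet are not sorted, and that both lists contain duplicates. My test results are below: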
============ PARQUET ============
Length: 17191
Start: 2020-10-15 23:47:29.815000
End: 2020-10-16 00:30:14.688000
Max: 2020-10-16 00:30:14.688000
Time Span: 0:42:44.873000
Dupl. Elemets: 2020-10-15 23:47:29.815000 at i 1
Duplicates: True
2020-10-15 23:47:29.815000 2020-10-15 23:47:29.815000
============ FEATHER ============
Length: 17191
Start: 2020-10-15 23:47:29.815000
End: 2020-10-16 14:15:20.849000
Time Span: 14:27:51.034000
Dupl. Elemets: 2020-10-16 05:11:19.913000 at i 7624
Duplicates: True
2020-10-16 05:11:19.913000 2020-10-16 05:11:19.913000
I dumped both lists into files and compared them: in t1_sorted.txt many of the values are repeated 17 times, and a lot of the data from t2_sorted.txt is missing.
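The repetition counts can also be checked without comparing text files by hand. Below is a minimal sketch using `collections.Counter` on a made-up list; `count_duplicates` is a hypothetical helper, and the timestamps are illustrative only.

```python
from collections import Counter
from datetime import datetime


def count_duplicates(values):
    """Return {value: occurrences} for every value that appears more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up timestamps: the first one appears three times.
sample = [
    datetime(2020, 10, 15, 23, 47, 29, 815000),
    datetime(2020, 10, 15, 23, 47, 29, 815000),
    datetime(2020, 10, 15, 23, 47, 29, 815000),
    datetime(2020, 10, 16, 0, 30, 14, 688000),
]
dupes = count_duplicates(sample)
print(dupes)
```

Unlike calling `list.count` inside a loop, this makes a single pass over the list and reports how often each duplicated value occurs, not only whether a duplicate exists.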
Both the feather and parquet files are based on the same DataFrame.
df = pd.read_json('/home/ystanev/git-projects/ap_renewables/vestas-simulator-setup/network_sniffer/json_data/20210115/15-01-2021_13-59-42-343227_1.json',orient='index').transpose()
df.to_hdf('/home/ystanev/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.hdf',key='df',mode='w')
df.to_parquet('/home/ystanev/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.parquet')
df.to_feather('/home/ystanev/Desktop/ap_stuff/data_formats/15-01-2021_13-59-42-343227_1.feather')
I am not sure what is causing this. Thanks for your help.
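One way to narrow this down might be to compare the data_time values extracted from the in-memory DataFrame with those extracted from the file read back, and report the first position where they differ. This is a minimal sketch on made-up lists; `first_mismatch` is a hypothetical helper, and it assumes the values themselves are never None.

```python
from itertools import zip_longest


def first_mismatch(expected, actual):
    """Return (index, expected_item, actual_item) for the first difference, or None.

    If one sequence is shorter, its missing items are reported as None.
    """
    pairs = zip_longest(expected, actual, fillvalue=None)
    for i, (e, a) in enumerate(pairs):
        if e != a:
            return i, e, a
    return None


# Made-up values standing in for timestamps before and after a round trip.
before = ["23:47:29", "23:47:30", "23:47:31"]
after = ["23:47:29", "23:47:29", "23:47:31"]
print(first_mismatch(before, after))
```

Applied to the unsorted lists from the original DataFrame and from each file, this would show whether the values already differ at write time or only after reading back.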