An efficient way to create NumPy arrays from binary files

Suggested answer

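A few hints:

  • Don't use the struct module. Instead, use NumPy's structured data types and fromfile. See: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files

  • You can read all of the records at once by passing an appropriate count= to fromfile.

Something like this (untested, but you get the idea):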

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NOTE: count can be omitted; fromfile then just reads to the end of the file

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()
x_series = data['samples'][:, :, 1].ravel()
y_series = data['samples'][:, :, 2].ravel()
z_series = data['samples'][:, :, 3].ravel()
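To make the idea concrete, here is a minimal self-contained sketch that writes a tiny synthetic file in the layout described in this post and reads it back with a structured dtype. The record counts, the blank header, and the temporary file path are made-up values for the demo, not taken from the real data files.

```python
import os
import tempfile

import numpy as np

HEADER_SIZE = 149        # header length from the file layout in this post
samples_per_record = 3   # made-up tiny value for the demo
number_of_records = 2    # made-up tiny value for the demo

# One record = a 4-byte little-endian timestamp followed by
# samples_per_record rows of four 2-byte little-endian integers.
record_dtype = np.dtype([
    ("timestamp", "<i4"),
    ("samples", "<i2", (samples_per_record, 4)),
])

# Build synthetic records so the result can be checked by eye.
records = np.zeros(number_of_records, dtype=record_dtype)
records["timestamp"] = [100, 200]
records["samples"] = np.arange(
    number_of_records * samples_per_record * 4, dtype="<i2"
).reshape(number_of_records, samples_per_record, 4)

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b" " * HEADER_SIZE)   # placeholder header (assumption)
    f.write(records.tobytes())

with open(path, "rb") as f:
    f.read(HEADER_SIZE)           # skip past the header
    data = np.fromfile(f, dtype=record_dtype, count=number_of_records)

# data["samples"] has shape (records, samples_per_record, 4);
# the last axis selects the data stream.
t_series = data["samples"][:, :, 0].ravel()
x_series = data["samples"][:, :, 1].ravel()
```

The structured dtype is packed by default, so its itemsize (28 bytes here) matches the on-disk record size exactly, which is what lets fromfile map the file straight into records.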

Original question

I have very large datasets that are stored in binary files on a hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

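Each record has 122,880 samples and each file has 713 records, which yields a total size of 700,910,521 bytes. The sample rate and the number of records sometimes vary, so my code has to detect both values for each file.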
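As a quick check of these numbers, the arithmetic can be sketched as follows. The constant names are only illustrative; the byte widths come from the file layout above.

```python
HEADER_BYTES = 149        # ASCII header
TIMESTAMP_BYTES = 4       # one 4-byte timestamp per record
STREAMS = 4               # four data streams per sample
BYTES_PER_VALUE = 2       # each stream value is a 2-byte integer

samples_per_record = 122880
records_per_file = 713

# Size of one record and of the whole file.
record_size = TIMESTAMP_BYTES + samples_per_record * STREAMS * BYTES_PER_VALUE
file_size = HEADER_BYTES + records_per_file * record_size

# Going the other way: recover both counts from the sizes alone.
detected_records = (file_size - HEADER_BYTES) // record_size
detected_samples = (record_size - TIMESTAMP_BYTES) // (STREAMS * BYTES_PER_VALUE)

print(record_size, file_size)
```

This is the same relationship the import code relies on when it derives the record count and sample rate from the file size and the record size stored in the header.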

Currently, the code I use to import this data into arrays looks like this:

from time import clock
from numpy import zeros,int16,int32,hstack,array,savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
  input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

  time_stamp = array( unpack( '<l',input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ),dtype = int32 )
  unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h',input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] )

  record_t = zeros(sample_rate,dtype=int16)
  record_x = zeros(sample_rate,dtype=int16)
  record_y = zeros(sample_rate,dtype=int16)
  record_z = zeros(sample_rate,dtype=int16)

  for sample in xrange(sample_rate):

    record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
    record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
    record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
    record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

  time_series = hstack ( ( time_series,time_stamp ) )
  t_series = hstack ( ( t_series,record_t ) )
  x_series = hstack ( ( x_series,record_x ) )
  y_series = hstack ( ( y_series,record_y ) )
  z_series = hstack ( ( z_series,record_z ) )

savez(output_file,t=t_series,x=x_series,y=y_series,z=z_series,time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way to do this?
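For reference, the inner per-sample loop in the code above only de-interleaves the four streams of one record. A minimal sketch, using a made-up 3-sample record, shows that a NumPy reshape gives the same result without a Python-level loop:

```python
import struct

import numpy as np

samples_per_record = 3   # made-up tiny value for the demo

# Twelve little-endian int16 values: 3 samples x 4 interleaved streams.
raw = struct.pack("<12h", *range(12))

# Loop approach, as in the question: unpack, then copy element by element.
unpacked = struct.unpack("<%dh" % (samples_per_record * 4), raw)
loop_t = np.zeros(samples_per_record, dtype=np.int16)
for sample in range(samples_per_record):
    loop_t[sample] = unpacked[sample * 4 + 0]

# Vectorized approach: view the bytes as int16 and reshape so that
# each row is one sample and each column is one stream.
block = np.frombuffer(raw, dtype="<i2").reshape(samples_per_record, 4)
vec_t = block[:, 0]

print(loop_t, vec_t)
```

With 122,880 samples in each of 713 records, the loop version runs its body tens of millions of times in the interpreter, whereas the reshape only reinterprets memory that is already loaded.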

Final solution

Using NumPy's fromfile with a custom dtype cut the runtime to 9 seconds, roughly 27 times faster than the original code above. The final code is below.

from numpy import savez,dtype,fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

record_dtype = dtype( [ ( 'timestamp','<i4' ),( 'samples','<i2',( sample_rate,4 ) ) ] )

data = fromfile(openfile,dtype = record_dtype,count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file,t=t_series,x=x_series,y=y_series,z=z_series,fid=time_series)

end_time = clock()

print 'It took',(end_time - start_time),'seconds'
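The code in this post is Python 2 (print statement, time.clock, true-division-as-integer). A rough Python 3 sketch of the same approach might look like the following. The helper name load_records, the synthetic test file, and the way the demo header is filled in are assumptions for illustration; only the 149-byte header and the record size at header bytes 23 to 31 come from the post.

```python
import os
import tempfile

import numpy as np

HEADER_SIZE = 149  # header length from the file layout in this post


def load_records(path):
    """Hypothetical helper: read all records of one file into a structured array."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
        # The post reads the record size from header bytes 23..31.
        record_size = int(header[23:31].decode("ascii"))
        # Python 3 needs // here; / would produce floats.
        number_of_records = (file_size - HEADER_SIZE) // record_size
        samples_per_record = (record_size - 4) // 4 // 2
        record_dtype = np.dtype([
            ("timestamp", "<i4"),
            ("samples", "<i2", (samples_per_record, 4)),
        ])
        return np.fromfile(f, dtype=record_dtype, count=number_of_records)


# --- demo on a tiny synthetic file (all values below are made up) ---
demo_samples = 5
demo_record_size = 4 + demo_samples * 4 * 2
demo_header = bytearray(b" " * HEADER_SIZE)
demo_header[23:31] = b"%8d" % demo_record_size

demo_dtype = np.dtype([("timestamp", "<i4"), ("samples", "<i2", (demo_samples, 4))])
demo = np.zeros(3, dtype=demo_dtype)
demo["timestamp"] = [10, 20, 30]
demo["samples"] = np.arange(60, dtype="<i2").reshape(3, demo_samples, 4)

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "demo.bin")
with open(in_path, "wb") as f:
    f.write(bytes(demo_header))
    f.write(demo.tobytes())

data = load_records(in_path)
z_series = data["samples"][:, :, 3].ravel()

# Save every series, mirroring the savez call in the post.
out_path = os.path.join(tmp_dir, "demo.npz")
np.savez(
    out_path,
    t=data["samples"][:, :, 0].ravel(),
    x=data["samples"][:, :, 1].ravel(),
    y=data["samples"][:, :, 2].ravel(),
    z=z_series,
    fid=data["timestamp"],
)
```

For timing in Python 3, time.perf_counter() is the usual replacement for time.clock(), which was removed in Python 3.8.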