How do I handle large amounts of data in TensorFlow?

For my project I have a large amount of data, about 60 GB, spread across npy files. Each file is about 1 GB and contains about 750k records and labels.

Each record is 345 float32 values and each label is 5 float32 values.
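
As a quick sanity check on these numbers, the per-file size can be estimated from the record layout. The sketch below assumes exactly 750,000 records per file (the question only says "about 750k") and 4 bytes per float32, and it ignores the small npy header; the variable names are illustrative.

```python
# Back-of-the-envelope size of one file.
RECORDS_PER_FILE = 750_000   # assumed exact value for "about 750k"
FEATURES = 345               # float32 values per record
LABELS = 5                   # float32 values per label
BYTES_PER_FLOAT32 = 4

bytes_per_record = (FEATURES + LABELS) * BYTES_PER_FLOAT32
bytes_per_file = RECORDS_PER_FILE * bytes_per_record

print(bytes_per_record)      # bytes for one record plus its label
print(bytes_per_file / 1e9)  # size of one file in decimal GB
```

This gives 1,400 bytes per record and about 1.05 GB per file, which is consistent with the roughly 1 GB per file stated above.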

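I have also read the TensorFlow Dataset documentation and the queue/thread documentation, but I can't figure out how best to handle the input for training, and then how to save the model and weights for future prediction.

My model is quite simple; it looks like this: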

x = tf.placeholder(tf.float32, [None, 345], name='x')
y = tf.placeholder(tf.float32, [None, 5], name='y')
wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(x, wi) + bi)
wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)
loss = tf.reduce_mean(tf.squared_difference(y, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)

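The way I train the network is to read the files one at a time in random order, index each file with a shuffled numpy array, and build each batch by hand to feed train_op through feed_dict. From everything I have read this is very inefficient, and I should somehow replace it with datasets or with queues and threads, but as I said, the documentation was no help.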
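
For concreteness, a minimal sketch of the batching scheme described above, for a single file. It is not the asker's actual code: iterate_batches is a hypothetical helper, and the small random arrays stand in for one loaded file with the shapes from the question.

```python
import numpy as np

def iterate_batches(data, labels, batch_size, rng):
    """Yield (data, labels) mini-batches from one file in shuffled order."""
    order = rng.permutation(len(data))  # shuffled row indices
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

# Tiny stand-in arrays: 10 records with 345 inputs and 5 labels each.
rng = np.random.RandomState(0)
data = rng.rand(10, 345).astype(np.float32)
labels = rng.rand(10, 5).astype(np.float32)

batches = list(iterate_batches(data, labels, batch_size=4, rng=rng))
# Each (batch_x, batch_y) pair would then be passed to the session as
# feed_dict={x: batch_x, y: batch_y}.
```

Every batch is copied from numpy into the graph on each step, which is the overhead an input pipeline is meant to remove.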

So, what is the best way to handle large amounts of data in TensorFlow?

Also, for reference, my data was saved to each numpy file in two steps:

with open('datafile1.npy', 'wb') as fp:
    np.save(fp, data)    # np.save expects the file first, then the array
    np.save(fp, labels)
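
Because both arrays go into the same file handle, they have to be read back in the same order with two consecutive np.load calls on one open handle. A minimal sketch, using a temporary directory and small stand-in arrays rather than the real data:

```python
import os
import tempfile
import numpy as np

data = np.arange(6, dtype=np.float32).reshape(2, 3)    # stand-in for the records
labels = np.arange(4, dtype=np.float32).reshape(2, 2)  # stand-in for the labels

path = os.path.join(tempfile.mkdtemp(), 'datafile1.npy')

# Write both arrays back to back; np.save takes the file first, then the array.
with open(path, 'wb') as fp:
    np.save(fp, data)
    np.save(fp, labels)

# Read them back in the same order from a single open handle.
with open(path, 'rb') as fp:
    loaded_data = np.load(fp)
    loaded_labels = np.load(fp)
```

Note that each np.load call here materialises a full array in memory, so one 1 GB file is loaded in its entirety before any batch can be formed.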

Answer:

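The utilities for npy files do indeed allocate the whole array in memory. I suggest you convert all your numpy arrays to the TFRecords format and use those files for training. This is one of the most efficient ways to read a large dataset in TensorFlow.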

Converting to TFRecords

import tensorflow as tf

def array_to_tfrecords(X, y, output_file):
  # Write one Example per record so that each one can later be parsed
  # as a fixed-length (345,) feature vector and (5,) label.
  writer = tf.python_io.TFRecordWriter(output_file)
  for record, label in zip(X, y):
    feature = {
      'X': tf.train.Feature(float_list=tf.train.FloatList(value=record.flatten())),
      'y': tf.train.Feature(float_list=tf.train.FloatList(value=label.flatten()))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())
  writer.close()

A complete example that deals with images can be found here.

Reading with TFRecordDataset

def parse_proto(example_proto):
  features = {
    'X': tf.FixedLenFeature((345,), tf.float32),
    'y': tf.FixedLenFeature((5,), tf.float32),
  }
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features['X'], parsed_features['y']

def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
  # In TensorFlow 1.4 and later this class is available as tf.data.TFRecordDataset.
  dataset = tf.contrib.data.TFRecordDataset(file_names)
  dataset = dataset.map(parse_proto)
  dataset = dataset.shuffle(buffer_size)
  dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  # Return an iterator that is bound to this dataset. Iterator.from_structure
  # alone would still need iterator.make_initializer(dataset) to be run.
  return dataset.make_one_shot_iterator()

The Dataset guide can be found here.
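
One thing to keep in mind about shuffle(buffer_size) is that it only shuffles within a sliding buffer. With buffer_size=10000 and roughly 750k records per file, records therefore stay fairly close to their original order. The following is a rough pure-Python model of that behaviour (fill a buffer, emit a random element, replace it with the next input element). It is an illustration under that assumption, not TensorFlow's actual implementation, and buffered_shuffle is a hypothetical helper name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer.

    Assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]              # emit a random buffered element
        buffer[pos] = item             # and replace it with the new one
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

shuffled = list(buffered_shuffle(range(1000), buffer_size=10))
# Largest number of positions by which any element moved ahead of its
# original place in the stream.
max_early = max(i - pos for pos, i in enumerate(shuffled))
```

In this model no element can appear more than buffer_size - 1 positions ahead of where it started, so it may be worth choosing a larger buffer or also shuffling the order of the file names.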
