Problem description
I have implemented a network in C++ on the CPU, and I am now trying to train it on the GPU using Python. The problem I am facing is that the input is very large (and sparse): there are roughly 50,000 input neurons, of which usually only about 30 are activated.
My model looks like this:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 24576)        0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 24576)        0
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 256)          6291712     input_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 256)          6291712     input_2[0][0]
__________________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU)       (None, 256)          0           dense_1[0][0]
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 256)          0           dense_2[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           leaky_re_lu_1[0][0]
                                                                 leaky_re_lu_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 32)           16416       concatenate_1[0][0]
__________________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU)       (None, 32)           0           dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 32)           1056        leaky_re_lu_3[0][0]
__________________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU)       (None, 32)           0           dense_4[0][0]
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 1)            33          leaky_re_lu_4[0][0]
==================================================================================================
Total params: 12,600,929
Trainable params: 12,600,929
Non-trainable params: 0
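For reference, here is a minimal Keras sketch that reproduces the architecture above; the optimizer, loss, and LeakyReLU slope are assumptions, only the layer sizes (and hence the parameter counts) match the summary.

from tensorflow.keras.layers import Input, Dense, LeakyReLU, Concatenate
from tensorflow.keras.models import Model

in1 = Input(shape=(24576,))        # first sparse input (input_1)
in2 = Input(shape=(24576,))        # second sparse input (input_2)
x1 = LeakyReLU()(Dense(256)(in1))  # 24576 * 256 + 256 = 6,291,712 params
x2 = LeakyReLU()(Dense(256)(in2))
x = Concatenate()([x1, x2])        # (None, 512)
x = LeakyReLU()(Dense(32)(x))      # 512 * 32 + 32 = 16,416 params
x = LeakyReLU()(Dense(32)(x))      # 32 * 32 + 32 = 1,056 params
out = Dense(1)(x)                  # 32 * 1 + 1 = 33 params
model = Model([in1, in2], out)
model.compile(optimizer="adam", loss="mse")  # optimizer and loss are assumptions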
I am also trying to feed roughly 300 million input/output pairs into my network. Needless to say, that is far too much data to fit on my GPU at once.
For speed, I generate sparse matrices, each representing about 100,000 inputs, and keep them in memory (about 50 GB in total). I can load them without losing much speed, like this:
# loads both the inputs and the outputs for the given chunk (100000 inputs/outputs) from memory
trainX1, trainX2, trainY = readNumpyChunkAndCreateInput(chunk)
I then use this to train my network, like so:
for chunk in chunks:
    trainX1, trainX2, trainY = readNumpyChunkAndCreateInput(chunk)
    _res = model.fit([trainX1, trainX2], trainY, epochs=1, steps_per_epoch=1, verbose=0)
    loss = list(_res.history.values())[0]
    totalLoss += loss[0]
Obviously this is not an optimal approach. I know there is something called data generators in Keras/TensorFlow, but sadly I don't know how to use them in my specific case, because all the tutorials deal with dense inputs.
I would be very happy if someone could help me out here!
Greetings, Finn
Edit 1
This is how I load my data:
filePath = os.path.abspath(os.path.dirname(sys.argv[0]))
path = filePath + "\\data\\" + name + "\\"
indices1 = np.load(path + 'indices1.npy')
indices2 = np.load(path + 'indices2.npy')
outputs = np.load(path + 'outputs.npy')
Meta = open(path + 'Meta.txt', "r")
MetaInf = Meta.readlines()[0].split(" ")
Meta.close()
entry1Count = int(MetaInf[0])
entry2Count = int(MetaInf[1])
lineCount = int(MetaInf[2])
values1 = tf.ones(entry1Count)
values2 = tf.ones(entry2Count)
shape = (lineCount, 6 * 64 * 64)
trainX1 = tf.SparseTensor(
    indices=indices1, values=values1, dense_shape=shape
)
trainX2 = tf.SparseTensor(
    indices=indices2, values=values2, dense_shape=shape
)
return trainX1, trainX2, outputs
Solution
I wrote a small generator function which you can adapt to your use case.
import os
import numpy as np

def gen():
    paths = os.listdir('temp_data')  # path of the directory
    for path in paths:
        file_path = os.path.join('temp_data', path)
        x = np.load(file_path)
        y = np.load(file_path)
        z = np.load(file_path)
        # Your logic
        #
        #
        #
        yield (x, y, z)  # Three tensors/numpy arrays. In your case trainX1, trainX2, outputs.
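For this particular question, the loop body can reuse the loading logic from Edit 1. A minimal sketch, assuming each chunk lives in its own sub-folder of a hypothetical data directory and uses the same file names as in the question:

import os
import numpy as np
import tensorflow as tf

def sparse_chunk_gen():
    # Hypothetical layout: data/<chunk>/indices1.npy, indices2.npy, outputs.npy, Meta.txt
    for name in os.listdir('data'):
        path = os.path.join('data', name)
        indices1 = np.load(os.path.join(path, 'indices1.npy'))
        indices2 = np.load(os.path.join(path, 'indices2.npy'))
        outputs = np.load(os.path.join(path, 'outputs.npy'))
        with open(os.path.join(path, 'Meta.txt')) as meta:
            entry1Count, entry2Count, lineCount = map(int, meta.readline().split())
        shape = (lineCount, 6 * 64 * 64)
        trainX1 = tf.SparseTensor(indices=indices1, values=tf.ones(entry1Count), dense_shape=shape)
        trainX2 = tf.SparseTensor(indices=indices2, values=tf.ones(entry2Count), dense_shape=shape)
        yield trainX1, trainX2, outputs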
The code for using the generator with tf.data.Dataset:
dataset = tf.data.Dataset.from_generator(gen, (tf.float32, tf.float32, tf.float32))
dataset = dataset.prefetch(2)
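The plain dtype tuple above is fine when the generator yields dense NumPy arrays. If it yields tf.SparseTensor objects, as in the sparse_chunk_gen sketch above, TensorFlow 2.4+ lets you describe the element structure with output_signature instead, which supports composite tensors such as SparseTensor. A sketch, where the 24576 column count comes from the question and the shape of the outputs is an assumption:

dataset = tf.data.Dataset.from_generator(
    sparse_chunk_gen,  # hypothetical generator sketched above
    output_signature=(
        tf.SparseTensorSpec(shape=(None, 6 * 64 * 64), dtype=tf.float32),
        tf.SparseTensorSpec(shape=(None, 6 * 64 * 64), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),  # assumed output shape
    ),
)
dataset = dataset.prefetch(2)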
Prefetching lets the next batch be prepared ahead of time, which removes any delay. You can pass this dataset to the fit command, or use it in a custom training loop.
epochs = 100
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x1_batch_train, x2_batch_train, y_batch_train) in enumerate(dataset):
        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = model([x1_batch_train, x2_batch_train], training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))  # assumes a batch size of 64