Gradient accumulation in an RNN

Question

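While running a large RNN I ran into memory issues (GPU), but I want to keep my batch size reasonable, so I wanted to try gradient accumulation. In a network that predicts its output in one go this seems self-evident, but in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended. I started from user albanD's excellent examples here, but I think they should be modified when using an RNN. The reason I think so is that you accumulate many more gradients, because you do multiple forward passes per sequence.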

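My current implementation looks like this. It also allows for AMP in PyTorch 1.6, which seems to matter, because everything needs to be called in the right place. Note that this is just an abstract version; it may look like a lot of code, but it is mostly comments.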

def train(epochs):
    """Main training loop. Loops for `epochs` number of epochs. Calls `process`."""
    for epoch in range(1, epochs + 1):
        train_loss = process("train")
        valid_loss = process("valid")
        # ... check whether we improved over earlier epochs
        if lr_scheduler:
            lr_scheduler.step(valid_loss)
        
def process(do):
    """Do a single epoch run through the DataLoader of the training or validation set.
       Also takes care of optimizing the model after every `gradient_accumulation_steps` steps.
       Calls `step` for each batch where it gets the loss from."""
    if do == "train":
        model.train()
        torch.set_grad_enabled(True)
    else:
        model.eval()
        torch.set_grad_enabled(False)

    loss = 0.
    for batch_idx, batch in enumerate(DataLoaders[do]):
        step_loss, avg_step_loss = step(batch)
        loss += avg_step_loss

        if do == "train":
            if amp:
                scaler.scale(step_loss).backward()

                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    # Unscales the gradients of optimizer's assigned params in-place
                    scaler.unscale_(optimizer)
                    # clip in-place
                    clip_grad_norm_(model.parameters(), 2.0)
                    scaler.step(optimizer)
                    scaler.update()
                    model.zero_grad()
            else:
                step_loss.backward()
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    clip_grad_norm_(model.parameters(), 2.0)
                    optimizer.step()
                    model.zero_grad()

    # return average loss
    return loss / len(DataLoaders[do])

def step(batch):
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # do stuff... init hidden state and first input etc.
    loss = torch.tensor([0.]).to(device)

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # overwrite previous decoder_hidden
            output, decoder_hidden = model(decoder_input, decoder_hidden)

            # compute loss between predicted classes (bs x classes) and correct classes for _this word_
            item_loss = criterion(output, target_tensor[i])

            # We calculate the gradients for the average step so that when
            # we do take an optimizer.step, it takes into account the mean step_loss
            # across batches. So basically (A+B+C)/3 = A/3 + B/3 + C/3
            loss += (item_loss / gradient_accumulation_steps)

        topv, topi = output.topk(1)
        decoder_input = topi.detach()

    return loss, loss.item() / target_len

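The above does not seem to work as I had hoped, i.e. it still runs into out-of-memory issues very quickly. I think the reason is that step already accumulates so much information, but I am not sure.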

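Answer

For simplicity I will only cover gradient accumulation with amp enabled; without amp the idea is the same. The step you presented runs under amp, so let's stick to that.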

step

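The PyTorch documentation about amp has an example of gradient accumulation. You should do it inside step. Each time you run loss.backward(), the gradients are accumulated in the leaf tensors, which can then be optimized by the optimizer. Hence, your step should look like this (see the comments):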

def step(batch):
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # You should not accumulate the loss on the GPU; CPU and RAM are better for that.
    # Use the GPU only for calculations, not for gathering metrics etc.
    loss = 0

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # Where does decoder_input come from?
            # I assume there is one in the real code
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            # Here you divide by accumulation steps
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        scaler.scale(item_loss).backward()
        # Caveat: if decoder_hidden still references the previous step's graph,
        # the next backward() will likely fail unless the hidden state is
        # detached as well or retain_graph=True is passed.
        loss += item_loss.detach().item()

        # Not sure what topv was for here
        _, topi = output.topk(1)
        decoder_input = topi.detach()

    # No need to return the graph-attached loss now, as we did backward above
    return loss / target_len

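Anyway, you detach decoder_input (so it is like a brand new input without history, and the parameters will be optimized based on that, not based on all runs), so there is no need for backward in process. Also, you probably don't need decoder_hidden; if it isn't passed to the network, a torch.tensor filled with zeros is passed implicitly.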

We should also divide by gradient_accumulation_steps * target_len, as that is how many backward calls we will run before a single optimization step.
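The scaling factor can be checked with plain arithmetic. The sketch below is only an illustration and not the model from the question: it assumes a hypothetical one-parameter model `w * x` with a squared-error loss, and it assumes every sequence has the same `target_len`. Under those assumptions, summing per-item gradients that were each divided by `accumulation_steps * target_len` gives the same value as the gradient of the loss averaged over all items.

```python
# Toy model: prediction = w * x, per-item loss = (w * x - y) ** 2.
# All names here are hypothetical; PyTorch is not needed for the arithmetic.

def item_grad(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_grad(w, batches, accumulation_steps):
    """Sum per-item gradients, each scaled by 1 / (accumulation_steps * target_len)."""
    grad = 0.0  # plays the role of param.grad between two optimizer steps
    for sequence in batches[:accumulation_steps]:
        target_len = len(sequence)
        for x, y in sequence:
            grad += item_grad(w, x, y) / (accumulation_steps * target_len)
    return grad


def mean_loss_grad(w, batches, accumulation_steps):
    """Gradient of the loss averaged over every item in the accumulated batches."""
    items = [pair for sequence in batches[:accumulation_steps] for pair in sequence]
    return sum(item_grad(w, x, y) for x, y in items) / len(items)


# Two "batches", each a sequence of three (input, target) pairs.
batches = [
    [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)],
    [(0.5, 1.0), (1.5, 2.0), (2.5, 4.0)],
]
w = 0.7
acc = accumulated_grad(w, batches, accumulation_steps=2)
full = mean_loss_grad(w, batches, accumulation_steps=2)
print(abs(acc - full) < 1e-12)
```

If the sequences had different lengths, the two quantities would no longer coincide exactly, because each sequence would then be weighted by its own `target_len`.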

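As some of your variables are not well defined, I assume you only sketched out what is going on.

Also, if you want the history to be preserved, you should not detach decoder_input. In that case it would look like this: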

def step(batch):
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    loss = 0

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        _, topi = output.topk(1)
        decoder_input = topi

        loss += item_loss
    # A single backward over the whole sequence, so the graph of every step is kept until here
    scaler.scale(loss).backward()
    return loss.detach().cpu() / target_len

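This effectively backpropagates through the RNN multiple times and will probably raise an OOM error; I am not sure what you are after here. If that is the case, there is not much you can do AFAIK, as the RNN computation is simply too long to fit on the GPU.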
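The difference between cutting the history and keeping it can be seen on a toy recurrence. The sketch below is an assumption-laden illustration, not the model from the question: it uses a hypothetical scalar recurrence `h_t = tanh(w * h_{t-1} + x_t)` with a squared-error loss and computes the gradient by hand. With `detach_hidden=True` the previous state is treated as a constant at every step (the analogue of detaching), so only the current step contributes; with `detach_hidden=False` the dependence on `w` is carried through the whole sequence.

```python
import math


def run(w, xs, ys, detach_hidden):
    """Return (loss, dloss/dw) for h_t = tanh(w * h_{t-1} + x_t)."""
    h, dh_dw = 0.0, 0.0
    loss, grad = 0.0, 0.0
    for x, y in zip(xs, ys):
        if detach_hidden:
            dh_dw = 0.0  # forget how the previous state depended on w
        h_prev = h
        h = math.tanh(w * h_prev + x)
        # Chain rule: direct dependence on w plus dependence through h_prev
        dh_dw = (1.0 - h * h) * (h_prev + w * dh_dw)
        loss += (h - y) ** 2
        grad += 2.0 * (h - y) * dh_dw
    return loss, grad


xs = [0.5, -0.3, 0.8]
ys = [0.2, 0.1, -0.4]
w = 0.9

_, full_grad = run(w, xs, ys, detach_hidden=False)
_, truncated_grad = run(w, xs, ys, detach_hidden=True)

# Central finite difference as an independent check of the full gradient
eps = 1e-6
numeric_grad = (run(w + eps, xs, ys, False)[0] - run(w - eps, xs, ys, False)[0]) / (2 * eps)

print(full_grad, truncated_grad, numeric_grad)
```

The full gradient should agree with the finite-difference estimate, while the truncated one differs because it ignores the earlier steps. The truncated variant only ever needs the current step in memory, which is the trade-off discussed above.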

process

Only the relevant part of this code is shown, so it would be:

loss = 0.0
for batch_idx, batch in enumerate(DataLoaders[do]):
    # Here everything is detached from the graph, so we're safe
    avg_step_loss = step(batch)
    loss += avg_step_loss

    if do == "train":
        if (batch_idx + 1) % gradient_accumulation_steps == 0:
            # You can use unscale_ as in the example in PyTorch's docs,
            # just like you did
            scaler.unscale_(optimizer)
            # clip in-place
            clip_grad_norm_(model.parameters(), 2.0)
            scaler.step(optimizer)
            scaler.update()
            # IMO optimizer.zero_grad is more readable in this case,
            # but that's nitpicking
            optimizer.zero_grad()

# return average loss
return loss / len(DataLoaders[do])
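One detail of the `(batch_idx + 1) % gradient_accumulation_steps == 0` condition is worth checking for a concrete epoch length. The sketch below uses hypothetical helper names and plain Python to list the batch indices after which the optimizer would step, and to count the trailing batches that never trigger a step when the number of batches is not a multiple of the accumulation steps.

```python
def optimizer_step_indices(num_batches, accumulation_steps):
    """Batch indices after which the optimizer would step."""
    return [
        batch_idx
        for batch_idx in range(num_batches)
        if (batch_idx + 1) % accumulation_steps == 0
    ]


def leftover_batches(num_batches, accumulation_steps):
    """Trailing batches of the epoch that never trigger an optimizer step."""
    return num_batches % accumulation_steps


print(optimizer_step_indices(10, 4))  # [3, 7]
print(leftover_batches(10, 4))        # 2
```

With the loop as shown, the gradients of those trailing batches would stay in the parameters' `.grad` fields, since `zero_grad` is only called right after an optimizer step. Whether to step on them, drop them, or let them carry over is a design choice the original code leaves open.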

Points raised in the question

"[...] in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended."

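It does not matter. For each forward pass you should usually do one backward pass (which seems to be the case here; see step for the possible options). After that, we (usually) don't need the loss to stay connected to the graph, as we have already performed backpropagation, obtained the gradients and are ready to optimize the parameters.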

"This loss needs to have history, as it goes back to the process loop."

There is no need to call backward in process as presented.