PyTorch: dropout(?) causes different model convergence for training+validation vs. training alone

Problem description

We are facing a very strange problem. We tested exactly the same model in two different "execution" settings. In the first case, given a certain number of epochs, we train for one epoch on mini-batches, then test on the validation set following the same criterion, and then move on to the next epoch. Obviously, before each training epoch we call model.train(), and before validation we switch to model.eval().

Then we take exactly the same model (same init, same datasets, same epochs, etc.) and simply train it without any validation after each epoch.

Looking at the training-set performance alone, we found that, even though we fixed all the seeds, the two training processes evolve differently and produce quite different metrics (loss, accuracy, etc.). Specifically, the training-only run performs worse.

We also observed the following:

  • this is not a reproducibility issue, because multiple executions of the same procedure produce exactly the same results (which is expected);
  • removing the dropout, the problem appears to vanish;
  • the BatchNorm1d layers, which also behave differently between training and evaluation, seem to work properly;
  • the problem still occurs if we move the training from TPU to CPU. We have tried PyTorch 1.6, PyTorch nightly, and XLA 1.6.

We lost an entire day on this issue (and no, we cannot avoid using dropout). Does anyone have any idea how to deal with this?

Thank you very much!

P.S. Here is the code used for training (on CPU).

import time
import random

import numpy as np
import torch
import torch.nn as nn


def sigmoid(x):
    # plain logistic function; the model outputs raw logits (BCEWithLogitsLoss is used below)
    return 1 / (1 + torch.exp(-x))


def _run(model,EPOCHS,training_data_in,validation_data_in=None):
    
    def train_fn(train_dataloader,model,optimizer,criterion):

        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.
        
        model.train()

        for batch_idx,(ecg,spo2,labels) in enumerate(train_dataloader,1):

            optimizer.zero_grad() 
                
            outputs = model(ecg)

            loss = criterion(outputs,labels)
                        
            loss.backward() # calculate the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(),0.5)
            optimizer.step() # update the network weights
                                                
            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data)) # here determining the sigmoid,not included in the model
            
            running_accuracy += (predicted == labels).sum().item() / labels.size(0)   
            
            fp = ((predicted - labels) == 1.).sum().item() 
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn
            
        retval = {'loss':running_loss / batch_idx,'accuracy':running_accuracy / batch_idx,'tp':running_tp,'tn':running_tn,'fp':running_fp,'fn':running_fn
                }
            
        return retval
            

        
    def valid_fn(valid_dataloader,criterion):

        running_loss = 0.
        running_accuracy = 0.
        running_tp = 0.
        running_tn = 0.
        running_fp = 0.
        running_fn = 0.

        model.eval()
        
        for batch_idx,(ecg,spo2,labels) in enumerate(valid_dataloader,1):

            outputs = model(ecg)

            loss = criterion(outputs,labels)
            
            running_loss += loss.item()
            predicted = torch.round(sigmoid(outputs.data)) # here determining the sigmoid,not included in the model

            running_accuracy += (predicted == labels).sum().item() / labels.size(0)  
            
            fp = ((predicted - labels) == 1.).sum().item()
            fn = ((predicted - labels) == -1.).sum().item()
            tp = ((predicted + labels) == 2.).sum().item()
            tn = ((predicted + labels) == 0.).sum().item()
            running_tp += tp
            running_fp += fp
            running_tn += tn
            running_fn += fn
            
        retval = {'loss':running_loss / batch_idx,'accuracy':running_accuracy / batch_idx,'tp':running_tp,'tn':running_tn,'fp':running_fp,'fn':running_fn
                }
            
        return retval
    
    
    
    # Defining data loaders

    train_dataloader = torch.utils.data.DataLoader(training_data_in,batch_size=BATCH_SIZE,shuffle=True,num_workers=1)
    
    if validation_data_in is not None:
        validation_dataloader = torch.utils.data.DataLoader(validation_data_in,batch_size=BATCH_SIZE,shuffle=False,num_workers=1)


    # Defining the loss function
    criterion = nn.BCEWithLogitsLoss()
    
    
    # Defining the optimizer
    import torch.optim as optim
    optimizer = optim.AdamW(model.parameters(),lr=3e-4,amsgrad=False,eps=1e-07) 


    # Training code
    
    metrics_history = {"loss":[],"accuracy":[],"precision":[],"recall":[],"f1":[],"specificity":[],"accuracy_bis":[],"tp":[],"tn":[],"fp":[],"fn":[],"val_loss":[],"val_accuracy":[],"val_precision":[],"val_recall":[],"val_f1":[],"val_specificity":[],"val_accuracy_bis":[],"val_tp":[],"val_tn":[],"val_fp":[],"val_fn":[],}
    
    train_begin = time.time()
    for epoch in range(EPOCHS):
        start = time.time()

        print("EPOCH:",epoch+1)

        train_metrics = train_fn(train_dataloader=train_dataloader,model=model,optimizer=optimizer,criterion=criterion)
        
        metrics_history["loss"].append(train_metrics["loss"])
        metrics_history["accuracy"].append(train_metrics["accuracy"])
        metrics_history["tp"].append(train_metrics["tp"])
        metrics_history["tn"].append(train_metrics["tn"])
        metrics_history["fp"].append(train_metrics["fp"])
        metrics_history["fn"].append(train_metrics["fn"])
        
        precision = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fp"]) if train_metrics["tp"] > 0 else 0
        recall = train_metrics["tp"] / (train_metrics["tp"] + train_metrics["fn"]) if train_metrics["tp"] > 0 else 0
        specificity = train_metrics["tn"] / (train_metrics["tn"] + train_metrics["fp"]) if train_metrics["tn"] > 0 else 0
        f1 = 2*precision*recall / (precision + recall) if precision*recall > 0 else 0
        metrics_history["precision"].append(precision)
        metrics_history["recall"].append(recall)
        metrics_history["f1"].append(f1)
        metrics_history["specificity"].append(specificity)
        
        
        
        if validation_data_in is not None:
            # Calculate the metrics on the validation data,in the same way as done for training
            with torch.no_grad(): # don't keep track of the info necessary to calculate the gradients

                val_metrics = valid_fn(valid_dataloader=validation_dataloader,criterion=criterion)

                metrics_history["val_loss"].append(val_metrics["loss"])
                metrics_history["val_accuracy"].append(val_metrics["accuracy"])
                metrics_history["val_tp"].append(val_metrics["tp"])
                metrics_history["val_tn"].append(val_metrics["tn"])
                metrics_history["val_fp"].append(val_metrics["fp"])
                metrics_history["val_fn"].append(val_metrics["fn"])

                val_precision = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fp"]) if val_metrics["tp"] > 0 else 0
                val_recall = val_metrics["tp"] / (val_metrics["tp"] + val_metrics["fn"]) if val_metrics["tp"] > 0 else 0
                val_specificity = val_metrics["tn"] / (val_metrics["tn"] + val_metrics["fp"]) if val_metrics["tn"] > 0 else 0
                val_f1 = 2*val_precision*val_recall / (val_precision + val_recall) if val_precision*val_recall > 0 else 0
                metrics_history["val_precision"].append(val_precision)
                metrics_history["val_recall"].append(val_recall)
                metrics_history["val_f1"].append(val_f1)
                metrics_history["val_specificity"].append(val_specificity)


            print("  > Training/validation loss:",round(train_metrics['loss'],4),round(val_metrics['loss'],4))
            print("  > Training/validation accuracy:",round(train_metrics['accuracy'],round(val_metrics['accuracy'],4))
            print("  > Training/validation precision:",round(precision,round(val_precision,4))
            print("  > Training/validation recall:",round(recall,round(val_recall,4))
            print("  > Training/validation f1:",round(f1,round(val_f1,4))
            print("  > Training/validation specificity:",round(specificity,round(val_specificity,4))
        else:
            print("  > Training loss:",4))
            print("  > Training accuracy:",4))
            print("  > Training precision:",4))
            print("  > Training recall:",4))
            print("  > Training f1:",4))
            print("  > Training specificity:",4))


        print("Completed in:",round(time.time() - start,1),"seconds \n")

    print("Training completed in:",round((time.time()- train_begin)/60,"minutes")    

    
    
    # Save the model weights
    torch.save(model.state_dict(),'./nnet_model.pt')
    
    
    # Save the metrics history
    torch.save(metrics_history,'training_history')

Below is the function that initializes the model and sets the seeds; it is called before each execution of the "_run" code:

def reinit_model():
    # fix all relevant seeds so that model initialization is reproducible
    torch.manual_seed(42)
    np.random.seed(42)
    random.seed(42)
    net = Net() # the model
    return net
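
For completeness, the two execution settings compared above are invoked roughly like this (EPOCHS and the dataset variables are hypothetical placeholders; Net and BATCH_SIZE are defined elsewhere in the project):

EPOCHS = 10  # hypothetical value

# Setting 1: train + validate every epoch
model = reinit_model()
_run(model, EPOCHS, training_data_in=train_dataset, validation_data_in=valid_dataset)

# Setting 2: identical init and data, but no validation pass
model = reinit_model()
_run(model, EPOCHS, training_data_in=train_dataset)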

Solution

OK, I found the problem. The issue comes from the fact that, apparently, running the evaluation changes some random state, and this affects the training phase.
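
One way to confirm this diagnosis is to snapshot the global CPU RNG state around suspicious calls: if the state changes, the next training epoch will draw a different dropout-mask sequence than in the training-only run. The helper below is my own illustration, not part of the original code. Note that eval-mode dropout itself does not touch the RNG, so the culprit is elsewhere in the validation pass (a common one is the DataLoader, which draws a base seed from the global generator when it spawns workers):

import torch

def advances_rng(fn):
    """Return True if calling fn() advances the global CPU RNG state."""
    before = torch.get_rng_state()
    fn()
    return not torch.equal(before, torch.get_rng_state())

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(4)
drop.train()
print(advances_rng(lambda: drop(x)))  # True: train-mode dropout consumes random numbers
drop.eval()
print(advances_rng(lambda: drop(x)))  # False: eval-mode dropout is a no-op for the RNG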

The solution is as follows:

  • At the beginning of the function "_run()", set all the seed states to the desired value, e.g. 42, then save these seed states to disk.
  • At the beginning of the function "train_fn()", read the seed states back from disk and set them.
  • At the end of the function "train_fn()", save the seed states to disk (a CPU sketch of this scheme follows the list).
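
On CPU, a minimal sketch of this save/restore scheme uses torch.get_rng_state / torch.set_rng_state; the file name is arbitrary and the "..." placeholders stand for the existing code shown above:

RNG_FILE = 'rng_state.pt'  # arbitrary file name

def _run(model, EPOCHS, training_data_in, validation_data_in=None):
    torch.manual_seed(42)                        # set the desired seed state...
    torch.save(torch.get_rng_state(), RNG_FILE)  # ...and persist it to disk

    def train_fn(train_dataloader, model, optimizer, criterion):
        torch.set_rng_state(torch.load(RNG_FILE))    # restore the training RNG stream
        ...                                          # training loop as above
        torch.save(torch.get_rng_state(), RNG_FILE)  # save the state for the next epoch
    ...

This way the training epochs consume exactly the same RNG stream whether or not a validation pass runs in between.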

For example, when running on TPU with XLA, the following instructions must be used:

  • at the beginning of the function "_run()": xm.set_rng_state(42) and xm.save(xm.get_rng_state(),'xm_seed')
  • at the beginning of the function "train_fn()": xm.set_rng_state(torch.load('xm_seed'),device=device) (you can also print the state here for verification: xm.master_print(xm.get_rng_state()))
  • at the end of the function "train_fn()": xm.save(xm.get_rng_state(),'xm_seed') (a consolidated sketch follows the list)
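
Put together, the XLA version sketched by the bullets above would look roughly like this (I assume device comes from xm.xla_device(); xm is torch_xla.core.xla_model, and "..." stands for the existing code):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # assumed: the current TPU device

def _run(model, EPOCHS, training_data_in, validation_data_in=None):
    xm.set_rng_state(42)                     # reset the XLA RNG to a known state
    xm.save(xm.get_rng_state(), 'xm_seed')   # persist it for train_fn

    def train_fn(train_dataloader, model, optimizer, criterion):
        xm.set_rng_state(torch.load('xm_seed'), device=device)  # restore training stream
        xm.master_print(xm.get_rng_state())  # optional: print the state to verify
        ...                                  # training loop as above
        xm.save(xm.get_rng_state(), 'xm_seed')  # save for the next epoch
    ...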
