Understanding a gbm survival prediction model

Problem description

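I am new to using and understanding ML methods, and I am currently using the gbm package in R for survival analysis.

I am having trouble understanding some of the output of the survival prediction model. I have looked at this tutorial and this post, but I still struggle to interpret what the model returns.

Here is my code, based on example data: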

rm(list=ls(all=TRUE))
library(randomForestSRC)
library(gbm)
library(survival)
library(Hmisc)

data(pbc,package="randomForestSRC")
data <- na.omit(pbc)

set.seed(9512)
train <- sample(1:nrow(data),round(nrow(data)*0.7))
data.train <- data[train,]
data.test <- data[-train,]

set.seed(9741)
model <- gbm(Surv(days, status) ~ ., data = data.train, distribution = "coxph",
             interaction.depth = 2, shrinkage = 0.01, n.trees = 500, cv.folds = 5)

summary(model)

best.iter <- gbm.perf(model,plot.it = TRUE,method = 'cv',overlay = TRUE) #to get the optimal number of Boosting iterations
best.iter

# Use the best number of trees to produce a predicted value for each observation in newdata.
# predict() returns a vector of predictions f(x); for coxph these are on the log hazard scale by default.
# The proportional hazards model assumes h(t|x) = lambda(t) * exp(f(x)),
# so the predictions estimate the f(x) component of the hazard function.
pred.train <- predict(object=model,newdata=data.train,n.trees = best.iter)
pred.test <- predict(object=model,newdata=data.test,n.trees = best.iter)


# training set
Hmisc::rcorr.cens(-pred.train, Surv(data.train$days, data.train$status))
# test (validation) set
Hmisc::rcorr.cens(-pred.test, Surv(data.test$days, data.test$status))

# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t = data.train$days,       # The survival times
                           delta = data.train$status, # The censoring indicator
                           f.x = pred.train,          # The predicted values of the regression model on the log hazard scale
                           t.eval = data.train$days,  # Values at which the baseline hazard will be evaluated
                           cumulative = TRUE,         # If TRUE the cumulative baseline hazard is returned
                           smooth = FALSE)            # If TRUE basehaz.gbm smooths the estimate using Friedman's super smoother supsmu

basehaz.cum

# Estimated survival probability of each training subject, evaluated at that subject's own observed time:
surv.rate <- exp(-exp(pred.train)*basehaz.cum)
surv.rate

res_train <- data.train
# predicted outcome for train set
res_train$pred <- pred.train
res_train$survival_rate <- surv.rate
res_train


# Estimate the cumulative baseline hazard function, this time from the test data
basehaz.cum <- basehaz.gbm(t = data.test$days,        # The survival times
                           delta = data.test$status,  # The censoring indicator
                           f.x = pred.test,           # The predicted values of the regression model on the log hazard scale
                           t.eval = data.test$days,   # Values at which the baseline hazard will be evaluated
                           cumulative = TRUE,         # If TRUE the cumulative baseline hazard is returned
                           smooth = FALSE)            # If TRUE basehaz.gbm smooths the estimate using Friedman's super smoother supsmu

basehaz.cum
# Estimated survival probability of each test subject, evaluated at that subject's own observed time:
surv.rate <- exp(-exp(pred.test)*basehaz.cum)
surv.rate

res_test <- data.test
# predicted outcome for test set
res_test$pred <- pred.test
res_test$survival_rate <- surv.rate
res_test

#--------------------------------------------------
#Estimate survival rate at time of interest

# Specify time of interest
time.interest <- sort(unique(data.train$days[data.train$status==1]))

# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t = data.train$days,       # The survival times
                           delta = data.train$status, # The censoring indicator
                           f.x = pred.train,          # The predicted values of the regression model on the log hazard scale
                           t.eval = time.interest,    # Values at which the baseline hazard will be evaluated
                           cumulative = TRUE,         # If TRUE the cumulative baseline hazard is returned
                           smooth = FALSE)            # If TRUE basehaz.gbm smooths the estimate using Friedman's super smoother supsmu


# For an individual i in the test set (here the first one), the estimated survival function over time.interest is:
surf.i <- exp(-exp(pred.test[1])*basehaz.cum) #survival rate

# Estimated survival probability of every test subject at one specified time:
specif.time <- time.interest[10]
surv.rate <- exp(-exp(pred.test)*basehaz.cum[10])
cat("Survival Rate of all at time",specif.time,"\n")
print(surv.rate)

The output returned by the predict function represents the f(x) component of the hazard function (h(t|x) = lambda(t) * exp(f(x))).

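My questions:

• I am somewhat confused about whether a hazard ratio can be calculated here.

• How can I divide the population into low-risk and high-risk groups? Can I rely on the estimated f(x) component of the hazard function to build a scoring system from the training set? My goal is a scoring system for which I can show KM plots of the low-risk and high-risk groups, for both the training and test sets.

• How can I build a calibration plot of observed versus predicted survival for the training and test sets?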

Solution

Amer, thank you for reading my tutorial!

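As you mentioned, "the output returned by the predict function represents the f(x) component of the hazard function (h(t|x) = lambda(t) * exp(f(x)))", so it may help to understand the hazard function h(t|x) first.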

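Before that, please make sure you have a basic knowledge of survival analysis. If not, I recommend reading this post, which I think will help with your questions.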
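For reference, the relationships implied by the proportional hazards form can be written out as below. This is only a sketch in the question's notation; the symbol \Lambda_0(t) for the cumulative baseline hazard is introduced here for illustration and corresponds to what basehaz.gbm returns with cumulative = TRUE.

```latex
\begin{align*}
h(t \mid x) &= \lambda(t)\, e^{f(x)} \\
\frac{h(t \mid x_1)}{h(t \mid x_2)} &= e^{f(x_1) - f(x_2)} \\
S(t \mid x) &= \exp\!\bigl(-\Lambda_0(t)\, e^{f(x)}\bigr),
\qquad \Lambda_0(t) = \int_0^t \lambda(u)\, du
\end{align*}
```

The last line is the same expression the question's code evaluates as exp(-exp(pred) * basehaz.cum).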

Back to your questions:

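Exactly. Calling the predict function gives the hazard ratio on the log scale, because h(t|x) = lambda(t) * exp(f(x)). The hazard ratio relative to the baseline is therefore exp(f(x)), and the hazard ratio between two individuals is exp(f(x1) - f(x2)).

Of course! Based on the hazard ratio values, we can divide the population into low-risk and high-risk groups. For example, you can use the median hazard ratio as the cutoff. I think the cutoff should be derived from the training set and then applied to the test set. If your model works, the KM curves of the low-risk and high-risk groups will differ significantly (as assessed with the log-rank test).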
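A minimal sketch of the median-split idea, assuming the survival package is available. So that it runs on its own, the data are simulated: the score column stands in for pred.train / pred.test from the question, and the names simulate, label_risk, train and test are hypothetical helpers, not part of gbm.

```r
library(survival)

set.seed(1)
# Hypothetical stand-in data: in the question, score would be the output of
# predict() (log hazard scale) and days/status the observed outcome.
simulate <- function(n) {
  score <- rnorm(n)                           # plays the role of f(x)
  time  <- rexp(n, rate = 0.01 * exp(score))  # higher score -> shorter survival
  cens  <- rexp(n, rate = 0.005)              # independent censoring times
  data.frame(score  = score,
             days   = pmin(time, cens),
             status = as.integer(time <= cens))
}
train <- simulate(300)
test  <- simulate(150)

# The cutoff comes from the training scores only and is reused on the test set.
# A median split on f(x) is the same as a median split on exp(f(x)).
cutoff <- median(train$score)
label_risk <- function(score) {
  factor(ifelse(score > cutoff, "high", "low"), levels = c("low", "high"))
}
train$risk <- label_risk(train$score)
test$risk  <- label_risk(test$score)

# Kaplan-Meier curves per risk group on the test set
km_test <- survfit(Surv(days, status) ~ risk, data = test)
plot(km_test, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")
legend("topright", legend = levels(test$risk), col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two groups (1 degree of freedom)
lr_test <- survdiff(Surv(days, status) ~ risk, data = test)
p_value <- 1 - pchisq(lr_test$chisq, df = 1)
print(lr_test)
```

With the question's objects, the same steps would presumably use median(pred.train) as the cutoff and data.test with pred.test for the KM plot; repeating the survfit call on the training data gives the training-set plot.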

Tags: boosting, gbm, machine-learning, r, survival-analysis