移动块自举的覆盖概率问题

问题描述

我使用时间序列数据将移动块引导程序 (MBB) 应用于回归模型。当我计算从 MBB 导出的估计量的覆盖概率时，结果是异常的，除了一个系数（x1 的系数设置为连续变量）。鉴于 MBB 是一种完善的方法（参见 https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf 和 https://en.wikipedia.org/wiki/Bootstrapping_(statistics)），我想知道我的代码是否有问题。我感谢任何输入！

set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size=30,ar=0.7) {
  ar.epsilon <- arima.sim(list(order = c(1,0),ar = 0.7),n = size,sd=2)
  x1=rnorm(size)
  x2=sample(1:5,size,replace = TRUE,prob = c(0.2,0.2,0.2))
  x3=rbinom(size,1,0.5)
  y=as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon) #A combination of continuous 
                                                           #predictor x1,ordinal predictor
                                                           #x2 and binary predictor x3
  data.frame(time=1:size,x1=x1,x2=x2,x3=x3,y=y)}

#A time series
tdat <- tsfunc3()

# Block length derived from the data based on the approach proposed by Politis & White 
#(2003): 
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1),function(i) seq(i,i+b-1))

#MBB for intercept estimator
IntMbb <- function() { 
  take.blocks <- sample(1:28,10,replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]),]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb,"(Intercept)",level = 0.95)
}

#MBB for x1 coefficient estimator
B1Mbb <- function() { 
  take.blocks <- sample(1:28,"x1",level = 0.95)
}

#MBB for x2 coefficient estimator
B2Mbb <- function() { 
  take.blocks <- sample(1:28,"x2",level = 0.95)
}

#MBB for x3 coefficient estimator
B3Mbb <- function() { 
  take.blocks <- sample(1:28,"x3",level = 0.95)
}

#Replications
set.seed(47) 
R <- 100
int.mbb <- replicate(R,IntMbb(),simplify=FALSE)
b1.mbb <- replicate(R,B1Mbb(),simplify=FALSE)
b2.mbb <- replicate(R,B2Mbb(),simplify=FALSE)
b3.mbb <- replicate(R,B3Mbb(),simplify=FALSE)

#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb,function(x,y) x[grep(y,rownames(x)),],"Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34

#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb,"x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9

#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb,"x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38

#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb,"x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33

如您所见，只有 x1 系数估计器的覆盖概率是可以的。那么我的代码有什么问题吗？还是跟MBB本身有关系？

解决方法

您并没有真正评估引导程序的覆盖概率。您需要从自举统计量中建立置信区间，而不是从在自举样本上运行的参数模型中建立置信区间。这就是我要怎么做。

首先，我们可以生成数据：

set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1),function(i) seq(i,i+b-1))

#A time series
tdat <- tsfunc3(size=n,ar=.7)

接下来，我们可以编写一个将引导的函数。此函数生成 bootstrap 样本，运行回归并保存系数。

bsfun <- function(data,blocks){
  samp.data <- data[sample(1:length(blocks),length(blocks),replace=TRUE),]
  mod <- lm(y ~ x1 + x2 + x3,data=samp.data)
  coef(mod)
}

接下来，我们可以多次运行该函数。请注意，要生成可靠的 95% 百分位置信区间，您应该拥有大约 1500-2500 个引导统计数据。您尝试表征的分位数在尾部越远，您需要的引导样本就越多。因此，下面的代码生成一组自举系数：

out <- t(replicate(1000,bsfun(data=tdat,blocks=blocks)))

从这组 bootstrap 统计数据中，我们可以得出一个单一的置信区间。

ci1 <- t(apply(out,2,quantile,probs=c(.025,.975),na.rm=TRUE))
#                   2.5%     97.5%
# (Intercept) -0.3302237 10.258229
# x1          -1.7577214  2.301975
# x2          -0.8016478  2.049435
# x3          -3.0723869  6.190383

如果您想调查这些区间的覆盖概率，您必须多次执行我在上面所做的操作（我们将执行 100 次，但为了获得更好的估计值，您可能需要执行更多操作）。然后我们可以编写一个小函数来评估一组估计的覆盖范围：

eval_cover <- function(true = c(5,.25,.4,.8),obs){
  out <- as.numeric(obs[,1] < true & obs[,2] > true)
  names(out) <- rownames(obs)
  out 
}

然后，您可以将该函数应用于您生成的每个自举置信区间。使用 rowMeans() 函数将获得覆盖 1/0 值的平均值，这将是覆盖概率。在这种情况下，仅使用 100 个区间，覆盖率为 100%。

rowMeans(sapply(outci,function(x)eval_cover(obs=x)))
# (Intercept)          x1          x2          x3 
#           1           1           1           1

r r regression simulation time-series