在分类变量上使用用户定义的拆分函数时处理不正确的 rpart 输出善良与方向

问题描述

我正在尝试为 documentation 后面的 rpart 编写自己的拆分函数。

即使使用交替逻辑回归示例，我也遇到了问题。以下函数基本上是从文档中复制粘贴的。

## Initialization Function
loginit <- function(y,offset,parms,wt)
{
  if (is.null(offset)) offset <- 0
  if (any(y != 0 & y != 1)) stop ('response must be 0/1')
  sfun <- function(yval,dev,wt,ylevel,digits ) {
    paste("events=",round(yval[,1]),",coef= ",format(signif(yval[,2],digits)),deviance=",format(signif(dev,sep = '')}
  # Not in documentation - added to be able to call rpart.plot
  tfun <- function(yval,digits,n,use.n) {
    if (use.n) 
      paste0(format(yval[,1],digits),"\nn=",n)
    else format(yval,digits)
  }
  environment(sfun) <- .GlobalEnv
  list(y = cbind(y,offset),parms = 0,numresp = 2,numy = 2,summary = sfun,text = tfun)
}

## Evaluation Function
logeval <- function(y,parms)
{
  tfit <- glm(y[,1] ~ offset(y[,2]),binomial,weight = wt)
  list(label= c(sum(y[,tfit$coef),deviance = tfit$deviance)
}


## Splitting Function
logsplit <- function(y,x,continuous)
{
  if (continuous) {
    # continuous x variable: do all the logistic regressions
    n <- nrow(y)
    goodness <- double(n-1)
    direction <- goodness
    temp <- rep(0,n)
    for (i in 1:(n-1)) {
      temp[i] <- 1
      if (x[i] != x[i+1]) {
        tfit <- glm(y[,1] ~ temp + offset(y[,weight = wt)
        goodness[i] <- tfit$null.deviance - tfit$deviance
        direction[i] <- sign(tfit$coef[2])
      }
    }
  } else {
    # Categorical X variable
    # First,find out what order to put the categories in,which
    # will be the order of the coefficients in this model
    tfit <- glm(y[,1] ~ factor(x) + offset(y[,2]) - 1,weight = wt)
    ngrp <- length(tfit$coef)
    direction <- rank(rank(tfit$coef) + runif(ngrp,0.1)) #break ties
    # breaking ties -- if 2 groups have exactly the same p-hat,it
    # does not matter which order I consider them in. And the calling
    # routine wants an ordering vector.
    #
    xx <- direction[match(x,sort(unique(x)))] #relabel from small to large
    goodness <- double(length(direction) - 1)
    for (i in 1:length(goodness)) {
      tfit <- glm(y[,1] ~ I(xx > i) + offset(y[,weight = wt)
      goodness[i] <- tfit$null.deviance - tfit$deviance
    }
  }
  list(goodness=goodness,direction=direction)
}

我只将函数 tfun 添加到 loginit 函数中，以便能够在最终模型上调用 rpart.plot。

文档中没有应用这些函数。然而，使用稍微修改的目标并且仅使用区域作为自变量，可以使用来自同一文档的方差分析示例数据。

## Packages
if("rpart" %in% rownames(installed.packages()) == FALSE) {
  install.packages("rpart")
}
library(rpart)
if("rpart.plot" %in% rownames(installed.packages()) == FALSE) {
  install.packages("rpart.plot")
}
library(rpart.plot)
if("plyr" %in% rownames(installed.packages()) == FALSE) {
  install.packages("plyr")
}
library(plyr)
if("dplyr" %in% rownames(installed.packages()) == FALSE) {
  install.packages("dplyr")
}
library(dplyr)

## Data
mystate <- data.frame(state.x77,region=state.region)
names(mystate) <- casefold(names(mystate)) #remove mixed case

## Model
ulist <- list(eval = logeval,split = logsplit,init = loginit)
fit1 <- rpart(I(murder >= mean(mystate$murder)) ~ region,data = mystate,method = ulist,minbucket = 10)

## Plot
rpart.plot(fit1)

模型有几个问题，如下图所示。 north Central 区域出现在左侧，即使它应该在第一次拆分时出现在右侧。其中一片叶子的观察值少于 minbucket 设置的最小值，并且叶子中的观察值加起来没有达到应有的 50。

文本输出确认图像。

> fit1
n= 50 

node),split,deviance,yval
      * denotes terminal node

1) root 50 68.994380 23  
  2) region=northeast,South,West 38 52.679190 19  
    4) region=northeast,north Central 9  6.278978  1 *
    5) region=South 16 15.442480 13 *
  3) region=north Central 12 15.276340  4 *

通过调查模型的预测，可以看到有 4 个终端节点。

mystate$predicted_node = fit1$where
mystate %>%
  group_by(predicted_node) %>%
  summarise(n = n(),events = sum(murder >= mean(mystate$murder)))

# A tibble: 4 x 3
  predicted_node     n events
           <int> <int>  <int>
1              2    13      5
2              3     9      1
3              4    16     13
4              5    12      4

善良与方向

goodness 和 direction 的值及其在 logsplit 函数的分类变量部分中的顺序显然存在问题。连续变量工作得很好。

解决方法

暂无找到可以解决该程序问题的有效方法，小编努力寻找整理中！

如果你已经找到好的解决方法，欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@）

r r rpart

在分类变量上使用用户定义的拆分函数时处理不正确的 rpart 输出 善良与方向

问题描述

善良与方向

解决方法

在分类变量上使用用户定义的拆分函数时处理不正确的 rpart 输出善良与方向