问题描述
我正在尝试为 documentation 后面的 rpart 编写自己的拆分函数。
即使使用交替逻辑回归示例,我也遇到了问题。以下函数基本上是从文档中复制粘贴的。
## Initialization Function
loginit <- function(y,offset,parms,wt)
{
if (is.null(offset)) offset <- 0
if (any(y != 0 & y != 1)) stop ('response must be 0/1')
sfun <- function(yval,dev,wt,ylevel,digits ) {
paste("events=",round(yval[,1]),",coef= ",format(signif(yval[,2],digits)),deviance=",format(signif(dev,sep = '')}
# Not in documentation - added to be able to call rpart.plot
tfun <- function(yval,digits,n,use.n) {
if (use.n)
paste0(format(yval[,1],digits),"\nn=",n)
else format(yval,digits)
}
environment(sfun) <- .GlobalEnv
list(y = cbind(y,offset),parms = 0,numresp = 2,numy = 2,summary = sfun,text = tfun)
}
## Evaluation Function
logeval <- function(y,parms)
{
tfit <- glm(y[,1] ~ offset(y[,2]),binomial,weight = wt)
list(label= c(sum(y[,tfit$coef),deviance = tfit$deviance)
}
## Splitting Function
logsplit <- function(y,x,continuous)
{
if (continuous) {
# continuous x variable: do all the logistic regressions
n <- nrow(y)
goodness <- double(n-1)
direction <- goodness
temp <- rep(0,n)
for (i in 1:(n-1)) {
temp[i] <- 1
if (x[i] != x[i+1]) {
tfit <- glm(y[,1] ~ temp + offset(y[,weight = wt)
goodness[i] <- tfit$null.deviance - tfit$deviance
direction[i] <- sign(tfit$coef[2])
}
}
} else {
# Categorical X variable
# First,find out what order to put the categories in,which
# will be the order of the coefficients in this model
tfit <- glm(y[,1] ~ factor(x) + offset(y[,2]) - 1,weight = wt)
ngrp <- length(tfit$coef)
direction <- rank(rank(tfit$coef) + runif(ngrp,0.1)) #break ties
# breaking ties -- if 2 groups have exactly the same p-hat,it
# does not matter which order I consider them in. And the calling
# routine wants an ordering vector.
#
xx <- direction[match(x,sort(unique(x)))] #relabel from small to large
goodness <- double(length(direction) - 1)
for (i in 1:length(goodness)) {
tfit <- glm(y[,1] ~ I(xx > i) + offset(y[,weight = wt)
goodness[i] <- tfit$null.deviance - tfit$deviance
}
}
list(goodness=goodness,direction=direction)
}
我只将函数 tfun
添加到 loginit
函数中,以便能够在最终模型上调用 rpart.plot
。
文档中没有应用这些函数。然而,使用稍微修改的目标并且仅使用区域作为自变量,可以使用来自同一文档的方差分析示例数据。
## Packages
if("rpart" %in% rownames(installed.packages()) == FALSE) {
install.packages("rpart")
}
library(rpart)
if("rpart.plot" %in% rownames(installed.packages()) == FALSE) {
install.packages("rpart.plot")
}
library(rpart.plot)
if("plyr" %in% rownames(installed.packages()) == FALSE) {
install.packages("plyr")
}
library(plyr)
if("dplyr" %in% rownames(installed.packages()) == FALSE) {
install.packages("dplyr")
}
library(dplyr)
## Data
mystate <- data.frame(state.x77,region=state.region)
names(mystate) <- casefold(names(mystate)) #remove mixed case
## Model
ulist <- list(eval = logeval,split = logsplit,init = loginit)
fit1 <- rpart(I(murder >= mean(mystate$murder)) ~ region,data = mystate,method = ulist,minbucket = 10)
## Plot
rpart.plot(fit1)
模型有几个问题,如下图所示。 north Central
区域出现在左侧,即使它应该在第一次拆分时出现在右侧。其中一片叶子的观察值少于 minbucket
设置的最小值,并且叶子中的观察值加起来没有达到应有的 50。
文本输出确认图像。
> fit1
n= 50
node),split,deviance,yval
* denotes terminal node
1) root 50 68.994380 23
2) region=northeast,South,West 38 52.679190 19
4) region=northeast,north Central 9 6.278978 1 *
5) region=South 16 15.442480 13 *
3) region=north Central 12 15.276340 4 *
通过调查模型的预测,可以看到有 4 个终端节点。
mystate$predicted_node = fit1$where
mystate %>%
group_by(predicted_node) %>%
summarise(n = n(),events = sum(murder >= mean(mystate$murder)))
# A tibble: 4 x 3
predicted_node n events
<int> <int> <int>
1 2 13 5
2 3 9 1
3 4 16 13
4 5 12 4
善良与方向
goodness
和 direction
的值及其在 logsplit
函数的分类变量部分中的顺序显然存在问题。连续变量工作得很好。
解决方法
暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!
如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。
小编邮箱:dio#foxmail.com (将#修改为@)