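Problem description
I am using ggeffects::ggemmeans() to get predictions from a model, but I can't tell whether I have found a bug or am doing something wrong. When a factor variable is used as a predictor in the model, the output of ggemmeans() gets mixed up after the factor is releveled.
Example
Below are two scenarios, a and b, in which I convert a data column to a factor, fit a model with lm(), and finally compute predictions with ggemmeans().
Created on 2021-05-03 by the reprex package (v0.3.0)
library(ggplot2)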
library(dplyr)
library(emmeans)
library(ggeffects)
# scenario a
# step a1 -- convert manufacturer col to factor
my_mpg_manuf_as_fac_a <-
  mpg %>%
  mutate(across(manufacturer, factor))
levels(my_mpg_manuf_as_fac_a$manufacturer) ## the levels are ordered alphabetically
#> [1] "audi" "chevrolet" "dodge" "ford" "honda"
#> [6] "hyundai" "jeep" "land rover" "lincoln" "mercury"
#> [11] "nissan" "pontiac" "subaru" "toyota" "volkswagen"
# step a2 -- model and get predictions
pred_a <-
  my_mpg_manuf_as_fac_a %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")
pred_a
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> audi       |     17.61 | [16.25, 18.97]
#> dodge      |     13.14 | [12.19, 14.08]
#> ford       |     14.00 | [12.85, 15.15]
#> hyundai    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> mercury    |     13.25 | [10.37, 16.13]
#> pontiac    |     17.00 | [14.42, 19.58]
#> volkswagen |     20.93 | [19.82, 22.04]
# scenario b
# step b1 -- convert manufacturer col to factor (same as step a1)
my_mpg_manuf_as_fac_b <-
  mpg %>%
  mutate(across(manufacturer, factor))
# step b2 -- change the order of levels in manufacturer
levels(my_mpg_manuf_as_fac_b$manufacturer) <- sort(levels(my_mpg_manuf_as_fac_b$manufacturer), decreasing = TRUE)
levels(my_mpg_manuf_as_fac_b$manufacturer) ## order of levels is now reversed
#> [1] "volkswagen" "toyota" "subaru" "pontiac" "nissan"
#> [6] "mercury" "lincoln" "land rover" "jeep" "hyundai"
#> [11] "honda" "ford" "dodge" "chevrolet" "audi"
# step b3 -- model and get predictions
pred_b <-
  my_mpg_manuf_as_fac_b %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")
pred_b
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> volkswagen |     17.61 | [16.25, 18.97]
#> subaru     |     13.14 | [12.19, 14.08]
#> pontiac    |     14.00 | [12.85, 15.15]
#> mercury    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> hyundai    |     13.25 | [10.37, 16.13]
#> ford       |     17.00 | [14.42, 19.58]
#> audi       |     20.93 | [19.82, 22.04]
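When we compare pred_a and pred_b, it is easy to see that the values in the Predicted and 95% CI columns stay the same, even though the order of the names in the x column has changed.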
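Solution
You should use factor() to relevel instead, because assigning to levels() never looks at the underlying data. When you use levels(), your entire data changes: audi becomes volkswagen, and so on. By passing the original vector to factor() together with the new level order, you keep the values themselves.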
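One way to see why this happens is to look at the integer codes a factor stores internally. The following is a minimal base R sketch on a made-up vector (the names x, relabelled and reordered are only illustrative and do not come from the question):

```r
# A factor stores integer codes plus a lookup table of labels (the levels).
x <- factor(c("audi", "audi", "toyota", "volkswagen"))
as.integer(x)             # codes: 1 1 2 3
levels(x)                 # "audi" "toyota" "volkswagen"

# Assigning to levels() replaces the label table only; the codes stay 1 1 2 3,
# so every observation is silently relabelled.
relabelled <- x
levels(relabelled) <- rev(levels(x))
as.character(relabelled)  # "volkswagen" "volkswagen" "toyota" "audi"

# factor(..., levels = ) matches each value against the new level order and
# recomputes the codes, so each observation keeps its original label.
reordered <- factor(x, levels = rev(levels(x)))
as.character(reordered)   # "audi" "audi" "toyota" "volkswagen"
as.integer(reordered)     # codes: 3 3 2 1
```

In both cases the levels end up in the same order; only the second version leaves the observations untouched.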
Data:
manufacturers = c("audi", "chevrolet", "subaru", "toyota", "volkswagen")
df = data.frame(mpg = runif(length(manufacturers) * 2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
Before:
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: audi chevrolet subaru toyota volkswagen
After:
df$manufacturer = factor(df$manufacturer, levels = sort(levels(df$manufacturer), decreasing = TRUE))
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: volkswagen toyota subaru chevrolet audi
Compare:
df = data.frame(mpg = runif(length(manufacturers) * 2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
levels(df$manufacturer) = sort(levels(df$manufacturer), decreasing = TRUE)
> df$manufacturer
[1] volkswagen volkswagen toyota toyota subaru subaru chevrolet chevrolet audi audi
Levels: volkswagen toyota subaru chevrolet audi
Here the entire vector has been renamed.
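Applied to scenario b from the question, the same idea might look like the sketch below. It assumes ggplot2 is installed (only for the mpg data set), and the names fixed, means_original and means_fixed are illustrative. It checks the group means with tapply() rather than rerunning ggemmeans(), so treat it as a sanity check on the data, not as the package's own output.

```r
library(ggplot2)  # only needed for the mpg data set

# Reorder the levels with factor(), leaving the observations untouched
fixed <- mpg
fixed$manufacturer <- factor(
  fixed$manufacturer,
  levels = sort(unique(fixed$manufacturer), decreasing = TRUE)
)

levels(fixed$manufacturer)[1:3]  # "volkswagen" "toyota" "subaru"

# The mean of cty per manufacturer is still attached to the right label
means_original <- tapply(mpg$cty, mpg$manufacturer, mean)
means_fixed    <- tapply(fixed$cty, fixed$manufacturer, mean)
all.equal(
  as.numeric(means_fixed[names(means_original)]),
  as.numeric(means_original)
)
```

If the final comparison returns TRUE, a model fitted on fixed with lm(cty ~ manufacturer) should give each manufacturer its own prediction, just listed in the reversed order.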