How to get multiple predictions from a multinomial regression model, i.e. split by a factor variable, rather than a focal prediction

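Problem description

I want to run a multinomial regression to get the average frequency of each choice in a closed-ended question, split by a factor (gender: male/female).

Background

I want to compare four types of cheese to measure how popular each one is. The four options are Cheddar, Mozzarella, Gouda, and Brie. I go out and ask 200 people for their preferred cheese. Each person can choose only one of the 4 types. I also end up collecting some demographic information: gender, age, and weight.

After completing data collection, I want to see the share of popularity of each cheese type (summing to 100%). Since I want to control for gender, age, and weight, I figure multinomial regression is appropriate here.

But I'm also interested in how the results differ between males and females, so I want to include gender as a factor in the model. How can I generate, based on the (multinomial) model, a double prediction that gives predicted values for females and males separately, so that the two gender levels can be compared?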

Data

library(truncnorm)
library(tidyverse)

set.seed(999)

cheese_df <-
  tibble(
    age = round(rtruncnorm(
      n = 200, a = 20, b = 80, mean = 25, sd = 25.09
    )),
    cheese_response = as_factor(sample(
      c("cheddar", "mozzarella", "gouda", "brie"), size = 200, replace = TRUE
    )),
    # size = 200 is required: tibble() only recycles length-1 vectors
    gender = sample(c(0, 1), size = 200, replace = TRUE),
    weight = rtruncnorm(
      n = 200, a = 40, b = 120, mean = 70, sd = 25.09
    )
  )


> cheese_df

## # A tibble: 200 x 4
##      age cheese_response gender weight
##    <dbl> <fct>            <dbl>  <dbl>
##  1    45 cheddar              0   62.2
##  2    32 cheddar              0   45.0
##  3    58 cheddar              1   87.6
##  4    28 brie                 0   68.8
##  5    49 gouda                0   88.2
##  6    29 brie                 1   74.5
##  7    49 cheddar              0   74.0
##  8    27 gouda                1   90.3
##  9    28 brie                 0   56.5
## 10    48 mozzarella           0   72.9
## # ... with 190 more rows

If I just wanted to run a multinomial regression controlling for age, gender, and weight, **without splitting by gender**, I could do this:

library(nnet)
library(effects)


fit <- nnet::multinom(cheese_response ~ age + gender + weight,data = cheese_df)

average_person_for_control <-
  c(
    age = 50,gender = 0.5,weight = 75
  )

prediction <-
  effects::Effect(
    "age",
    fit,
    given.values = average_person_for_control,
    xlevels = list(age = c(45, 90))
  )


proportions_for_plot <-
  data.frame(prediction$prob,prediction$lower.prob,prediction$upper.prob) %>% 
  slice(1) %>%
  pivot_longer(.,cols = everything(),names_to = c(".value","response"),names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob","upper_ci" = "U.prob","estimate" = "prob")


ggplot(proportions_for_plot,aes(x = reorder(response,-estimate),y = estimate)) +
  geom_bar(stat = "identity",width = 0.7,fill = "darkgreen") +
  geom_errorbar(aes(ymin = lower_ci,ymax = upper_ci),width = 0.2) +
  geom_text(aes(label = paste0(100*round(estimate,2),"%")),vjust = 1.6,color = "white",size = 3) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")

[Plot: controlled_for_no_split]

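However, I'm interested in generating the same bar plot, only split into separate bars for males and females.


This is the plot I'm aiming for:

(ignore the values in this demo)

[Image: demo of the target bar plot, split by gender]

One way would be to subset the data by gender, run the same model on each subset, generate two bar plots, and combine them. However, I want to include gender in the model as a factor and only then output the split bar plot. Since gender is already part of the model, this is partly solved:

fit <- nnet::multinom(cheese_response ~ age + gender + weight, data = cheese_df)

Still, I run into trouble when I try to split the prediction by gender so the two can be compared side by side in a bar plot. This is because effects::Effect() accepts only a single vector for its given.values argument. Otherwise, I would do the following to supply the values to predict at (as when using predict()):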

control_by_gender <-
  expand.grid(
    age = 50,weight = 75,gender = c(0,1)
  )

> control_by_gender

##   age weight gender
## 1  50     75      0
## 2  50     75      1

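Any ideas on how to get such multiple, rather than focal, predictions when dealing with a multinomial model object like the one above? My ultimate goal is the bar plot split by gender, as in the demo above. I've been using effects::Effect to generate predictions, but I'm open to any alternative that does the multiple-predictions trick.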

Solution

Why not just apply over the levels in the effects::Effect call:


# One Effect() call per gender level; keep the first age row (age = 45) of each
prediction <- do.call(rbind, lapply(0:1, function(x) {
  eff <- effects::Effect(
    "age",
    fit,
    given.values = c(age = 50, weight = 75, gender = x),
    xlevels = list(age = c(45, 90))
  )
  data.frame(level = x, eff$prob, eff$lower.prob, eff$upper.prob) %>% slice(1)
}))


proportions_for_plot <-
  prediction %>% 
  pivot_longer(.,cols = !level,names_to = c(".value","response"),names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob","upper_ci" = "U.prob","estimate" = "prob")


# Use the same dodge width for bars, error bars, and labels so they line up
ggplot(proportions_for_plot, aes(x = as.factor(response), y = estimate, fill = factor(level))) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(0.7)) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), position = position_dodge(0.7), width = 0.2) +
  geom_text(aes(label = paste0(100 * round(estimate, 2), "%")), vjust = 1.6, color = "white", size = 3, position = position_dodge(0.7)) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")
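
The reshaping step above relies on the special ".value" entry in names_to, which may not be obvious at first glance. Here is a minimal sketch of what it does, using a hypothetical one-row data frame (`wide`, with made-up numbers) whose column names only mimic those produced by data.frame(eff$prob, eff$lower.prob, eff$upper.prob):

```r
library(tidyr)

# Hypothetical stand-in for one row of Effect() output; the values are made up
wide <- data.frame(
  level = 0,
  prob.brie = 0.30,   prob.gouda = 0.70,
  L.prob.brie = 0.20, L.prob.gouda = 0.60,
  U.prob.brie = 0.40, U.prob.gouda = 0.80
)

# The greedy first group captures everything up to the LAST dot
# ("prob", "L.prob", "U.prob") and becomes the new column names via ".value";
# the second group (the cheese name) goes into the "response" column.
long <- pivot_longer(
  wide,
  cols = !level,
  names_to = c(".value", "response"),
  names_pattern = "(.*)\\.(.*$)"
)

print(long)
```

The result has one row per cheese and separate prob, L.prob, and U.prob columns, which is the shape the rename() and ggplot() calls expect.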


This answer uses the same intuition as @Abdessabour Mtk's, just with purrr::map and some refactoring:

make_eff_df <- function(gender, fit) {
  # The fitted model must be passed to Effect() as its second argument
  Effect("age", fit, xlevels = list(age = c(45, 90)), given.values = c(age = 50, gender = gender)) %>%
    as_tibble() %>%
    mutate(gender = gender) %>%
    select(gender, matches("[a-z\\.]?prob")) %>%
    slice(1)
}


map_dfr(0:1, make_eff_df, fit) %>%
  # names_to is needed so the two regex groups map to value columns and a response column
  pivot_longer(-gender, names_to = c(".value", "response"), names_pattern = "(.+)\\.(.+$)") %>%
  rename(lower_ci = "L.prob", upper_ci = "U.prob", estimate = "prob") %>%
  mutate(across(1:2, as.factor)) %>%
  ggplot(aes(x = reorder(response, -estimate), y = estimate, fill = gender)) +
  geom_bar(stat = "identity", position = position_dodge(.9)) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), position = position_dodge(.9), width = 0.2) +
  geom_text(aes(label = scales::percent(estimate, accuracy = 1)), position = position_dodge(.9)) +
  labs(x = "cheese type", y = "proportion of people choosing this type")

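
If only point estimates are needed, the control_by_gender grid from the question can be passed straight to predict() on the multinom fit. The sketch below is self-contained and uses a simulated stand-in for cheese_df (`toy_df` and `toy_fit` are hypothetical names, and the data are drawn with base R rather than truncnorm). Note that predict() here returns probabilities only, without the confidence intervals that Effect() provides, so it can't replace the error bars in the plots above.

```r
library(nnet)

set.seed(999)
n <- 200

# Simulated stand-in for cheese_df (assumption: uniform draws instead of rtruncnorm)
toy_df <- data.frame(
  age = round(runif(n, min = 20, max = 80)),
  weight = runif(n, min = 40, max = 120),
  gender = sample(c(0, 1), size = n, replace = TRUE),
  cheese_response = factor(sample(
    c("cheddar", "mozzarella", "gouda", "brie"), size = n, replace = TRUE
  ))
)

toy_fit <- multinom(cheese_response ~ age + gender + weight,
                    data = toy_df, trace = FALSE)

# One row per gender level, with age and weight held at the control values
control_by_gender <- expand.grid(age = 50, weight = 75, gender = c(0, 1))

# type = "probs" gives one row of class probabilities per row of newdata
probs <- predict(toy_fit, newdata = control_by_gender, type = "probs")

print(cbind(control_by_gender, probs))
```

Each row of probs sums to 1, so the two rows can be compared directly as the predicted cheese shares for gender = 0 and gender = 1.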