问题描述
我有这个数据集
study_ID title experiment question_ID participant_ID estimate_level estimate correct_answer question type category age gender
<dbl> <chr> <dbl> <chr> <int> <chr> <dbl> <dbl> <chr> <chr> <chr> <int> <chr>
1 11 Dallacker_Parents'_co… 1 1 1 individual 3 10 How many sugar cubes does or… unlim… nutriti… 32 Female
2 11 Dallacker_Parents'_co… 1 2 1 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 32 Female
3 11 Dallacker_Parents'_co… 1 3 1 individual 7 6.5 How many sugar cubes does a … unlim… nutriti… 32 Female
4 11 Dallacker_Parents'_co… 1 4 1 individual 1 16.5 How many sugar cubes does a … unlim… nutriti… 32 Female
5 11 Dallacker_Parents'_co… 1 5 1 individual 7 11 How many sugar cubes does a … unlim… nutriti… 32 Female
6 11 Dallacker_Parents'_co… 1 6 1 individual 5 2.5 How many sugar cubes does a … unlim… nutriti… 32 Female
7 11 Dallacker_Parents'_co… 1 1 2 individual 2 10 How many sugar cubes does or… unlim… nutriti… 29 Female
8 11 Dallacker_Parents'_co… 1 2 2 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 29 Female
9 11 Dallacker_Parents'_co… 1 3 2 individual 1.5 6.5 How many sugar cubes does a … unlim… nutriti… 29 Female
10 11 Dallacker_Parents'_co… 1 4 2 individual 2 16.5 How many sugar cubes does a … unlim… nutriti… 29 Female
此数据集中有6个问题,每个问题都有一个correct_answer
列和一个estimate
列。我正在尝试为每个问题计算一个数量级,以便获得被低估或高估且正确估算的人的百分比。
例如,对于6个问题中的每一个,它都将返回类似的结果:80%被低估,10%被高估,而10%回答正确。
我该怎么做?我感到难过。预先感谢!
这是举报
dput(head(DF,10))
structure(list(study_ID = c(5,5,5),title = c("5_Jayles_Debiasing_The_Crowd","5_Jayles_Debiasing_The_Crowd","5_Jayles_Debiasing_The_Crowd"),experiment = c(1,1,1),question_ID = c(1,participant_ID = c(1,2,3,4,6,7,8,9,10),estimate_level = c("individual","individual","individual"),estimate = c(2e+07,4500000,21075541,2e+07,1e+06,1.1e+07,2.5e+07,8e+06,1.6e+07,9800000),correct = c(3.8e+07,3.8e+07,3.8e+07),question = c("What is the population of Tokyo and its agglomeration?","What is the population of Tokyo and its agglomeration?","What is the population of Tokyo and its agglomeration?"),type = c("unlimited","unlimited","unlimited"),category = c("demographics","demographics","demographics"
),age = c("NA","NA","NA"),gender = c("NA","NA")),row.names = c(NA,-10L),class = c("tbl_df","tbl","data.frame"))
解决方法
这是一种dplyr
的方法:
library(dplyr)
df %>%
group_by(question_ID) %>%
summarize(prop_over = mean(estimate > correct),prop_under = mean(estimate < correct),prop_correct = mean(estimate == correct)
)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 1 x 4
# question_ID prop_over prop_under prop_correct
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 1 0
,
list1 <- lapply(split(DF,DF$question_ID),function (x) {
overestimated <- 100 * length(which(x$estimate > x$correct)) / length(x$estimate)
underestimated <- 100 * length(which(x$estimate < x$correct)) / length(x$estimate)
correct <- 100 * length(which(x$estimate == x$correct)) / length(x$estimate)
data.frame(overestimated,underestimated,correct)
})
list2 <- mapply(function (x,y) {
x$question_ID <- y
return (x)
},x = list1,y = names(list1),SIMPLIFY = F)
Percent_Data <- do.call("rbind",list2)
Percent_Data <- Percent_Data[,c(which(colnames(Percent_Data) == "question_ID"),which(colnames(Percent_Data) != "question_ID"))]
Percent_Data
# question_ID overestimated underestimated correct
# 1 1 0 100 0