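Problem description
I first use group_by, then sum, and then summarise to add totals to the data frame. However, the percentage values at the total level are wrong. I would therefore like to overwrite the percentage variable "percentage absent staff" with a calculation that gives the correct result. The problem is that the dataset is long, so this cannot be done by hand. I am looking for a good solution: a loop, or something else?
Code: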
# 16 rows: 2 dates x 2 hotels x 4 variables
Date=rep(c("01/09/2020","02/09/2020"),each=8)
Asset=rep(rep(c("Blue Hotel","Green Hotel"),each=4),times=2)
Variable=rep(c("hotel staff","bar staff","absent staff","percentage absent staff"),times=4)
value=c(5,10,3,0.2,4,8,2,0.17,5,10,3,0.20,6,3,3,0.33)
df=data.frame(Date,Asset,Variable,value)
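library(dplyr)
# Create totals per Date and Variable (this also sums the percentage rows, which is the problem)
df2 = df %>%
  group_by(Date, Variable) %>%
  summarise(value = sum(as.numeric(value), na.rm = FALSE)) %>%
  ungroup()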
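Solution
I'm not sure which calculation you want, because the first "correct" calculation looks like absent_staff / (hotel_staff + bar_staff + absent_staff), whereas the second looks like absent_staff / (hotel_staff + bar_staff). Either way, you can adapt the solution below to whichever you prefer.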
library(dplyr)

df2 = df %>%
  group_by(Date, Variable) %>%
  summarise(value = sum(as.numeric(value), na.rm = FALSE)) %>%
  ungroup() %>%
  group_by(Date) %>%
  # Within each date, recompute the percentage row from the summed counts
  mutate(value = case_when(
    Variable == "percentage absent staff" ~
      value[which(Variable == "absent staff")] /
        sum(value[which(Variable %in% c("absent staff", "bar staff", "hotel staff"))]),
    TRUE ~ value
  ))
df2
# # A tibble: 8 x 3
# # Groups:   Date [2]
#   Date       Variable                 value
#   <chr>      <chr>                    <dbl>
# 1 01/09/2020 absent staff             5
# 2 01/09/2020 bar staff               18
# 3 01/09/2020 hotel staff              9
# 4 01/09/2020 percentage absent staff  0.156
# 5 02/09/2020 absent staff             6
# 6 02/09/2020 bar staff               13
# 7 02/09/2020 hotel staff             11
# 8 02/09/2020 percentage absent staff  0.2
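Above, the summarised data are grouped by Date and the value is then replaced using a conditional expression. Where Variable equals "percentage absent staff", the new value is the "absent staff" value divided by the sum of the "absent staff", "bar staff" and "hotel staff" values. So if you actually want the second calculation from above, drop "absent staff" from that vector. For every other row, value is returned unchanged.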
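As a minimal sketch of that second calculation, assuming the summed counts from the printed output are typed in by hand (the totals data frame and the present vector are illustrative names, not part of the original code), the denominator can be restricted to the staff who are present:

```r
library(dplyr)

# Summed counts copied from the printed df2 output (illustrative input)
totals <- data.frame(
  Date     = rep(c("01/09/2020", "02/09/2020"), each = 4),
  Variable = rep(c("absent staff", "bar staff", "hotel staff",
                   "percentage absent staff"), times = 2),
  value    = c(5, 18, 9, 0.156, 6, 13, 11, 0.2)
)

# Denominator for the second calculation: present staff only
present <- c("bar staff", "hotel staff")

alt <- totals %>%
  group_by(Date) %>%
  mutate(value = case_when(
    # absent / (bar + hotel), computed within each date
    Variable == "percentage absent staff" ~
      value[Variable == "absent staff"] / sum(value[Variable %in% present]),
    TRUE ~ value
  )) %>%
  ungroup()

alt
```

With these totals the percentage rows become 5/27 and 6/24, roughly 0.185 and 0.25, instead of 0.156 and 0.2.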
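Edit
To answer the question in the comments: if the same column, Variable, also holds values for another group with the same structure (residents, for example), you can replace those in the same way: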
# 16 rows: 2 dates x 2 hotels x 4 variables
Date=rep(c("01/09/2020","02/09/2020"),each=8)
Asset=rep(rep(c("Blue Hotel","Green Hotel"),each=4),times=2)
Variable=rep(c("hotel staff","bar staff","absent staff","percentage absent staff"),times=4)
value=c(5,10,3,0.2,4,8,2,0.17,5,10,3,0.20,6,3,3,0.33)
df=data.frame(Date,Asset,Variable,value)
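# Build a second group ("residents") with the same structure and random counts
dfr <- df
dfr$Variable <- gsub("staff","residents",dfr$Variable)
dfr$value <- rpois(nrow(dfr),25)
df <- bind_rows(df,dfr)
# Show the first five rows of each group
df[c(1:5,17:21),]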
# Sum the values within each Date and Variable
df2 = df %>%
  group_by(Date, Variable) %>%
  summarise(value = sum(as.numeric(value), na.rm = FALSE)) %>%
  ungroup()
# Recompute each percentage row from the counts of its own group
df2a = df2 %>%
  group_by(Date) %>%
  mutate(value = case_when(
    Variable == "percentage absent staff" ~
      value[which(Variable == "absent staff")] /
        sum(value[which(Variable %in% c("absent staff", "bar staff", "hotel staff"))]),
    Variable == "percentage absent residents" ~
      value[which(Variable == "absent residents")] /
        sum(value[which(Variable %in% c("absent residents", "bar residents", "hotel residents"))]),
    TRUE ~ value
  ))
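Writing one case_when branch per group gets repetitive as more groups are added. One possible way to cut the repetition is to build the row labels from the group suffix. This is only a sketch: the helper absent_share() is hypothetical and the resident counts are made up, and neither comes from the original post.

```r
library(dplyr)

# Hypothetical helper: absent share for one group suffix, e.g. "staff".
# Assumes each group has rows named "absent <suffix>", "bar <suffix>"
# and "hotel <suffix>".
absent_share <- function(value, Variable, suffix) {
  parts <- paste(c("absent", "bar", "hotel"), suffix)
  value[Variable == parts[1]] / sum(value[Variable %in% parts])
}

# Illustrative totals for a single date; the resident counts are invented
totals <- data.frame(
  Date     = "01/09/2020",
  Variable = c("absent staff", "bar staff", "hotel staff",
               "percentage absent staff",
               "absent residents", "bar residents", "hotel residents",
               "percentage absent residents"),
  value    = c(5, 18, 9, 0, 4, 6, 10, 0)
)

fixed <- totals %>%
  group_by(Date) %>%
  mutate(value = case_when(
    Variable == "percentage absent staff" ~
      absent_share(value, Variable, "staff"),
    Variable == "percentage absent residents" ~
      absent_share(value, Variable, "residents"),
    TRUE ~ value
  )) %>%
  ungroup()

fixed
```

Each branch still names its own percentage row, but the numerator and denominator labels are derived in one place, so changing the formula (for instance dropping "absent" from the denominator) only needs an edit inside the helper.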