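Problem description
I have a data frame/tibble with a factor variable of bins. Some bins are missing because the original data had no observations in those 5-year ranges. Is there an easy way to complete the series without having to deconstruct the intervals?
Here is a sample df.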
library(tibble)
df <- structure(list(bin = structure(c(1L,3L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L,17L),.Label = c("[1940,1945]","(1945,1950]","(1950,1955]","(1955,1960]","(1960,1965]","(1965,1970]","(1970,1975]","(1975,1980]","(1980,1985]","(1985,1990]","(1990,1995]","(1995,2000]","(2000,2005]","(2005,2010]","(2010,2015]","(2015,2020]","(2020,2025]"),class = "factor"),Values = c(2L,4L,14L,11L,8L,26L,30L,87L,107L,290L,526L,299L,166L,502L,8L)),row.names = c(NA,-15L),class = c("tbl_df","tbl","data.frame"))
df
# A tibble: 15 x 2
bin Values
<fct> <int>
1 [1940,1945] 2
2 (1950,1955] 4
3 (1960,1965] 14
4 (1965,1970] 11
5 (1970,1975] 8
6 (1975,1980] 26
7 (1980,1985] 30
8 (1985,1990] 87
9 (1990,1995] 107
10 (1995,2000] 290
11 (2000,2005] 526
12 (2005,2010] 299
13 (2010,2015] 166
14 (2015,2020] 502
15 (2020,2025] 8
I would like to add the missing (1945,1950] and (1955,1960] bins.
Solution
`bin` already has the `levels` you need, so you can pass them to `complete()` on `df`:
tidyr::complete(df,bin = levels(bin),fill = list(Values = 0))
# A tibble: 17 x 2
# bin Values
# <chr> <dbl>
# 1 (1945,1950] 0
# 2 (1950,1955] 4
# 3 (1955,1960] 0
# 4 (1960,1965] 14
# 5 (1965,1970] 11
# 6 (1970,1975] 8
# 7 (1975,1980] 26
# 8 (1980,1985] 30
# 9 (1985,1990] 87
#10 (1990,1995] 107
#11 (1995,2000] 290
#12 (2000,2005] 526
#13 (2005,2010] 299
#14 (2010,2015] 166
#15 (2015,2020] 502
#16 (2020,2025] 8
#17 [1940,1945] 2
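Note that `levels(bin)` is a character vector, so in the output above `bin` comes back as `<chr>` and is sorted alphabetically, which is why `[1940,1945]` ends up last. If the interval order matters, one possible variation is to expand over a factor that carries the full level set. The sketch below uses a small stand-in tibble (`toy` and `lv` are hypothetical names, not part of the question) and assumes a reasonably recent tidyr:

```r
library(tibble)
library(tidyr)

# Small stand-in for `df`: a factor with four levels, two of them unobserved.
lv <- c("[1940,1945]", "(1945,1950]", "(1950,1955]", "(1955,1960]")
toy <- tibble(
  bin = factor(c("[1940,1945]", "(1950,1955]"), levels = lv),
  Values = c(2L, 4L)
)

# Expanding over a factor built from the full level set keeps `bin` a factor,
# so the rows follow the level order instead of alphabetical order.
completed <- complete(
  toy,
  bin = factor(levels(bin), levels = levels(bin)),
  fill = list(Values = 0L)
)
completed
```

Using `0L` rather than `0` as the fill value keeps `Values` an integer column.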
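Another option is to bin the raw data (here `orig_df`, with a `Year` column), count the rows per bin, and left-join the counts onto the full set of levels: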
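library(dplyr)
library(tidyr)
library(ggplot2)  # provides cut_width()

# Bin the raw years into 5-year intervals; empty bins still exist as levels
df <- orig_df %>%
  mutate(bin = cut_width(Year, width = 5, center = 2.5))
# Count the observations in each bin that actually occurs
df2 <- df %>%
  group_by(bin) %>%
  summarize(Values = n()) %>%
  ungroup()
# Start from every level, attach the counts, and fill the gaps with 0
tibble(bin = levels(df$bin)) %>%
  left_join(df2 %>% mutate(bin = as.character(bin)), by = "bin") %>%
  replace_na(list(Values = 0))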
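If the raw data is available, the count and the fill can possibly be combined into one step, since `dplyr::count()` accepts `.drop = FALSE` to keep empty factor levels. The sketch below is an illustration only: `orig_df` and its `Year` values are made up here, and base `cut()` with explicit breaks is swapped in for `cut_width()` to avoid the ggplot2 dependency.

```r
library(dplyr)
library(tibble)

# Hypothetical raw data: one row per observation, with a `Year` column.
orig_df <- tibble(Year = c(1941, 1943, 1952, 1961, 1962, 1963))

# Bin into 5-year intervals; cut() creates a level for every interval,
# including those with no observations.
binned <- orig_df %>%
  mutate(bin = cut(Year, breaks = seq(1940, 1965, by = 5),
                   include.lowest = TRUE, dig.lab = 4))

# .drop = FALSE keeps the empty levels, which get a count of 0.
counts <- binned %>%
  count(bin, name = "Values", .drop = FALSE)
counts
```

This keeps `bin` as a factor, so the rows stay in interval order.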