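Is there a way to complete or expand an interval factor variable?

Problem description

I have a data frame/tibble that contains a factor variable of bins. Some bins are missing because the original data had no observations in those 5-year ranges. Is there an easy way to complete the series without having to deconstruct the intervals?

Here is a sample df.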

library(tibble)

df <- structure(list(bin = structure(c(1L,3L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L,17L),.Label = c("[1940,1945]","(1945,1950]","(1950,1955]","(1955,1960]","(1960,1965]","(1965,1970]","(1970,1975]","(1975,1980]","(1980,1985]","(1985,1990]","(1990,1995]","(1995,2000]","(2000,2005]","(2005,2010]","(2010,2015]","(2015,2020]","(2020,2025]"),class = "factor"),Values = c(2L,4L,14L,11L,8L,26L,30L,87L,107L,290L,526L,299L,166L,502L,8L)),row.names = c(NA,-15L),class = c("tbl_df","tbl","data.frame"))

df
# A tibble: 15 x 2
   bin         Values
   <fct>        <int>
 1 [1940,1945]      2
 2 (1950,1955]      4
 3 (1960,1965]     14
 4 (1965,1970]     11
 5 (1970,1975]      8
 6 (1975,1980]     26
 7 (1980,1985]     30
 8 (1985,1990]     87
 9 (1990,1995]    107
10 (1995,2000]    290
11 (2000,2005]    526
12 (2005,2010]    299
13 (2010,2015]    166
14 (2015,2020]    502
15 (2020,2025]      8

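I want to add the missing (1945,1950] and (1955,1960] bins.

Solution

bin already has all the levels you need, so you can pass them to complete on df. Note that passing levels(bin) turns bin into a character column, so the rows come back in alphabetical order, which is why [1940,1945] ends up last in the output below: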

tidyr::complete(df,bin = levels(bin),fill = list(Values = 0))

# A tibble: 17 x 2
#   bin         Values
#   <chr>        <dbl>
# 1 (1945,1950]      0
# 2 (1950,1955]      4
# 3 (1955,1960]      0
# 4 (1960,1965]     14
# 5 (1965,1970]     11
# 6 (1970,1975]      8
# 7 (1975,1980]     26
# 8 (1980,1985]     30
# 9 (1985,1990]     87
#10 (1990,1995]    107
#11 (1995,2000]    290
#12 (2000,2005]    526
#13 (2005,2010]    299
#14 (2010,2015]    166
#15 (2015,2020]    502
#16 (2020,2025]      8
#17 [1940,1945]      2
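
If the factor type and the original bin order matter, a possible variation is to pass the factor column itself to complete instead of levels(bin). As far as I know, tidyr expands a factor over all of its levels, not only the ones present in the data. The sketch below uses a small made-up tibble (toy, lv and filled are illustrative names, not part of the question's data):

```r
library(tibble)
library(tidyr)

# Small illustrative data (not the asker's): a factor with four levels,
# two of which have no rows.
lv <- c("[1940,1945]", "(1945,1950]", "(1950,1955]", "(1955,1960]")
toy <- tibble(
  bin    = factor(c("[1940,1945]", "(1950,1955]"), levels = lv),
  Values = c(2L, 4L)
)

# Passing the factor column itself makes complete() expand over all of its
# levels, so the column stays a factor and the rows follow level order.
filled <- complete(toy, bin, fill = list(Values = 0L))
filled
```

Applied to the question's data, the equivalent call would be tidyr::complete(df, bin, fill = list(Values = 0L)).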
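
Alternatively, if you still have the raw data (orig_df, one row per observation with a Year column), bin it, count the rows per bin, and join the counts onto the full set of levels:
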
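library(dplyr)    # mutate(), group_by(), summarize(), left_join()
library(tidyr)    # replace_na()
library(ggplot2)  # cut_width()

# Bin the years into 5-year intervals
df <- orig_df %>% 
    mutate(bin = cut_width(Year,width = 5,center = 2.5)) 

# Count the observations in each bin that actually occurs
df2 <- df %>% 
    group_by(bin) %>% 
    summarize(Values = n()) %>% 
    ungroup()

# Start from every level, join the counts, and fill the gaps with 0
tibble(bin = levels(df$bin)) %>% 
    left_join(df2,by = "bin") %>% 
    replace_na(list(Values = 0L))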
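
When counting from raw data like this, the join step can probably be skipped: dplyr's count has a .drop argument, and with .drop = FALSE it should keep empty factor levels with a count of 0. A minimal sketch, assuming a hypothetical orig_df with a Year column (the values below are invented) and using base cut with explicit breaks in place of cut_width:

```r
library(dplyr)

# Hypothetical raw data standing in for orig_df: one row per observation.
orig_df <- data.frame(Year = c(1941, 1943, 1952, 1953, 1954, 1958))

binned <- orig_df %>%
  # cut() creates a level for every interval, including the empty ones
  mutate(bin = cut(Year, breaks = seq(1940, 1960, by = 5),
                   include.lowest = TRUE, dig.lab = 4)) %>%
  # .drop = FALSE keeps the levels that have no rows, with a count of 0
  count(bin, name = "Values", .drop = FALSE)

binned
```

Here bin stays a factor, so the rows come out in interval order rather than alphabetical order.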