After an outer join in R, how do I copy values from a specific time period to replace NAs?

Question

My data.frame_1 has information for every quarter from 2017-01-01 to 2020-10-01, as follows:

DATE          CLINIC_ID    NR_INDIVIDUALS    REGION_ID   TOTAL_NR_INDIVIDUALS     AVERAGE_INDEX
2017-01-01    A11          3                 A           100                      3
2017-01-01    A11          10                B           100                      3
2017-01-01    A12          14                C           130                      4
2017-01-01    A13          5                 D           110                      5
                        ....
2017-04-01    A11          2                 A           96                       4
2017-04-01    A11          9                 B           96                       4
2017-04-01    A12          13                C           100                      4
2017-04-01    A13          5                 D           105                      7
                        ....
2017-07-01    A11          2                 A           89                       4
2017-07-01    A11          8                 B           89                       4
2017-07-01    A12          14                C           105                      5
2017-07-01    A13          5                 D           90                       7
                        ....
2020-10-01    A11          6                 A           97                       4
2020-10-01    A11          14                B           97                       4
2020-10-01    A12          15                C           90                       6
2020-10-01    A13          3                 D           92                       7

My data.frame_2 only has information for 2 time periods (2019-09-01 and 2020-05-01), as follows:

DATE          REGION_ID       CONNECTIVITY      PERCENTAGE
2019-09-01    A               0<2Mbit/s         3
2019-09-01    A               2<5Mbit/s         4
2019-09-01    A               5<10Mbit/s        13
2019-09-01    A               10<30Mbit/s       60
2019-09-01    A               30<300Mbit/s      10
2019-09-01    A               >=300Mbit/s       10
                        ....
2020-05-01    A               0<2Mbit/s         3
2020-05-01    A               2<5Mbit/s         4
2020-05-01    A               5<10Mbit/s        3
2020-05-01    A               10<30Mbit/s       25
2020-05-01    A               30<300Mbit/s      35
2020-05-01    A               >=300Mbit/s       30

I am doing an outer join:

data.frame_3 <- merge(x = data.frame_1,y = data.frame_2,by = c("DATE","REGION_ID"),all = TRUE)

Question 1: Naturally, I get NA in CONNECTIVITY and PERCENTAGE for every DATE in data.frame_1 that has no match in data.frame_2. I want to fill CONNECTIVITY and PERCENTAGE for all months of 2019 with the values from 2019-09-01, and for all months of 2020 with the values from 2020-05-01. How can I do that?

In another case, I have data.frame_4, as follows:

DATE          CLINIC_ID    TOTAL_NR_INDIVIDUALS    AVERAGE_AGE
2017-01-01    A11          100                     40
2017-01-01    A11          100                     40
2017-01-01    A12          130                     44
2017-01-01    A13          110                     43
                        ....
2017-02-01    A11          96                      41
2017-02-01    A11          96                      41
2017-02-01    A12          100                     43
2017-02-01    A13          105                     43
                        ....
2017-03-01    A11          89                      41
2017-03-01    A11          89                      41
2017-03-01    A12          105                     42
2017-03-01    A13          90                      42
                        ....
2020-10-01    A11          97                      42
2020-10-01    A11          97                      42
2020-10-01    A12          90                      43
2020-10-01    A13          92                      43

I am doing an outer join:

data.frame_5 <- merge(x = data.frame_1,y = data.frame_4,by = c("DATE","CLINIC_ID"),all = TRUE)

Question 2: I want to copy the values in AVERAGE_INDEX (and the other columns of data.frame_1) from 2017-04-01 to the observations under 2017-03-01 and 2017-02-01; from 2017-07-01 to the observations under 2017-06-01 and 2017-05-01; and so on. How can I do that?

Answer

Please provide a reproducible example next time. Here I have created a minimal one for you.

# question1 ---------------------------------------------------------------

library(lubridate)
date <- as_date("2017-01-01")+months(0:35)
values <- c(1:36)
df <- data.frame(date,values)
# question 1: replace all 2019 values with the May 2019 value
df$newvalue <- ifelse(year(df$date)==2019,df$values[df$date=="2019-05-01"],df$values)
tail(df,10)
#>          date values newvalue
#> 27 2019-03-01     27       29
#> 28 2019-04-01     28       29
#> 29 2019-05-01     29       29
#> 30 2019-06-01     30       29
#> 31 2019-07-01     31       29
#> 32 2019-08-01     32       29
#> 33 2019-09-01     33       29
#> 34 2019-10-01     34       29
#> 35 2019-11-01     35       29
#> 36 2019-12-01     36       29
# as you can see, newvalue holds the May 2019 value for every 2019 row


# question 2: replacing the values of months 2 and 3 by month 4, and 5 and 6 by 7 --------
# define the reference date to copy from for each row
df$refdate <- ifelse(month(df$date) %in% c(2,3),paste(year(df$date),"04","01",sep="-"),as.character(df$date))
# build a full date string for July as well, so that as_date() can parse it
df$refdate <- ifelse(month(df$date) %in% c(5,6),paste(year(df$date),"07","01",sep="-"),df$refdate)
df$refdate <- as_date(df$refdate)
# look up the value stored at the reference date
df$result <- df$values[match(df$refdate,df$date)]
# > head(df[,c("date","refdate","result")],8)
# date    refdate result
# 1 2017-01-01 2017-01-01      1
# 2 2017-02-01 2017-04-01      4
# 3 2017-03-01 2017-04-01      4
# 4 2017-04-01 2017-04-01      4
# 5 2017-05-01 2017-07-01      7
# 6 2017-06-01 2017-07-01      7
# 7 2017-07-01 2017-07-01      7
# 8 2017-08-01 2017-08-01      8

# as you can see, Feb and March were replaced by the April value,
# and May and June by the July value
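
The script above hard-codes months 2, 3, 5 and 6. If the "and so on" in the question means that every month should take the value of the next quarter start, the reference date can be computed for all months at once. This is only a sketch in base R: `next_quarter_start` is a hypothetical helper name, the data are the same made-up 1..36 series, and rolling November and December over to the following January is an assumption about what the asker wants.

```r
# Toy monthly data (made-up values 1..36), same shape as the example above.
df <- data.frame(
  date   = seq(as.Date("2017-01-01"), by = "month", length.out = 36),
  values = 1:36
)

# Hypothetical helper: the first quarter-start date (1 Jan, 1 Apr, 1 Jul, 1 Oct)
# on or after each input date. Assumption: Nov and Dec map to next January.
next_quarter_start <- function(d) {
  lt <- as.POSIXlt(d)
  y <- lt$year + 1900L                       # calendar year
  m <- lt$mon + 1L                           # calendar month, 1..12
  target <- ((m + 1L) %/% 3L) * 3L + 1L      # 1, 4, 7, 10 or 13
  y <- y + (target - 1L) %/% 12L             # 13 means January of the next year
  target <- (target - 1L) %% 12L + 1L
  as.Date(sprintf("%04d-%02d-01", y, target))
}

df$refdate <- next_quarter_start(df$date)

# match() finds the row holding each reference date; it returns NA
# when that date is not present in df$date.
df$result <- df$values[match(df$refdate, df$date)]

head(df, 8)
```

Rows whose reference date lies outside the data (here November and December 2019, which point at 2020-01-01) get NA, so they are easy to spot and handle separately.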

This way, the very useful function match lets you avoid any explicit loop. I always try to rely on it before writing a loop of any kind.
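
For Question 1, data.frame_2 has several CONNECTIVITY rows per date and region, so a one-to-one match lookup does not fit directly. One possible alternative, sketched below, is to join on the calendar year instead of the exact date. The two small data frames are made-up stand-ins for data.frame_1 and data.frame_2 (only region A and two connectivity bands), and the `YEAR` column and the `clinics`, `conn` and `filled` names are introduced here for illustration only.

```r
# Made-up stand-in for data.frame_1 (quarterly clinic rows, region A only).
clinics <- data.frame(
  DATE           = as.Date(c("2018-10-01", "2019-01-01", "2019-04-01", "2020-01-01")),
  REGION_ID      = "A",
  NR_INDIVIDUALS = c(4, 3, 2, 6)
)

# Stand-in for data.frame_2: one reference date per year, two bands per date.
conn <- data.frame(
  DATE         = as.Date(rep(c("2019-09-01", "2020-05-01"), each = 2)),
  REGION_ID    = "A",
  CONNECTIVITY = rep(c("10<30Mbit/s", "30<300Mbit/s"), times = 2),
  PERCENTAGE   = c(60, 10, 25, 35)
)

# Join key: the calendar year, so every 2019 row picks up the 2019-09-01
# values and every 2020 row picks up the 2020-05-01 values.
clinics$YEAR <- as.integer(format(clinics$DATE, "%Y"))
conn$YEAR    <- as.integer(format(conn$DATE, "%Y"))

# Left join: keep every clinic row; years without connectivity data stay NA.
filled <- merge(
  x = clinics,
  y = conn[, c("YEAR", "REGION_ID", "CONNECTIVITY", "PERCENTAGE")],
  by = c("YEAR", "REGION_ID"),
  all.x = TRUE
)
filled <- filled[order(filled$DATE, filled$CONNECTIVITY), ]
filled
```

Each clinic row from 2019 or 2020 is repeated once per connectivity band, while the 2018 row keeps NA because no reference period exists for that year. This assumes exactly one reference date per year in the connectivity table; with more than one, the key would need to be chosen differently.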