How to calculate a two-year rolling share in R

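Problem description

I have a large data frame (900k rows) on mergers and acquisitions (M&As).

df has four columns: date (the year the M&A was completed), target_nation (the country of the company that was merged/acquired), acquiror_nation (the country of the acquiring company), and big_corp_TF (whether the acquirer is a big corporation; TRUE means it is).

Here is a sample of my df: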

> df <- structure(list(date = c(2000L,2000L,2001L,2001L,2001L,2002L,2002L,2002L),target_nation = rep("Uganda",8),acquiror_nation = c("France","Germany","France","France","Germany","France","France","Germany"),big_corp_TF = c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE)),row.names = c(NA,-8L),class = "data.frame")

> df 

   date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France        TRUE
2: 2000        Uganda         Germany       FALSE
3: 2001        Uganda          France        TRUE
4: 2001        Uganda          France       FALSE
5: 2001        Uganda         Germany       FALSE
6: 2002        Uganda          France        TRUE
7: 2002        Uganda          France        TRUE
8: 2002        Uganda         Germany        TRUE

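From this data I want to create a new variable that gives the share of M&As carried out by big corporations of a specific acquiror nation, calculated as a 2-year average. (In my actual exercise I will use a 5-year average, but let's keep it simple here.) So there would be one new variable for big corporations from France and another for big corporations from Germany.

What I have managed so far is to 1) count the total number of M&As for a given target nation in a given year, and 2) count the number of M&As by big corporations of a given acquiror nation in a given target nation in a given year. I joined the two data frames to make the average I want easier to compute. Here is the code I used and the resulting new df: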

##counting total rows for target nations
df2 <- df %>%
 group_by(date,target_nation) %>%
 count(target_nation)

##counting total rows conducted by small or big corps for certain acquiror nations

df3 <- df %>%
  group_by(date,target_nation,acquiror_nation) %>%
  count(big_corp_TF)

##selecting rows that were conducted by big corps

df33 <- df3 %>%
  filter(big_corp_TF == TRUE)

##merging df2 and df33

df4 <- df2 %>%
  left_join(df33,by = c("date" = "date","target_nation" = "target_nation"))

df4 <- as.data.frame(df4)

> df4

  date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000        Uganda   2          France        TRUE   1
2 2001        Uganda   3          France        TRUE   1
3 2002        Uganda   3          France        TRUE   2
4 2002        Uganda   3         Germany        TRUE   1

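Here n.x is the total number of M&As (rows) for a given target_nation in a given year; n.y is the number of M&As (rows) carried out in that target_nation by big corporations of a given acquiror_nation.

With this new data frame df4, I can easily compute the share of M&As carried out by big corporations of a given acquiror nation in a given target nation for a single year. For example, for France: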

df5 <- df4 %>% 
  filter(acquiror_nation == "France") %>%
  mutate(France_bigcorp_share_1year = n.y / n.x)

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.3333333
3 2002        Uganda   3          France        TRUE   2                  0.6666667

However, I don't know how to compute this share as a 2-year average.

This is what the desired variable would look like:

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.4000000
3 2002        Uganda   3          France        TRUE   2                  0.5000000

Note that the share for 2000 stays the same, because there is no preceding year to make it a 2-year average; 2001 becomes 0.4 (since (1+1)/(2+3) = 0.4); and 2002 becomes 0.5 (since (1+2)/(3+3) = 0.5).
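The arithmetic in this note can be checked with a few lines of base R. This is only a sketch of the calculation described above, not code from the thread: the vectors n_x and n_y copy the France rows of df4, and prev() is a hypothetical helper introduced here for illustration.

```r
# Yearly counts for France -> Uganda, copied from df4
n_x <- c(2, 3, 3)  # all deals targeting Uganda in 2000, 2001, 2002
n_y <- c(1, 1, 2)  # deals by French big corporations in the same years

# prev() (hypothetical helper) shifts a vector by one year;
# the first year has no predecessor, so it contributes 0
prev <- function(v) c(0, head(v, -1))

# Pooled two-year share: sum of the counts over sum of the totals
share_2years <- (n_y + prev(n_y)) / (n_x + prev(n_x))

print(round(share_2years, 3))
```

Note that this is a pooled ratio (two-year sum of n.y over two-year sum of n.x), not the mean of the two yearly shares: for 2001 the mean of 0.5 and 0.333 would be about 0.417, not 0.4.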

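Do you know how to write code that computes the share averaged over two years? I think I need a for loop here, but I don't know how to set it up. Any suggestions would be much appreciated.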

--

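Edit: AnilGoyal's code works perfectly with the sample data, but my actual data is evidently messier, so I wonder whether there is a solution to the problems I ran into.

My actual dataset sometimes skips a year, or sometimes does not include an acquiror_nation that appeared in earlier rows. Here is a more accurate sample of my actual data: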

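> df_new <- structure(list(date = c(2000L,2000L,2001L,2001L,2001L,2002L,2002L,2002L,2003L,2003L,2004L,2004L,2004L,2006L,2006L),target_nation = rep("Uganda",15),acquiror_nation = c("France","Germany","France","France","Germany","France","France","Germany","Germany","Germany","France","France","Germany","France","France"),big_corp_TF = c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE)),row.names = c(NA,-15L),class = "data.frame")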

> df_new 

    date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France     TRUE
 2: 2000        Uganda         Germany    FALSE
 3: 2001        Uganda          France     TRUE
 4: 2001        Uganda          France    FALSE
 5: 2001        Uganda         Germany    FALSE
 6: 2002        Uganda          France     TRUE
 7: 2002        Uganda          France     TRUE
 8: 2002        Uganda         Germany     TRUE
 9: 2003        Uganda         Germany     TRUE
10: 2003        Uganda         Germany    FALSE
11: 2004        Uganda          France     TRUE
12: 2004        Uganda          France    FALSE
13: 2004        Uganda         Germany     TRUE
14: 2006        Uganda          France     TRUE
15: 2006        Uganda          France     TRUE

Note: there are no rows for France in 2003, and there is no 2005 at all.

If I run Anil's first code, the result is the following tibble:

   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6

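Note: there are no results for France for 2003 and 2005; I would like results for those years (since we are computing 2-year averages, it should be possible to get results for 2003 and 2005). Also, the share for 2006 is actually incorrect: it should be 1, because the average should use the 2005 values (zeros) rather than the 2004 values.

I would like to get the following tibble: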

       date target_nation acquiror_nation    n1    n2 share
      <int> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1   0.5
    2  2001 Uganda        France              3     1   0.4
    3  2002 Uganda        France              3     2   0.5
    4  2003 Uganda        France              2     0   0.4
    5  2004 Uganda        France              3     1   0.2
    6  2005 Uganda        France              0     0   0.33
    7  2006 Uganda        France              2     2   1.0

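Note: the result for 2006 is also different (because the two-year average now uses 2005 instead of 2004).

Do you think there is a way to produce the desired tibble? I understand this is a problem with the original data: it is simply missing certain data points. However, adding them to the original dataset seems very inconvenient; it would be better to add them partway through, e.g. after counting n1 and n2. What would be the most convenient way to do that?

EDIT2: Anil's new code works well with the data sample above, but runs into an undesired problem with a more complex data sample that includes more than one target_nation. Here is a shorter but more complex data sample: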

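> df_new_complex <- structure(list(date = c(2000L,2000L,2001L,2001L,2001L,2003L,2003L,1999L,2001L,2002L,2002L),target_nation = c(rep("Uganda",7),rep("Mozambique",4)),acquiror_nation = c("France","Germany","France","France","Germany","Germany","Germany","Germany","France","France","Germany"),big_corp_TF = c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE)),row.names = c(NA,-11L),class = "data.frame")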

> df_new_complex 

date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France        TRUE
 2: 2000        Uganda         Germany       FALSE
 3: 2001        Uganda          France        TRUE
 4: 2001        Uganda          France       FALSE
 5: 2001        Uganda         Germany       FALSE
 6: 2003        Uganda         Germany        TRUE
 7: 2003        Uganda         Germany       FALSE
 8: 1999    Mozambique         Germany       FALSE
 9: 2001    Mozambique          France        TRUE
10: 2002    Mozambique          France       FALSE
11: 2002    Mozambique         Germany        TRUE

As you can see, this data sample includes two target_nations. Anil's code, with param <- c("France","Germany"), produces the following tibble:

    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0    
 2  1999 Mozambique    Germany             1     0 0    
 3  1999 Uganda        France              0     0 0    
 4  1999 Uganda        Germany             0     0 0    
 5  2000 Mozambique    France              0     0 0    
 6  2000 Mozambique    Germany             0     0 0    
 7  2000 Uganda        France              2     1 0.25 
 8  2000 Uganda        Germany             2     0 0.167
 9  2001 Mozambique    France              1     1 0.4  
10  2001 Mozambique    Germany             1     0 0.333
11  2001 Uganda        France              3     1 0.333
12  2001 Uganda        Germany             3     0 0.25 
13  2002 Mozambique    France              2     0 0.2  
14  2002 Mozambique    Germany             2     1 0.25 
15  2002 Uganda        France              0     0 0.25 
16  2002 Uganda        Germany             0     0 0.25 
17  2003 Mozambique    France              0     0 0.25 
18  2003 Mozambique    Germany             0     0 0.25 
19  2003 Uganda        France              2     0 0.167
20  2003 Uganda        Germany             2     1 0.25 

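What is undesirable here is that the code creates the year 1999 for Uganda and the year 2003 for Mozambique (the latter is less of a problem). In 1999 Uganda has no investments in the data sample, so it makes no sense to have values for it (it could be NA, or absent altogether). Mozambique also has no investments in 2003, so I don't want shares computed for Mozambique for that year.

I found a workaround in which I filter for a specific target nation early in the code, like this: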

correct1 <- df_new_complex %>% 
  filter(target_nation == "Mozambique") %>%
  mutate(d = 1) %>% ...

#I do the same for another target_nation

correct2 <- df_new_complex %>% 
  filter(target_nation == "Uganda") %>%
  mutate(d = 1) %>% ...

#I then use rbind

correct <- rbind(correct1,correct2)

#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).

> correct 

date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0    
 2  1999 Mozambique    Germany             1     0 0    
 3  2000 Mozambique    France              0     0 0    
 4  2000 Mozambique    Germany             0     0 0    
 5  2001 Mozambique    France              1     1 1    
 6  2001 Mozambique    Germany             1     0 0 
 7  2002 Mozambique    France              2     0 0.33 
 8  2002 Mozambique    Germany             2     1 0.333
 9  2000 Uganda        France              2     1 0.5  
10  2000 Uganda        Germany             2     0 0.25 
11  2001 Uganda        France              3     1 0.286
12  2001 Uganda        Germany             3     0 0.2  
13  2002 Uganda        France              0     0 0.167
14  2002 Uganda        Germany             0     0 0.167
15  2003 Uganda        France              2     0 0    
16  2003 Uganda        Germany             2     1 0.25 

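Is there a faster way to do this? I have a list of the desired target_nations. Perhaps I could write a loop that does the computation for one target_nation, then the next, then rbinds them, then the next, and so on. Or is there a better way?

Solution

Using the runner package, you can do something like this: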

df <- structure(list(date = c(2000L,2000L,2001L,2001L,2001L,2002L,2002L,2002L),target_nation = rep("Uganda",8),acquiror_nation = c("France","Germany","France","France","Germany","France","France","Germany"),big_corp_TF = c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE)),row.names = c(NA,-8L),class = "data.frame")

library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>% 
  group_by(date,target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date,target_nation,acquiror_nation) %>%
  summarise(n1 = mean(n1),n2 = sum(big_corp_TF),.groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2,k=2)/sum_run(n1,k=2))
#> # A tibble: 3 x 6
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1   0.5
#> 2  2001 Uganda        France              3     1   0.4
#> 3  2002 Uganda        France              3     2   0.5

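You can even do this for all acquiror nations at once:

df %>% 
  group_by(date,target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date,target_nation,acquiror_nation) %>%
  summarise(n1 = mean(n1),n2 = sum(big_corp_TF),.groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2,k=2)/sum_run(n1,k=2))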
#> # A tibble: 6 x 6
#> # Groups:   acquiror_nation [2]
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1 0.5  
#> 2  2000 Uganda        Germany             2     0 0    
#> 3  2001 Uganda        France              3     1 0.4  
#> 4  2001 Uganda        Germany             3     0 0    
#> 5  2002 Uganda        France              3     2 0.5  
#> 6  2002 Uganda        Germany             3     1 0.167

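For the revised scenario, you need to do two things:

  • Include the argument idx = date in both sum_run calls. This corrects the output as required, but shares for missing rows/years are still not included.
  • To include the missing years as well, you also need tidyr::complete, as shown below:
param <- 'France'
df_new %>% 
  mutate(d = 1) %>%
  complete(date = seq(min(date),max(date),1),nesting(target_nation,acquiror_nation),fill = list(d =0,big_corp_TF = FALSE)) %>%
  group_by(date,target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date,target_nation,acquiror_nation) %>%
  summarise(n1 = mean(n1),n2 = sum(big_corp_TF),.groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2,k=2,idx = date)/sum_run(n1,k=2,idx = date))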

# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              2     0 0.4  
5  2004 Uganda        France              3     1 0.2  
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
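To make the effect of idx = date concrete, here is a minimal base-R sketch. It does not use runner: row_window_sum and year_window_sum are hypothetical helpers written only for illustration. It compares a window over the last two rows with a window over the last two calendar years, using France's yearly counts from df_new before any missing years are filled in.

```r
# France -> Uganda yearly counts from df_new; there are no rows for 2003 or 2005
year <- c(2000, 2001, 2002, 2004, 2006)
n1   <- c(2, 3, 3, 3, 2)   # all deals targeting Uganda in that year
n2   <- c(1, 1, 2, 1, 2)   # deals by French big corporations in that year

# Row-based window: the current row plus the k - 1 rows before it
row_window_sum <- function(x, k) {
  sapply(seq_along(x), function(i) sum(x[max(1, i - k + 1):i]))
}

# Year-based window: rows whose year lies in (year[i] - k, year[i]]
year_window_sum <- function(x, idx, k) {
  sapply(seq_along(x), function(i) sum(x[idx > idx[i] - k & idx <= idx[i]]))
}

share_by_row  <- row_window_sum(n2, 2) / row_window_sum(n1, 2)
share_by_year <- year_window_sum(n2, year, 2) / year_window_sum(n1, year, 2)

print(data.frame(year, share_by_row, share_by_year))
```

The row-based column reproduces the 0.6 for 2006 that the question flagged, while the year-based column gives 1. The year-based value for 2004 is still 1/3 rather than the desired 0.2, because Uganda's 2003 total never appears in France's rows; that is the gap that completing the missing years fills.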

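Similar to the above, you can do this for all acquiror nations at once (by replacing the filter with a group_by):

df_new %>% 
  mutate(d = 1) %>%
  complete(date = seq(min(date),max(date),1),nesting(target_nation,acquiror_nation),fill = list(d =0,big_corp_TF = FALSE)) %>%
  group_by(date,target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date,target_nation,acquiror_nation) %>%
  summarise(n1 = mean(n1),n2 = sum(big_corp_TF),.groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2,k=2,idx = date)/sum_run(n1,k=2,idx = date))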

# A tibble: 14 x 6
# Groups:   acquiror_nation [2]
    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  2000 Uganda        France              2     1 0.5  
 2  2000 Uganda        Germany             2     0 0    
 3  2001 Uganda        France              3     1 0.4  
 4  2001 Uganda        Germany             3     0 0    
 5  2002 Uganda        France              3     2 0.5  
 6  2002 Uganda        Germany             3     1 0.167
 7  2003 Uganda        France              2     0 0.4  
 8  2003 Uganda        Germany             2     1 0.4  
 9  2004 Uganda        France              3     1 0.2  
10  2004 Uganda        Germany             3     1 0.4  
11  2005 Uganda        France              0     0 0.333
12  2005 Uganda        Germany             0     0 0.333
13  2006 Uganda        France              2     2 1    
14  2006 Uganda        Germany             2     0 0

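Further edit

  • This is easy. Remove target_nation from nesting and add a group_by(target_nation) before complete.

Simple, isn't it?

df_new_complex %>%
  mutate(d = 1) %>%
  group_by(target_nation) %>%
  complete(date = seq(min(date),max(date),1),nesting(acquiror_nation),fill = list(d = 0,big_corp_TF = FALSE)) %>%
  group_by(date,target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date,target_nation,acquiror_nation) %>%
  summarise(n1 = mean(n1),n2 = sum(big_corp_TF),.groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2,k=2)/sum_run(n1,k=2))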

# A tibble: 16 x 6
# Groups:   acquiror_nation [2]
    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0    
 2  1999 Mozambique    Germany             1     0 0    
 3  2000 Mozambique    France              0     0 0    
 4  2000 Mozambique    Germany             0     0 0    
 5  2000 Uganda        France              2     1 0.5  
 6  2000 Uganda        Germany             2     0 0    
 7  2001 Mozambique    France              1     1 0.667
 8  2001 Mozambique    Germany             1     0 0    
 9  2001 Uganda        France              3     1 0.5  
10  2001 Uganda        Germany             3     0 0    
11  2002 Mozambique    France              2     0 0.2  
12  2002 Mozambique    Germany             2     1 0.2  
13  2002 Uganda        France              0     0 0    
14  2002 Uganda        Germany             0     0 0.5  
15  2003 Uganda        France              2     0 0    
16  2003 Uganda        Germany             2     1 0.5 
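The shares printed above appear to pair consecutive rows within each acquiror_nation, so a window can span two different target nations. For example, Uganda/France in 2001 shows 0.5 = (1+1)/(1+3), which pools Mozambique 2001 with Uganda 2001 instead of Uganda 2000 with Uganda 2001. The following is a hedged sketch of one way to keep each window inside a single target nation. It is an assumption about the intended result, not the answerer's code: it groups by both target_nation and acquiror_nation and passes idx = date to sum_run. The data frame is rebuilt from the printed df_new_complex table.

```r
library(dplyr)
library(tidyr)
library(runner)

# Rebuilt from the printed df_new_complex table
df_new_complex <- data.frame(
  date = c(2000, 2000, 2001, 2001, 2001, 2003, 2003, 1999, 2001, 2002, 2002),
  target_nation = c(rep("Uganda", 7), rep("Mozambique", 4)),
  acquiror_nation = c("France", "Germany", "France", "France", "Germany",
                      "Germany", "Germany", "Germany", "France", "France", "Germany"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
                  FALSE, TRUE, FALSE, TRUE)
)

shares <- df_new_complex %>%
  mutate(d = 1) %>%
  # fill in missing years and acquirors separately for each target nation
  group_by(target_nation) %>%
  complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
           fill = list(d = 0, big_corp_TF = FALSE)) %>%
  # n1: all real deals for this target nation in this year
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  # n2: deals by big corporations of this acquiror nation
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1), n2 = sum(big_corp_TF), .groups = "drop") %>%
  arrange(target_nation, acquiror_nation, date) %>%
  # assumption: one rolling window per (target, acquiror) pair, indexed by year
  group_by(target_nation, acquiror_nation) %>%
  mutate(share = sum_run(n2, k = 2, idx = date) / sum_run(n1, k = 2, idx = date)) %>%
  ungroup()

print(shares, n = 16)
```

Under these assumptions, Uganda/France for 2001 comes out as (1+1)/(2+3) = 0.4 and Mozambique/France for 2001 as (0+1)/(0+1) = 1. If a two-year window contained no deals at all for a target nation, the division would give NaN, so real data may need an explicit guard for that case.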
--

I noticed that you deleted your original question.

With my solution, bigcorp_share_2years can be computed directly, even without rows for 2003 and 2005:

library(data.table)
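df_new <- structure(list(date = c(2000L,2000L,2001L,2001L,2001L,2002L,2002L,2002L,2003L,2003L,2004L,2004L,2004L,2006L,2006L),target_nation = rep("Uganda",15),acquiror_nation = c("France","Germany","France","France","Germany","France","France","Germany","Germany","Germany","France","France","Germany","France","France"),big_corp_TF = c(TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE)),row.names = c(NA,-15L),class = "data.frame")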
setDT(df_new)

# NY is the total observation number for two consecutive years.
this = 0
df_new[,NR  := .N,by = date] # NR is each group's length
df_new[,NY  := { last = this; this = last(NR); last + this },by = date]
# special case: a year with no preceding year in the data (e.g. 2006) keeps its own count
df_new[,NY  := ifelse( (date - 1) %in% date,NY,NR)]

# snx: count big_corp_TF for acquiror_nation,which will be used to calculate NX
df_new[,snx := sum(big_corp_TF),by = .(date,acquiror_nation)]

# df2: drop column big_corp_TF so that unique() can collapse duplicate rows
df2 <- df_new[,c(1:3,5:7)][,unique(.SD)]

# roll count for two consecutive years
df2[,NX := frollsum(snx,2),by=.(acquiror_nation)]
df2[,NX := ifelse( (date - 1) %in% date,NX,snx),acquiror_nation][]
#>     date target_nation acquiror_nation NR NY snx NX
#>  1: 2000        Uganda          France  2  2   1  1
#>  2: 2000        Uganda         Germany  2  2   0  0
#>  3: 2001        Uganda          France  3  5   1  2
#>  4: 2001        Uganda         Germany  3  5   0  0
#>  5: 2002        Uganda          France  3  6   2  3
#>  6: 2002        Uganda         Germany  3  6   1  1
#>  7: 2003        Uganda         Germany  2  5   1  2
#>  8: 2004        Uganda          France  3  5   1  1
#>  9: 2004        Uganda         Germany  3  5   1  2
#> 10: 2006        Uganda          France  2  2   2  2

df2[,bigcorp_share_2years := NX/NY]
df2[,.(date,target_nation,NY,NX,bigcorp_share_2years),by=.(acquiror_nation)]
#>     acquiror_nation date target_nation NY NX bigcorp_share_2years
#>  1:          France 2000        Uganda  2  1            0.5000000
#>  2:          France 2001        Uganda  5  2            0.4000000
#>  3:          France 2002        Uganda  6  3            0.5000000
#>  4:          France 2004        Uganda  5  1            0.2000000
#>  5:          France 2006        Uganda  2  2            1.0000000
#>  6:         Germany 2000        Uganda  2  0            0.0000000
#>  7:         Germany 2001        Uganda  5  0            0.0000000
#>  8:         Germany 2002        Uganda  6  1            0.1666667
#>  9:         Germany 2003        Uganda  5  2            0.4000000
#> 10:         Germany 2004        Uganda  5  2            0.4000000

Created on 2021-05-03 by the reprex package (v2.0.0)