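How do you simply apply a function to multiple subsets of different lengths in R?

Problem description

I need to apply a function to several subsets of one column, where the subsets have different lengths, and produce a new data frame containing the output together with its associated metadata.

How can I do this without resorting to a for loop? tapply() seems like a good starting point, but I'm struggling with the syntax.

For example, I have something like this: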

block plot id species type response
    1    1  1      w     a      1.5
    1    1  2      w     a      1
    1    1  3      w     a      2
    1    1  4      w     a      1.5
    1    2  5      x     a      5
    1    2  6      x     a      6
    1    2  7      x     a      7
    1    3  8      y     b      10 
    1    3  9      y     b      11
    1    3 10      y     b      9
    1    4 11      z     b      1
    1    4 12      z     b      3
    1    4 13      z     b      2
    2    5 14      w     a      0.5
    2    5 15      w     a      1
    2    5 16      w     a      1.5
    2    6 17      x     a      3
    2    6 18      x     a      2
    2    6 19      x     a      4
    2    7 20      y     b      13 
    2    7 21      y     b      12
    2    7 22      y     b      14
    2    8 23      z     b      2
    2    8 24      z     b      3
    2    8 25      z     b      4
    2    8 26      z     b      2
    2    8 27      z     b      4

I would like to produce something like this:

block plot species type mean.response
    1    1       w    a           1.5
    1    2       x    a           6
    1    3       y    b           10 
    1    4       z    b           2
    2    5       w    a           1
    2    6       x    a           3
    2    7       y    b           13
    2    8       z    b           3

Solution

Try this. You can use group_by() to set the grouping variables and then summarise() to compute the desired summary. This code uses dplyr:

library(dplyr)
# Group by the metadata columns, then average response within each group
newdf <- df %>%
  group_by(block, plot, species, type) %>%
  summarise(Mean = mean(response, na.rm = TRUE))

Output:

# A tibble: 8 x 5
# Groups:   block, plot, species [8]
  block  plot species type   Mean
  <int> <int> <chr>   <chr> <dbl>
1     1     1 w       a       1.5
2     1     2 x       a       6  
3     1     3 y       b      10  
4     1     4 z       b       2  
5     2     5 w       a       1  
6     2     6 x       a       3  
7     2     7 y       b      13  
8     2     8 z       b       3

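Or with base R (the -3 drops the id column so that it is not used as a grouping variable in the aggregation):

# Base R
newdf <- aggregate(response ~ ., data = df[, -3], FUN = mean, na.rm = TRUE)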

Output:

  block plot species type response
1     1    1       w    a      1.5
2     2    5       w    a      1.0
3     1    2       x    a      6.0
4     2    6       x    a      3.0
5     1    3       y    b     10.0
6     2    7       y    b     13.0
7     1    4       z    b      2.0
8     2    8       z    b      3.0
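
Since the question mentions tapply(), here is a minimal base R sketch of that route. It is not part of the original answer. The data frame is rebuilt compactly from the table in the question, the names n_per_plot, plot_means and meta are made up for illustration, and the approach assumes each plot belongs to exactly one block, species and type (true for this example data).

```r
# Rebuild the question's example data (values copied from the table above).
n_per_plot <- c(4, 3, 3, 3, 3, 3, 3, 5)   # number of rows in plots 1..8
df <- data.frame(
  block    = rep(c(1L, 2L), times = c(13, 14)),
  plot     = rep(1:8, times = n_per_plot),
  id       = 1:27,
  species  = rep(rep(c("w", "x", "y", "z"), 2), times = n_per_plot),
  type     = rep(rep(c("a", "a", "b", "b"), 2), times = n_per_plot),
  response = c(1.5, 1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2,
               0.5, 1, 1.5, 3, 2, 4, 13, 12, 14, 2, 3, 4, 2, 4)
)

# tapply() returns one mean per plot, named by the plot number.
plot_means <- tapply(df$response, df$plot, mean)

# Assumption: metadata is constant within a plot, so the first row of each
# plot is enough to describe it.
meta <- df[!duplicated(df$plot), c("block", "plot", "species", "type")]

# Attach the means by looking them up under the plot number.
meta$mean.response <- as.numeric(plot_means[as.character(meta$plot)])
meta
```

tapply() on its own only returns the summarised values, which is why the metadata has to be joined back by hand. aggregate() and group_by()/summarise() carry the grouping columns through automatically.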

Some data used:

#Data
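df <- structure(list(
  block = c(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L),
  plot = c(1L,1L,1L,1L,2L,2L,2L,3L,3L,3L,4L,4L,4L,5L,5L,5L,6L,6L,6L,7L,7L,7L,8L,8L,8L,8L,8L),
  id = 1:27,
  species = c("w","w","w","w","x","x","x","y","y","y","z","z","z","w","w","w","x","x","x","y","y","y","z","z","z","z","z"),
  type = c("a","a","a","a","a","a","a","b","b","b","b","b","b","a","a","a","a","a","a","b","b","b","b","b","b","b","b"),
  response = c(1.5,1,2,1.5,5,6,7,10,11,9,1,3,2,0.5,1,1.5,3,2,4,13,12,14,2,3,4,2,4)),
  class = "data.frame", row.names = c(NA,-27L))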

With the input dd shown in reproducible form in the Note at the end, use any of the following:

# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type,dd,mean)

# 2. aggregate.default - base R
v <- c("block","plot","species","type")
aggregate(list(mean.response = dd$response),dd[v],mean)

# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
  from dd group by 1, 2, 3, 4")

# 4. data.table
library(data.table)
v <- c("block","plot","species","type")
as.data.table(dd)[,.(mean.response = mean(response)),by = v]

# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type,dd)
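
As a quick sanity check, the sketch below runs the two base R approaches (1 and 2) on a compact rebuild of the example data and puts the rows in block/plot order. It is illustrative only and not part of the original answer. The rebuild and the names n_per_plot, a1 and a2 are made up, and the values are copied from the table in the question.

```r
# Compact rebuild of the example input (values from the question's table).
n_per_plot <- c(4, 3, 3, 3, 3, 3, 3, 5)   # number of rows in plots 1..8
dd <- data.frame(
  block    = rep(c(1L, 2L), times = c(13, 14)),
  plot     = rep(1:8, times = n_per_plot),
  id       = 1:27,
  species  = rep(rep(c("w", "x", "y", "z"), 2), times = n_per_plot),
  type     = rep(rep(c("a", "a", "b", "b"), 2), times = n_per_plot),
  response = c(1.5, 1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2,
               0.5, 1, 1.5, 3, 2, 4, 13, 12, 14, 2, 3, 4, 2, 4)
)

v <- c("block", "plot", "species", "type")

# Approach 1: formula interface; cbind() on the left names the output column.
a1 <- aggregate(cbind(mean.response = response) ~ block + plot + species + type,
                dd, mean)

# Approach 2: default interface; a named list of values plus the grouping columns.
a2 <- aggregate(list(mean.response = dd$response), dd[v], mean)

# aggregate() does not keep the input row order, so sort by block and plot.
a1 <- a1[order(a1$block, a1$plot), ]
a2 <- a2[order(a2$block, a2$plot), ]
rownames(a1) <- NULL
rownames(a2) <- NULL

a1
```

Both calls group on the same four columns, so after sorting they should list the same eight plots with the same means.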

Note

Input in reproducible form:

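dd <- structure(list(
  block = c(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L),
  plot = c(1L,1L,1L,1L,2L,2L,2L,3L,3L,3L,4L,4L,4L,5L,5L,5L,6L,6L,6L,7L,7L,7L,8L,8L,8L,8L,8L),
  id = 1:27,
  species = c("w","w","w","w","x","x","x","y","y","y","z","z","z","w","w","w","x","x","x","y","y","y","z","z","z","z","z"),
  type = c("a","a","a","a","a","a","a","b","b","b","b","b","b","a","a","a","a","a","a","b","b","b","b","b","b","b","b"),
  response = c(1.5,1,2,1.5,5,6,7,10,11,9,1,3,2,0.5,1,1.5,3,2,4,13,12,14,2,3,4,2,4)),
  class = "data.frame", row.names = c(NA,-27L))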