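Data frame: conditionally compare a specific value with all values of another column and store the resulting vectors in a list column

Problem description

I am running some simulations on a model (simplified: lm(col_B ~ col_A)), and for this I want to define new x values sampled from the old x values, subject to several conditions. For each of my species (1-20), I want to generate a vector containing all values of col_A (regardless of species) that satisfy a certain condition. From each of these vectors I later want to sample one value as the "new" col_B value to use as the x value. I can easily do this by defining all of these vectors separately: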

# make up example data
dat <- data.frame("species" = 1:20, "col_A" = runif(20, min = 1000, max = 2500), "col_B" = runif(20, min = 0, max = 1500), "maximum" = rep(2500, 20))

# define vectors for all species
possible_1 <- dat$col_A[dat$col_A <= max(dat$maximum)-dat$col_B[1]]
possible_2 <- dat$col_A[dat$col_A <= max(dat$maximum)-dat$col_B[2]]
possible_3 <- dat$col_A[dat$col_A <= max(dat$maximum)-dat$col_B[3]]

# etc.

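Of course, I could also write a loop that stores the vectors in a new data frame, and so on. However, I am wondering whether it is possible to store these vectors in a list column of my original data frame, so that I can sample from them later while keeping everything in dplyr and one nice pipe. Also, to create the vectors I need to compare the col_B value of a specific species with all col_A values of all species, and I do not know how to do that. So my questions are:

1. How can I create a list column that holds vectors of different lengths for all species?
2. Is it possible to compare a specific value in one column with all values in another column?

Desired output: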

dat_end <- structure(list(
  species = 1:20,
  col_A = c(1201.07331767213, 1248.07284446433, 1721.88013594132, 1811.97518436238,
            1957.70114450715, 2003.58936993871, 2017.67337811179, 1835.36564861424,
            1297.55500191823, 2309.16906765196, 1811.72096473165, 1041.0662824288,
            1890.41095413268, 2180.55545398965, 2254.29310277104, 2146.93792897742,
            1086.34597295895, 1633.36910132784, 2027.77895331383, 1044.20079500414),
  col_B = c(1316.56480673701, 698.999502696097, 486.406950862147, 362.3069843743,
            774.72961822059, 5.33419672865421, 261.744535993785, 252.763583441265,
            1466.9924180489, 926.854150719009, 28.8207863923162, 1203.98436568212,
            669.935327139683, 1270.13235166669, 1010.53655776195, 649.534532683901,
            1407.57358598057, 1376.92801596131, 701.711902976967, 783.507982618175),
  maximum = rep(2500, 20),                          # one value per row
  vector_column = paste0("vector_", 1:20),          # placeholder for the per-species vectors
  col_B_new = rep("sample(vector_column,1)", 20)    # placeholder for the sampled value
), class = "data.frame", row.names = c(NA, -20L))

Thanks for any hints!

Solution

library(tidyverse)

dat_end <- dat %>%
  rowwise() %>%
  mutate(
    # list() lets a whole vector be stored in one cell of a list column
    vector_column = list(dat$col_A[dat$col_A <= maximum - col_B]),
    # some vectors are empty, so guard before sampling; index via sample.int()
    # because sample(x, 1) on a length-one numeric x draws from 1:x
    col_B_new = if (length(vector_column) > 0) vector_column[sample.int(length(vector_column), 1)] else NA_real_
  ) %>%
  ungroup()
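One caveat when drawing a single value from these vectors: in base R, sample(x, 1) treats a length-one numeric x >= 1 as the range 1:x, so a vector holding only 1003.874 would yield a random integer rather than 1003.874. The following is a minimal sketch of a guard; safe_sample_one is a hypothetical helper name and not part of the original answer.

```r
# safe_sample_one() is a hypothetical helper introduced for illustration only.
safe_sample_one <- function(x) {
  if (length(x) == 0) return(NA_real_)  # nothing to sample from
  x[sample.int(length(x), 1)]           # pick a position, then index: safe for length-one x
}

safe_sample_one(numeric(0))  # NA
safe_sample_one(1003.874)    # 1003.874
```

Sampling a position with sample.int() and then indexing avoids the special case entirely, at the cost of one extra line.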

Regarding question 1, see whether the following is what you want.

possible <- lapply(dat$col_B,function(B) {
  dat$col_A[dat$col_A <= max(dat$maximum) - B]
})

head(possible)
#[[1]]
# [1] 1970.354 1591.339 1927.753 1715.337 1204.146 1101.077 1193.729
# [8] 1589.677 1003.874 1930.309 2146.621 2115.754 1634.094 1613.931
#[15] 1809.539 1980.336 1820.073 1399.095
#
#[[2]]
# [1] 1970.354 1591.339 1927.753 1715.337 1204.146 1101.077 1193.729
# [8] 1589.677 1003.874 1930.309 2146.621 2115.754 2239.249 1634.094
#[15] 1613.931 1809.539 1980.336 1820.073 1399.095
#
#[[3]]
#[1] 1204.146 1101.077 1193.729 1003.874
#
#[[4]]
#[1] 1003.874
#
#[[5]]
#[1] 1101.077 1003.874
#
#[[6]]
# [1] 1970.354 1591.339 1927.753 1715.337 1204.146 1101.077 1193.729
# [8] 1589.677 1003.874 1930.309 2146.621 2115.754 2239.249 1634.094
#[15] 1613.931 1809.539 1980.336 1820.073 1399.095
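To get from this list to the list column asked for in the question, the list can be assigned directly as a column, since it has one element per row. The sketch below uses base R only and rebuilds the example data so that it runs on its own; the column names n_possible and col_B_new are choices made here for illustration.

```r
set.seed(2020)
dat <- data.frame(species = 1:20,
                  col_A = runif(20, min = 1000, max = 2500),
                  col_B = runif(20, min = 0, max = 1500),
                  maximum = rep(2500, 20))

possible <- lapply(dat$col_B, function(B) {
  dat$col_A[dat$col_A <= max(dat$maximum) - B]
})

# A list with one element per row becomes a list column.
dat$vector_column <- possible

# lengths() gives the number of candidates per species (can be 0).
dat$n_possible <- lengths(dat$vector_column)

# Draw one candidate per row; rows without candidates get NA.
dat$col_B_new <- vapply(dat$vector_column, function(v) {
  if (length(v) == 0) NA_real_ else v[sample.int(length(v), 1)]
}, numeric(1))
```

vapply() with numeric(1) is used here so that the result is guaranteed to be a plain numeric column with exactly one value per row.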

As for question 2, the answer is yes, it is possible. In fact, the code above compares dat$col_A with each of max(dat$maximum) - dat$col_B[1], max(dat$maximum) - dat$col_B[2], and so on.

Data
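The same all-against-all comparison can also be written in a single step with base R's outer(), which may make it easier to see which value is compared with which. This is only a sketch on a small made-up data frame (toy, with three rows chosen for readability), not the data from the question.

```r
# toy is hypothetical example data, chosen so the result is easy to check by hand.
toy <- data.frame(col_A = c(1000, 1500, 2000),
                  col_B = c(400, 900, 1600),
                  maximum = 2500)

# Rows follow col_B (the species whose threshold is used), columns follow col_A.
ok <- outer(toy$col_B, toy$col_A, function(b, a) a <= max(toy$maximum) - b)

# Turn each logical row into the vector of col_A values that passed.
possible_toy <- lapply(seq_len(nrow(ok)), function(i) toy$col_A[ok[i, ]])

rowSums(ok)  # 3 2 0: number of admissible col_A values per row
```

Row 3 has no admissible values because 2500 - 1600 = 900 is below every col_A value, which is the empty-vector case that has to be guarded against before sampling.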

set.seed(2020)
dat <- data.frame("species" =  1:20,"col_A" = runif(20,min=1000,max=2500),"col_B" = runif(20,min = 0,max = 1500),"maximum" = rep(2500,20))

data-manipulation dplyr r
