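Problem description
I am trying to generate a table of regression slopes produced by a custom function based on the mblm package (the function in the example here is a simplified version). The function takes a formula as an argument, and I would like to use dplyr's summarise to apply it to grouped samples in a large data frame with multiple variables. The output should be the regression slopes for each sample group and response variable, which I can then pass to a heatmap function.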
library(dplyr)
# Example data
test_data <-
  rbind(
    data.frame(ID=paste0("someName",c(1:9)),Sample_Type="type1",A=seq(1,17,length.out=9),I=0.1^seq(1,1.8,length.out=9),J=1-0.1^seq(1,1.8,length.out=9)),
    data.frame(ID=paste0("someName",c(10:15)),Sample_Type="type2",A=seq(1,7,length.out=6),I=0.1^(1-seq(1,1.5,length.out=6)),J=1-0.1^(1-seq(1,1.5,length.out=6))))
# Define an independent and the responding variables - I would like to be able to easily test different independent variables
idpVar <- "A"
respVar <- test_data %>% .[!names(.) %in% c("ID","Sample_Type",idpVar)] %>% names()
# Custom function generating numeric value of median slopes (simplified from mblm)
medianSlope <-
  function (formula,dataframe)
  {
    if (missing(dataframe))
      dataframe <- environment(formula)
    # term[1] is the response, term[2] the independent variable
    term <- as.character(attr(terms(formula),"variables")[-1])
    x = dataframe[[term[2]]]
    y = dataframe[[term[1]]]
    if (length(term) > 2) {
      stop("Only linear models are accepted")
    }
    xx = sort(x)
    yy = y[order(x)]
    n = length(xx)
    slopes = c()
    smedians = c()
    for (i in 1:n) {
      slopes = c()
      for (j in 1:n) {
        if (xx[j] != xx[i]) {
          slopes = c(slopes,(yy[j] - yy[i])/(xx[j] - xx[i]))
        }
      }
      smedians = c(smedians,median(slopes))
    }
    slope = median(smedians)
    slope
  }
# Custom function works with test dataframe and a single named dependent variable but "group_by" seems to be ignored:
test_data %>% group_by (Sample_Type) %>% medianSlope( formula(paste("J","~",idpVar)),.)
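Setting the grouping problem aside for the moment, I tried to get summarise to work by generating a vector of several formulas:
paste(respVar,"~",idpVar)
[1] "B ~ A" "C ~ A" "D ~ A" "E ~ A" "F ~ A" "G ~ A" "H ~ A" "I ~ A" "J ~ A" "K ~ A" "L ~ A"
But
test_data %>% summarise_at(respVar,medianSlope(paste(respVar,"~",idpVar),.))
Error: $ operator is invalid for atomic vectors
test_data %>% summarise_at(respVar,medianSlope(paste(get(respVar),"~",get(idpVar)),.))
Error in get(idpVar) : object 'A' not found
I am relatively new to R and a bit lost. Can you help?
Thank you, Thomas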
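Solution
I am not sure whether this can be done with the summarise_at function. However, we can combine map_dbl, by, and a few other data-cleaning functions to perform the calculation: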
library(tidyverse)
# split the data using `by` (acts as a group_by)
# use `map_dbl` to iterate over the variables in respVar
# we use setNames so that the returned vector from map_dbl is named
# then,bind the rows together,convert to data frame
# finally convert row names (groups) to a column
by(test_data,test_data$Sample_Type,
   FUN = function(d) map_dbl(setNames(respVar,respVar),
                             ~medianSlope(formula(paste(.x,"~",idpVar)),dataframe = d))) %>%
do.call("rbind",.) %>%
as.data.frame() %>%
rownames_to_column(var = "Sample_Type")
  Sample_Type            I            J
1       type1 -0.004623987  0.004623987
2       type2  0.341974269 -0.341974269
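As far as I can tell, the first error in the question comes from passing the character vector returned by paste() straight to medianSlope: terms() has no method for character input, so it falls back to a default that looks for x$terms, which fails on an atomic vector. The group_by step appears to be ignored for a related reason: medianSlope extracts columns with [[, which returns the whole column regardless of grouping. A small sketch of the first point (the string "J ~ A" is only an illustration):

```r
# A character string is not a formula: terms() dispatches to its default
# method, which tries `x$terms` and fails on an atomic vector.
f_chr <- "J ~ A"
err <- tryCatch(terms(f_chr), error = function(e) conditionMessage(e))
print(err)  # the captured error message

# Converting the string first gives terms() an object it understands.
f <- as.formula(f_chr)
print(all.vars(f))  # variable names in the formula, response first
```

This is why the solution above wraps each pasted string in formula() before handing it to medianSlope.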
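bouncyball, thanks again for your help. It does seem that summarise and mutate cannot call functions that take a formula as input, although I have not seen this explained anywhere else. The workaround was instructive, but I followed your other suggestion and rewrote the function being called. Still being a learner, I took on the challenge of replacing the for loops in the code derived from mblm and removing calculations that seemed redundant (at the cost of higher RAM demand, but it still runs much faster for my data on my PC, and I plan to reuse the dx matrix in the next step of developing the code). The two solutions are below. Cheers, Thomas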
mblm_2_short <- # code adapted from mblm(y ~ x,repeated = T),for calculation of repeated median slope only
  function (x,y)
  {
    xx = sort(x)
    yy = y[order(x)]
    n = length(xx)
    slopes = c()
    smedians = c()
    for (i in 1:n) {
      slopes = c()
      for (j in 1:n) {
        if (xx[j] != xx[i]) {
          slopes = c(slopes,(yy[j] - yy[i])/(xx[j] - xx[i]))
        }
      }
      smedians = c(smedians,median(slopes))
    }
    median(smedians) # returned visibly; a trailing assignment would return it invisibly
  }
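Because a function with an (x, y) signature works on plain vectors, it can be applied per group without building formulas. The following is only a sketch under assumptions: it relies on group_modify (available in dplyr 0.8.1 and later), and both `toy` and `repeated_median_slope` are hypothetical stand-ins written for this illustration, not the data or the functions posted here.

```r
library(dplyr)

# Stand-in repeated median slope for plain numeric vectors x and y
repeated_median_slope <- function(x, y) {
  per_point <- vapply(seq_along(x), function(i) {
    dx <- x[-i] - x[i]
    dy <- y[-i] - y[i]
    median((dy / dx)[dx != 0])  # skip pairs that share the same x
  }, numeric(1))
  median(per_point)
}

# Toy data with exactly linear responses, so the slopes are known in advance
toy <- data.frame(
  Sample_Type = rep(c("type1", "type2"), each = 5),
  A = rep(1:5, times = 2),
  I = c(2 * (1:5) + 1, -1 * (1:5)),
  J = c(0.5 * (1:5), 3 * (1:5) - 2)
)
idpVar  <- "A"
respVar <- c("I", "J")

# For each group, .x is that group's rows; return a one-row data frame
# holding one slope per response variable.
slopes <- toy %>%
  group_by(Sample_Type) %>%
  group_modify(~ as.data.frame(
    lapply(.x[respVar], function(y) repeated_median_slope(.x[[idpVar]], y))
  )) %>%
  ungroup()

print(slopes)
```

Changing idpVar to another column name is enough to test a different independent variable, which was one of the goals in the question.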
med_slopesMed <- # repeated median slope, like mblm(y ~ x,repeated = T),slope only
  function (xx,yy)
  {
    x = sort(xx)
    y = yy[order(xx)]
    n = length(x)
    dx = matrix (rep (0,n^2),ncol=n)
    z = matrix (rep (0,n^2),ncol=n)
    for (i in 1:(n-1)) { ### x-axis distances (dx) and slopes (z) between points
      dxi = x[-(1:i)]-x[i]
      dx [i,(i+1):n] = dxi # for point i,x-axis distances to all later points
      dyi = y[-(1:i)]-y[i]
      zi = dyi/dxi
      z [i,(i+1):n] = zi # for point i,linear slopes connecting with all later points
    }
    z = replace(z,is.infinite(z),NA) # +/-Inf generated by dx=0 becomes NA; NaN (0/0) is dropped by na.rm below
    z = t(z)[,-n] + z[,-1] # row i now holds the slopes from point i to every other point
    median (apply(z,1,median,na.rm=TRUE))
  }
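For cross-checking the loop-free version, the same repeated median can also be written with outer(). This is a sketch of an alternative, not the code above: `rm_slope_outer` is a hypothetical name, and like the matrix version it needs memory on the order of n^2.

```r
# Repeated median slope via outer(): z[i, j] is the slope between points i and j
rm_slope_outer <- function(x, y) {
  z <- outer(y, y, "-") / outer(x, x, "-")
  z[!is.finite(z)] <- NA            # diagonal (0/0) and tied x values (dx = 0)
  median(apply(z, 1, median, na.rm = TRUE))
}

# An exactly linear series recovers its slope
x <- 1:5
print(rm_slope_outer(x, 2 * x + 1))

# A single outlier in y does not move the estimate
print(rm_slope_outer(x, c(1, 2, 3, 4, 100)))
```

Sorting x first is not needed here, because every pair of points is visited and the medians do not depend on the order.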