R: Count distinct IDs where the selected column is not null

Question

I have the following data frame:

user_id <- c(97,97,96,95,94,94)
event_id <- c(42,15,43,12,44,32,38,10,11)
plan_id <- c(NA,NA,30,25)
treatment_id <- c(NA,20,28,41,17,32)
system <- c(1,1,2,NA)

df <- data.frame(user_id, event_id, plan_id, treatment_id, system)

For each column, I want to count the number of distinct user_id values, excluding rows where that column is NA. My desired output is:

      user_id   event_id    plan_id   treatment_id  system
  1   4         4           3         4             2

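I tried using mutate_all, but without success because my data frame is too large. In other functions, I have used the following two lines to get the non-null count and the distinct count for each column: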

colSums(!is.empty(df[,]))
apply(df[,], 2, function(x) length(unique(x)))

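Ideally, I would like to combine the two with ifelse to minimise mutations, because this will eventually go into a function that is applied to a list of data frames along with many other summary statistics.

I tried a brute-force approach: set the value to 1 if it is not null and 0 otherwise, then copy the id into that column wherever it is 1. I could then use the count-distinct line above to get my output. However, I get the wrong values when copying the id into the other columns, and the number of adjustments is not optimal. See the code: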

binary <- cbind(df$user_id,!is.empty(df[,2:length(df)]))
copied <- binary %>% replace(. > 0,binary[.,1])

Any help would be much appreciated.

Answers

1: base R

sapply(df,function(x){
    length(unique(df$user_id[!is.na(x)]))
})
#     user_id     event_id      plan_id treatment_id       system 
#           4            4            3            3            2
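
The vectors as posted in the question have different lengths, so `data.frame()` would not accept them as they stand. As a rough illustration of what the `sapply` call computes, here is a minimal sketch on a hypothetical six-row data frame (`df_demo` and all of its values are invented for illustration and are not the asker's data). For each column it keeps the rows where that column is not NA and counts the distinct `user_id` values among them.

```r
# Hypothetical data: six rows, values invented for illustration only
df_demo <- data.frame(
  user_id      = c(1, 1, 2, 3, 4, 4),
  event_id     = c(10, 11, 12, 13, 14, 15),
  plan_id      = c(NA, 5, NA, NA, 6, 7),
  treatment_id = c(NA, 8, 9, NA, 8, 9),
  system       = c(1, 1, NA, NA, NA, NA)
)

# For each column: keep rows where the column is not NA,
# then count the distinct user_id values among those rows
counts <- sapply(df_demo, function(x) {
  length(unique(df_demo$user_id[!is.na(x)]))
})

print(counts)
# user_id 4, event_id 4, plan_id 2, treatment_id 3, system 1
```

Note that `system` gives 1 here even though it has two non-NA rows, because both rows belong to the same `user_id`.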

2: base R

aggregate(user_id ~ ind,unique(na.omit(cbind(stack(df),df[1]))[-1]),length)
#           ind user_id
#1      user_id       4
#2     event_id       4
#3      plan_id       3
#4 treatment_id       3
#5       system       2

3：tidyverse

library(dplyr)
library(tidyr)

df %>%
    mutate(key = user_id) %>%
    pivot_longer(!key) %>%
    filter(!is.na(value)) %>%
    group_by(name) %>%
    summarise(value = n_distinct(key)) %>%
    pivot_wider()
# # A tibble: 1 x 5
#  event_id plan_id system treatment_id user_id
#     <int>   <int>  <int>        <int>   <int>
#1        4       3      2            3       4

Thanks @dcarlson, I misunderstood the question:

   apply(df,2,function(x){length(unique(df[!is.na(x),1]))})

An option with uniqueN from data.table:

> library(data.table)
> setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
   user_id event_id plan_id treatment_id system
1:       4        4       3            3      2

With dplyr, you can use summarise together with across:

library(dplyr)
df %>% summarise(across(everything(), ~ n_distinct(user_id[!is.na(.x)])))

#  user_id event_id plan_id treatment_id system
#1       4        4       3            3      2
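
Since the question mentions applying this to a list of data frames, one possible way to package the base R approach is sketched below. The helper name `count_distinct_ids`, its `id_col` argument, and the example list `df_list` are all hypothetical, and the values are invented for illustration.

```r
# Hypothetical helper: for each column, count distinct ids among rows
# where that column is not NA. id_col names the id column (assumed "user_id").
count_distinct_ids <- function(data, id_col = "user_id") {
  ids <- data[[id_col]]
  # vapply guarantees one integer per column
  vapply(data, function(x) length(unique(ids[!is.na(x)])), integer(1))
}

# Hypothetical list of data frames, values invented for illustration
df_list <- list(
  a = data.frame(user_id = c(1, 1, 2), plan_id = c(NA, 5, 6)),
  b = data.frame(user_id = c(3, 4, 4), plan_id = c(7, NA, NA))
)

# One named integer vector per data frame
results <- lapply(df_list, count_distinct_ids)
print(results)
# a: user_id 2, plan_id 2
# b: user_id 2, plan_id 1
```

Using `vapply` rather than `sapply` is a design choice here: it fails loudly if a column ever yields something other than a single integer, which may be helpful when the function is run over many data frames.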
