问题描述
在标题(或谷歌)中简洁地描述这是一个棘手的问题。我有一个分类表,其中某些列可能会根据置信度被列为“已删除”。我想将任何显示“已删除”的列替换为“未识别”,然后是第一列中没有的值以行方式说“dropped”。因此,输入将如下所示:
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota dropped dropped dropped
#> 2 Eukaryota dropped dropped dropped
#> 3 Eukaryota dropped dropped dropped
#> 4 Eukaryota dropped dropped dropped
#> 5 Eukaryota dropped dropped dropped
#> 6 Eukaryota dropped dropped dropped
#> 7 Eukaryota Hexanauplia Calanoida dropped
#> 8 Eukaryota dropped dropped dropped
#> 9 Eukaryota Dinophyceae Syndiniales dropped
#> 10 Animals polychaeta Terebellida dropped
#> 11 Eukaryota Acantharia Chaunacanthida dropped
#> 12 Eukaryota dropped dropped dropped
#> 13 Animals Ascidiacea Stolidobranchia dropped
#> 14 Eukaryota Haptophyta dropped dropped
#> 15 Eukaryota dropped dropped dropped
#> 16 Eukaryota dropped dropped dropped
#> 17 Eukaryota dropped dropped dropped
#> 18 Animals Ascidiacea Stolidobranchia dropped
#> 19 Eukaryota dropped dropped dropped
#> 20 Eukaryota dropped dropped dropped
输出应该是这样的:
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
我使用 purrr::pmap_dfr
想出了一个很好的解决方案,但我很想知道是否有更“纯粹”的 dplyr
方法来做到这一点?我的方法的一个缺陷是它不适用于第一个非“删除”列在一个或多个“删除”列之后的列(参见下面输出中的第 21 行)。这是我目前的解决方案:
library(tidyverse)
otu_table <- structure(list(domain = c("Eukaryota","Eukaryota","Animals","dropped"),class = c("dropped","dropped","Hexanauplia","Dinophyceae","polychaeta","Acantharia","Ascidiacea","Haptophyta","not dropped"),order = c("dropped","Calanoida","Syndiniales","Terebellida","Chaunacanthida","Stolidobranchia",species = c("dropped","dropped")),row.names = c(NA,-21L),class = c("tbl_df","tbl","data.frame"))
tax_data <- otu_table %>%
pmap_dfr(~{
items <- list(...)
first_dropped = match("dropped",items)
if (first_dropped > 1) {
dropped_name <- str_c("Unidentified ",items[first_dropped-1])
} else {
dropped_name <- "Unidentified"
}
items[-c(1:first_dropped-1)] <- dropped_name
items
})
print(tax_data,n=30)
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 21 dropped not dropped dropped dropped
更新:
下面有一些不错的答案。我接受了点赞最多的那个,但结果证明在通过 microbenchmark
运行所有建议后,purrr
解决方案的速度快了几乎一个数量级。
解决方法
我认为这的执行时间相当不错,但是,您可以自己尝试一下。我要感谢@IRTFM 关于将 dropped
值更改为 NA
的评论。我实际上使用了这个想法,但我决定更倾向于 dplyr
而不是 zoo
,因此我使用 na.locf
而不是 coalesce
用于此目的。
library(dplyr)
library(tidyr)
otu_table %>%
mutate(across(!domain,~ replace(.x,.x == "dropped",NA))) %>%
rowwise() %>%
mutate(output = list(coalesce(c_across(everything()),str_c("Unidentified",last(c_across(everything())[!is.na(c_across(everything()))]),sep = " ")))) %>%
select(output) %>%
unnest_wider(output) %>%
set_names(colnames(otu_table))
# A tibble: 21 x 4
domain class order species
<chr> <chr> <chr> <chr>
1 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
2 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
3 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
4 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
5 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
6 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
8 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota
9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
10 Animals Polychaeta Terebellida Unidentified Terebellida
# ... with 11 more rows
,
这是另一种方法,将 rowwise()
与 across()
结合使用。
- 我们使用
rowwise
是因为它有助于通过cur_data()
将一行用作单个向量 -
across(everything(),~)
帮助我们一次性改变所有列 -
max.col(cur_data() != 'dropped',ties.method = 'last')
将检索值为!= 'dropped'
的最后一列索引
- 我们将其列名存储在一个临时变量中,例如
x
- 最后,我们使用来自基 R 的
if()..else
只改变那些值为dropped
的列
希望答案足够清楚
library(tidyverse)
otu_table %>% rowwise() %>%
mutate(across(everything(),~ {x<- names(cur_data())[max.col(cur_data() != 'dropped',ties.method = 'last')];
if (. == 'dropped') paste0('unidentified ',get(x)) else . }))
#> # A tibble: 21 x 4
#> # Rowwise:
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida unidentified Calanoida
#> 8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales unidentified Syndinial~
#> 10 Animals Polychaeta Terebellida unidentified Terebelli~
#> # ... with 11 more rows
由 reprex package (v2.0.0) 于 2021 年 6 月 19 日创建
,这是使用 dplyr + tidyr::pivot_longer/wider 的方法。我认为它读起来很干净,但肯定有更简洁的方法。
otu_table %>%
mutate(across(class:species,~if_else(.x == "dropped",NA_character_,.x))) %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
group_by(row) %>%
mutate(value = if_else(is.na(value) & !is.na(lag(value)),paste("Unidentified",lag(value)),value)) %>%
fill(value) %>%
ungroup() %>%
pivot_wider(names_from = name,values_from = value)