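New column using ifelse based on partial string content

Problem description

I am trying to create a new column, ref_gene_name, based on the contents of the existing column ref_transcript_name. Some entries in ref_transcript_name contain variant information (e.g. ",transcript variant X1"), but the variant number changes throughout the dataset (e.g. X1, X2, etc.). I want the new column to contain everything before the transcript variant string if a variant is present, or otherwise just the unchanged contents of ref_transcript_name. My plan was to use an ifelse() statement, but because the number after the "X" in the variant varies, I can't get it to work.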

This is what I want to produce:

[image of the desired output not available]

Here is a subset of my data:

> dput(test)
structure(list(ref_gene_id = c(NA,NA,"LOC108906575","LOC108906574","LOC108906571","LOC108906589","LOC108906588","LOC108906578","LOC108906579"),qry_gene_id = structure(c(1L,7L,8L,9L,10L,11L,12L,13L,14L,2L,3L,4L,5L,6L),.Label = c("G1","G10","G11","G12","G13","G14","G2","G3","G4","G5","G6","G7","G8","G9"),class = "factor"),ref_transcript_id = c("unkNown_transcript_1","unkNown_transcript_1","XM_018709876.1","XM_018709875.1","XM_018709871.1","XM_018709894.2","XM_018709891.1","XM_018709878.1","XM_018709879.1","XM_018709881.2"),qry_transcript_id = structure(c(1L,19L,20L,21L,22L,23L,24L,25L,6L,15L,16L,17L,18L),.Label = c("TU1","TU10","TU11","TU12","TU13","TU14","TU15","TU16","TU17","TU18","TU19","TU2","TU20","TU21","TU22","TU23","TU24","TU25","TU3","TU4","TU5","TU6","TU7","TU8","TU9"),ref_transcript_name = structure(c(NA,1L,5L),.Label = c("ephrin type-B receptor 1-B,transcript variant X2","fork head domain transcription factor slp2-like","peroxisomal biogenesis factor 19,transcript variant X1","ribosomal RNA processing protein 1 homolog","uncharacterized LOC108906571","uncharacterized LOC108906589"),class = "factor")),row.names = c(NA,25L),class = "data.frame")

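Solution

You need a regular expression, not ifelse. The following looks for:

  • ,? an optional comma
  •  * (a space followed by an asterisk) zero or more spaces (after the comma)
  • transcript variant X, the literal text
  • [0-9]* zero or more digits; if there is always at least one digit, you could also use [0-9]+ (a plus instead of the asterisk)

and replaces all of that with the empty string "" (i.e., removes it from each string).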
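As a minimal sketch of the pattern described above, here it is applied to a few hand-made strings in base R. The vector x is made up for illustration and is not the asker's data frame; one string has a space after the comma and one does not, to show that both forms are handled.

```r
# Hypothetical sample strings, not taken from the asker's data frame
x <- c("ephrin type-B receptor 1-B, transcript variant X2",
       "peroxisomal biogenesis factor 19,transcript variant X1",
       "uncharacterized LOC108906571",
       NA)

# Remove the optional comma, optional spaces, the literal text and the digits.
# Strings without a match, and NA values, are returned unchanged.
cleaned <- sub(",? *transcript variant X[0-9]*", "", x)
print(cleaned)
```

Note that sub() replaces only the first match in each string, which is sufficient here because the variant text appears at most once per name.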

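Note: regular expressions are powerful and can be confusing. If written carelessly, they can be too greedy and change or remove more than you intended. A good strategy is to make the pattern as specific as possible. In this case, the difference between [0-9]* (zero or more) and [0-9]+ (one or more) is small. As a counter-example, if the transcript variant text must follow a comma (and a string that starts with it should be left alone), you could change ,? to a plain , instead. Just a thought.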
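To make the trade-off concrete, the sketch below compares three variants of the pattern on made-up strings (the vector x and the variable names are illustrative assumptions, not part of the original data). It shows where the optional comma and the optional digits change the result.

```r
# Hypothetical strings chosen to show where the pattern variants differ
x <- c("gene A, transcript variant X1",    # typical case
       "gene B, transcript variant X",     # no digit after the X
       "transcript variant X3 of gene C")  # no leading comma

loose  <- sub(",? *transcript variant X[0-9]*", "", x)  # comma and digits optional
digits <- sub(",? *transcript variant X[0-9]+", "", x)  # at least one digit required
strict <- sub(", *transcript variant X[0-9]+",  "", x)  # comma and digit both required

print(data.frame(x, loose, digits, strict))
```

The loosest pattern edits all three strings, while the strictest one only touches the typical case and leaves the other two as they are.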

# zz is assumed to hold the data frame from the question (called test there)
zz$ref_gene_name <- sub(",? *transcript variant X[0-9]*", "", as.character(zz$ref_transcript_name))
zz[, 5:6]
#                                        ref_transcript_name                                   ref_gene_name
# 1                                                     <NA>                                            <NA>
# 2                                                     <NA>                                            <NA>
# 3                                                     <NA>                                            <NA>
# 4                                                     <NA>                                            <NA>
# 5                                                     <NA>                                            <NA>
# 6                                                     <NA>                                            <NA>
# 7          fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 8                                                     <NA>                                            <NA>
# 9          fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 10                            uncharacterized LOC108906571                    uncharacterized LOC108906571
# 11                            uncharacterized LOC108906589                    uncharacterized LOC108906589
# 12                            uncharacterized LOC108906589                    uncharacterized LOC108906589
# 13       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 14       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 15       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 16       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 17       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 18       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 19       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 20       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# 21                                                    <NA>                                            <NA>
# 22                                                    <NA>                                            <NA>
# 23 peroxisomal biogenesis factor 19,transcript variant X1                peroxisomal biogenesis factor 19
# 24 peroxisomal biogenesis factor 19,transcript variant X2                peroxisomal biogenesis factor 19
# 25              ribosomal RNA processing protein 1 homolog      ribosomal RNA processing protein 1 homolog
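For comparison, the ifelse() plan from the question can also be made to work by testing for the variant text with grepl(), which accepts the same regular expression. This is only a sketch on made-up strings (x, pattern, and via_ifelse are illustrative names); it gives the same result as calling sub() directly, which is why the ifelse() wrapper is not needed.

```r
# Hypothetical sample strings, not the asker's data frame
x <- c("ephrin type-B receptor 1-B, transcript variant X2",
       "ribosomal RNA processing protein 1 homolog",
       NA)

pattern <- ",? *transcript variant X[0-9]+"

# grepl() returns TRUE where the variant text is present (FALSE for NA)
has_variant <- grepl(pattern, x)

# Strip the variant text where present, otherwise keep the original value
via_ifelse <- ifelse(has_variant, sub(pattern, "", x), x)

# sub() alone already leaves non-matching strings and NA untouched
via_sub <- sub(pattern, "", x)
print(identical(via_ifelse, via_sub))
```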

In this case, another regex pattern can solve your problem.

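library(dplyr)    # provides %>% and mutate()
library(stringr)  # provides str_replace() and regex()

# Strip ",transcript variant X<digits>" from each name; other names pass through unchanged
test %>%
  mutate(ref_gene_name = str_replace(ref_transcript_name, regex(",transcript variant X\\d{1,}"), ""))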
#                                        ref_transcript_name                                   ref_gene_name
# 1                                                     <NA>                                            <NA>
# 2                                                     <NA>                                            <NA>
# 3                                                     <NA>                                            <NA>
# 4                                                     <NA>                                            <NA>
# 5                                                     <NA>                                            <NA>
# 6                                                     <NA>                                            <NA>
# 7          fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 8                                                     <NA>                                            <NA>
# 9          fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 10                            uncharacterized LOC108906571                    uncharacterized LOC108906571
# 11                            uncharacterized LOC108906589                    uncharacterized LOC108906589
# 12                            uncharacterized LOC108906589                    uncharacterized LOC108906589
# 13       ephrin type-B receptor 1-B,transcript variant X2                      ephrin type-B receptor 1-B
# ...
# 25              ribosomal RNA processing protein 1 homolog      ribosomal RNA processing protein 1 homolog
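The quantifier \\d{1,} means "one or more digits", which is the same as \\d+ or [0-9]+. The base R sketch below checks this on made-up strings (x, a, and b are illustrative names, and sub() is used here in place of str_replace() so that no packages are required).

```r
# Hypothetical sample strings, not the asker's data frame
x <- c("peroxisomal biogenesis factor 19,transcript variant X1",
       "peroxisomal biogenesis factor 19,transcript variant X12",
       "fork head domain transcription factor slp2-like")

a <- sub(",transcript variant X\\d{1,}", "", x)  # quantifier written as {1,}
b <- sub(",transcript variant X[0-9]+",  "", x)  # equivalent character-class form

print(a)
print(identical(a, b))
```

Unlike the first answer's pattern, this one requires the comma and allows no space after it, so it is the more specific of the two.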