Problem description
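I am trying to create a new column, ref_gene_name, in my data frame based on the contents of the existing column ref_transcript_name. Some entries in ref_transcript_name contain variant information (e.g. ",transcript variant X1"), but the variant number changes throughout the dataset (e.g. X1, X2, etc.). I would like the new column to contain everything before the transcript variant string if a variant is present, or otherwise just the unchanged ref_transcript_name value. My plan was to use an ifelse() statement, but I cannot get it to work because the number after the "X" in the variant string varies.
Here is a subset of my data: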
> dput(test)
structure(list(ref_gene_id = c(NA,NA,"LOC108906575","LOC108906574","LOC108906571","LOC108906589","LOC108906588","LOC108906578","LOC108906579"),qry_gene_id = structure(c(1L,7L,8L,9L,10L,11L,12L,13L,14L,2L,3L,4L,5L,6L),.Label = c("G1","G10","G11","G12","G13","G14","G2","G3","G4","G5","G6","G7","G8","G9"),class = "factor"),ref_transcript_id = c("unkNown_transcript_1","unkNown_transcript_1","XM_018709876.1","XM_018709875.1","XM_018709871.1","XM_018709894.2","XM_018709891.1","XM_018709878.1","XM_018709879.1","XM_018709881.2"),qry_transcript_id = structure(c(1L,19L,20L,21L,22L,23L,24L,25L,6L,15L,16L,17L,18L),.Label = c("TU1","TU10","TU11","TU12","TU13","TU14","TU15","TU16","TU17","TU18","TU19","TU2","TU20","TU21","TU22","TU23","TU24","TU25","TU3","TU4","TU5","TU6","TU7","TU8","TU9"),ref_transcript_name = structure(c(NA,1L,5L),.Label = c("ephrin type-B receptor 1-B,transcript variant X2","fork head domain transcription factor slp2-like","peroxisomal biogenesis factor 19,transcript variant X1","ribosomal RNA processing protein 1 homolog","uncharacterized LOC108906571","uncharacterized LOC108906589"),class = "factor")),row.names = c(NA,25L),class = "data.frame")
Solution
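You need a regular expression, not ifelse. The pattern used below looks for:
- ",?" an optional comma
- " *" (a space followed by an asterisk) zero or more spaces (after the comma)
- "transcript variant X" literal text
- "[0-9]*" zero or more digits; if there is always at least one digit, you can use "[0-9]+" instead (a plus in place of the asterisk)
and replaces all of it with the empty string "" (i.e., removes it from each string).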
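As a quick check of the pattern pieces listed above, here is a minimal sketch on a few made-up strings. The vector x and its contents are hypothetical, modeled on the values in the question; only sub() from base R is used.

```r
# Hypothetical sample strings modeled on the question's data
x <- c("ephrin type-B receptor 1-B,transcript variant X2",        # comma, no space
       "peroxisomal biogenesis factor 19, transcript variant X1", # comma plus space
       "uncharacterized LOC108906571",                            # no variant suffix
       NA)                                                        # missing value

# optional comma, optional spaces, literal text, then zero or more digits
pattern <- ",? *transcript variant X[0-9]*"

# sub() removes the first match in each string; strings without a match
# and NA values come back unchanged
res <- sub(pattern, "", x)
print(res)
```

Strings without the variant suffix pass through untouched, which is why no ifelse() is needed around the call.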
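Note: regular expressions are powerful and can be confusing. If written carelessly they can be too greedy and change or remove more than you intended. One good strategy is to make the pattern as specific as possible. In this case, the difference between [0-9]* (zero or more) and [0-9]+ (one or more) is minor. As another example, if the transcript variant text must always follow a comma (and a string that begins with that text should not be altered), you can change ,? to a plain comma. Just a thought.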
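To make the point about specificity concrete, the sketch below compares the looser pattern with a stricter variant (required comma, at least one digit). The input strings are invented for illustration and do not come from the question's data.

```r
# Hypothetical edge cases, not taken from the original data
x <- c("gene A,transcript variant X12",    # the usual case
       "gene B,transcript variant X",      # no digits after the X
       "transcript variant X3 of gene C")  # no comma before the text

# Looser pattern: comma optional, digits optional
loose <- sub(",? *transcript variant X[0-9]*", "", x)

# Stricter pattern: comma required, at least one digit required
strict <- sub(", *transcript variant X[0-9]+", "", x)

print(data.frame(x, loose, strict))
```

The looser pattern edits all three strings, while the stricter one changes only the first. Which behavior is right depends on what the real data can contain.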
zz$ref_gene_name <- sub(",? *transcript variant X[0-9]*","",as.character(zz$ref_transcript_name))
zz[,5:6]
# ref_transcript_name ref_gene_name
# 1 <NA> <NA>
# 2 <NA> <NA>
# 3 <NA> <NA>
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 <NA> <NA>
# 7 fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 8 <NA> <NA>
# 9 fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 10 uncharacterized LOC108906571 uncharacterized LOC108906571
# 11 uncharacterized LOC108906589 uncharacterized LOC108906589
# 12 uncharacterized LOC108906589 uncharacterized LOC108906589
# 13 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 14 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 15 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 16 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 17 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 18 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 19 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 20 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# 21 <NA> <NA>
# 22 <NA> <NA>
# 23 peroxisomal biogenesis factor 19,transcript variant X1 peroxisomal biogenesis factor 19
# 24 peroxisomal biogenesis factor 19,transcript variant X2 peroxisomal biogenesis factor 19
# 25 ribosomal RNA processing protein 1 homolog ribosomal RNA processing protein 1 homolog
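For comparison, the ifelse() plan from the question can be made to work with grepl(), although it should give the same result as calling sub() directly. This is only a sketch on hypothetical strings; the names pat, via_ifelse and via_sub are made up for illustration.

```r
# Hypothetical strings: one with a variant suffix, one without, one missing
x <- c("ephrin type-B receptor 1-B,transcript variant X2",
       "ribosomal RNA processing protein 1 homolog",
       NA)

pat <- ",? *transcript variant X[0-9]+"

# ifelse() version: strip the suffix only where the pattern is found
via_ifelse <- ifelse(grepl(pat, x), sub(pat, "", x), x)

# direct version: sub() already leaves non-matching strings alone
via_sub <- sub(pat, "", x)

print(identical(via_ifelse, via_sub))
```

Because the regular expression handles the varying digit itself, the conditional adds nothing here.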
In this case, another regular expression pattern can also solve your problem.
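library(dplyr)    # provides %>% and mutate()
library(stringr)  # provides str_replace() and regex()
test %>%
  # as.character() guards against ref_transcript_name being a factor
  mutate(ref_gene_name = str_replace(as.character(ref_transcript_name), regex(",transcript variant X\\d{1,}"), ""))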
# ref_transcript_name ref_gene_name
# 1 <NA> <NA>
# 2 <NA> <NA>
# 3 <NA> <NA>
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 <NA> <NA>
# 7 fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 8 <NA> <NA>
# 9 fork head domain transcription factor slp2-like fork head domain transcription factor slp2-like
# 10 uncharacterized LOC108906571 uncharacterized LOC108906571
# 11 uncharacterized LOC108906589 uncharacterized LOC108906589
# 12 uncharacterized LOC108906589 uncharacterized LOC108906589
# 13 ephrin type-B receptor 1-B,transcript variant X2 ephrin type-B receptor 1-B
# ...
# 25 ribosomal RNA processing protein 1 homolog ribosomal RNA processing protein 1 homolog