Problem description
Data
data.frame(id = c(1,2),text = c("something here <h1>my text</h1> also <h1>Keep it</h1>","<h1>title</h1> another here"))
How can I keep the text that follows the tag <h1>my text</h1> up to the start of the next tag, and insert 0 if the tag is not present in the row?
Example output
data.frame(id = c(1,2),text = c("also",0))
Solution
In a regular expression you can use lookahead and lookbehind. With the data named df:
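library(stringr)
# Keep what lies between a closing </h1> and a following opening <h1>
df$text <- str_extract(df$text, pattern = "(?<=</h1>)(.*)(?=<h1>)")
# Rows without a match are NA; report them as "0"
ifelse(is.na(df$text), "0", trimws(df$text))
[1] "also" "0"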
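The same lookbehind/lookahead idea can also be sketched in base R, without stringr. This is only an illustration under a few assumptions: the names `pattern`, `m` and `result` are made up here, and the lazy quantifier `.*?` is a swapped-in variation (not part of the answer above) that stops at the first following <h1> rather than the last one.

```r
# Sample data from the question
df <- data.frame(
  id = c(1, 2),
  text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>",
    "<h1>title</h1> another here"
  ),
  stringsAsFactors = FALSE
)

# Lookbehind anchors the match right after a closing </h1>;
# the lazy .*? stops at the first following <h1> (lookahead).
pattern <- "(?<=</h1>).*?(?=<h1>)"

# regexpr() returns -1 for rows without a match (perl = TRUE enables lookarounds)
m <- regexpr(pattern, df$text, perl = TRUE)

# Start with "0" everywhere, then fill in the rows that matched
result <- rep("0", nrow(df))
result[m != -1] <- trimws(regmatches(df$text, m))
result
```

For the sample data the greedy and the lazy version should give the same result, because each row has at most one <h1> after the first </h1>.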
Alternatively, you can do this in quanteda with a couple of corpus_segment() calls:
df <- data.frame(
  id = c(1, 2),
  text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>",
    "<h1>title</h1> another here"
  )
)
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.1
corp <- df %>%
  corpus(docid_field = "id") %>%
  corpus_segment("<h1>my text</h1>", pattern_position = "before") %>%
  corpus_segment("<h1>", pattern_position = "after")
Now we can get the 0 by joining the result to the full sequence of IDs and converting every non-match (the NAs) to 0:
library("dplyr", warn.conflicts = FALSE)
convert(corp, to = "data.frame") %>%
  rename(id = doc_id) %>%
  select(id, text) %>%
  mutate(id = as.integer(id)) %>%
  right_join(data.frame(id = 1:2)) %>%
  tidyr::replace_na(list(text = 0))
## Joining, by = "id"
##   id text
## 1  1 also
## 2  2    0
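The join-and-fill step can be illustrated on its own in base R. This is a minimal sketch, not the answer's method: `matched` is a hypothetical stand-in for the converted corpus (only id 1 produced a segment), and `all_ids` and `out` are names made up for this example.

```r
# Hypothetical stand-in for the segmented result: only id 1 had a match
matched <- data.frame(id = 1L, text = "also", stringsAsFactors = FALSE)

# Full list of ids that must appear in the output
all_ids <- data.frame(id = 1:2)

# all.x = TRUE keeps every id; ids without a match get NA in `text`
out <- merge(all_ids, matched, by = "id", all.x = TRUE)

# Replace the NAs with "0" (a character value, to match the column type)
out$text[is.na(out$text)] <- "0"
out
```

Filling with the string "0" rather than the number 0 keeps the text column a plain character vector.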