How to keep only the text after a specific tag and insert 0 for the other rows

Problem description

Data

data.frame(id = c(1,2),text = c("something here <h1>my text</h1> also <h1>Keep it</h1>","<h1>title</h1> another here"))

How can I keep the text that follows the tag <h1>my text</h1>, up to the start of the next tag, and insert 0 if the row does not contain that tag?

Example output

data.frame(id = c(1,2),text = c("also",0))

Solution

In a regular expression you can use lookahead and lookbehind assertions. With the data named df:

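library(stringr)
# Keep the text between a closing </h1> and the next opening <h1>; rows without a match become NA
df$text <- str_extract(df$text,pattern = "(?<=</h1>)(.*)(?=<h1>)")
ifelse(is.na(df$text),"0",trimws(df$text))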

[1] "also" "0"
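Note that the pattern above matches after any closing </h1>, not specifically after <h1>my text</h1>. If the tag itself has to match, a base R variant might look like the sketch below. The helper name extract_after_tag and its tag argument are hypothetical, introduced only for illustration, and the sketch assumes the wanted text contains no "<" character.

```r
# Hypothetical helper (not part of any package): return the text that follows
# `tag` up to the next "<", or "0" when `tag` does not occur in the string.
extract_after_tag <- function(x, tag = "<h1>my text</h1>") {
  x <- as.character(x)  # guard against factor columns (R < 4.0 defaults)
  # \Q...\E quotes the tag literally; (?<=...) is a lookbehind, (?=<) a lookahead
  pattern <- paste0("(?<=\\Q", tag, "\\E)[^<]*(?=<)")
  hit <- regexpr(pattern, x, perl = TRUE)
  out <- rep("0", length(x))                     # default for rows without a match
  out[hit != -1] <- trimws(regmatches(x, hit))   # fill in the matched rows only
  out
}

df <- data.frame(
  id = c(1, 2),
  text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>",
    "<h1>title</h1> another here"
  )
)
extract_after_tag(df$text)
## [1] "also" "0"
```

Building the pattern from the tag keeps the lookbehind fixed-length, which PCRE requires, and avoids loading any package beyond base R.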

You can also do this in quanteda with a couple of corpus_segment() calls:

df <- data.frame(
  id = c(1,2),text = c(
    "something here <h1>my text</h1> also <h1>Keep it</h1>","<h1>title</h1> another here"
  )
)

library("quanteda",warn.conflicts = FALSE)
## Package version: 2.1.1

corp <- df %>%
  corpus(docid_field = "id") %>%
  corpus_segment("<h1>my text</h1>",pattern_position = "before") %>%
  corpus_segment("<h1>",pattern_position = "after")

Now we can get the 0 by joining the result against the full sequence of IDs and converting all the non-matches (NAs) to 0:

library("dplyr",warn.conflicts = FALSE)
convert(corp,to = "data.frame") %>%
  rename(id = doc_id) %>%
  select(id,text) %>%
  mutate(id = as.integer(id)) %>%
  right_join(data.frame(id = 1:2)) %>%
  tidyr::replace_na(list(text = "0"))  # text is a character column, so replace NA with the string "0"
## Joining,by = "id"
##   id text
## 1  1 also
## 2  2    0
