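Extracting sentences that contain a set of words using R

Problem description

I am looking for companies that relocated their headquarters. They usually disclose these details in their SEC filings. For example, one filing contains the following text: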

text <- "**<PAGE>   13
As a result of recurring losses in the UK operation,the Board of Directors
announced,during the first quarter of fiscal year 1999,the approval of a plan
to wind-down and discontinue the UK operation. The wind-down was completed in
May 1999. In addition,on September 30,1998,the Company relocated its
corporate headquarters from
Wayne,Pennsylvania to Orlando,Florida. As a result of the wind-down of the UK operation and
the relocation of the corporate headquarters,the Company recorded charges of
approximately $3.5 million during fiscal year 1999. These charges primarily
relate to employee termination benefits and lease termination costs.**"

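I am trying to extract the sentences that contain both "relocat" and "headquarter" within the same sentence. In this case, those sentences are "[1] In addition, on September 30, 1998, the Company relocated its corporate headquarters from Wayne, Pennsylvania to Orlando, Florida." and "[2] As a result of the wind-down of the UK operation and the relocation of the corporate headquarters, the Company recorded charges of approximately $3.5 million during fiscal year 1999."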

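I tried grepl and gsub, but grepl only returns TRUE or FALSE, and gsub returns the entire text. Could you help me extract these two sentences? Below are the grepl and gsub statements I used. Thank you.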

grepl("relocat[^\\.,!?:;]*headquarter|headquarter[^\\.,!?:;]*relocat",text)

gsub(".*?([^\\.]*(relocat*headquarter|headquarter*relocat)[^\\.]*).*","\\1",text,ignore.case=T,fixed=F)

Solution

# Split the text into sentences at each period that is followed by whitespace
texts <- strsplit(text,"\\.[[:space:]]+")[[1]]
texts
# [1] "**<PAGE>   13\nAs a result of recurring losses in the UK operation,the Board of Directors\nannounced,during the first quarter of fiscal year 1999,the approval of a plan\nto wind-down and discontinue the UK operation"
# [2] "The wind-down was completed in\nMay 1999"                                                                                                                                                                                  
# [3] "In addition,on September 30,1998,the Company relocated its\ncorporate headquarters from\nWayne,Pennsylvania to Orlando,Florida"                                                                                       
# [4] "As a result of the wind-down of the UK operation and\nthe relocation of the corporate headquarters,the Company recorded charges of\napproximately $3.5 million during fiscal year 1999"                                   
# [5] "These charges primarily\nrelate to employee termination benefits and lease termination costs.**"                                                                                                                           
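
The split sentences still carry the line breaks of the filing. If single-line output is preferred, a minimal sketch (the names `demo` and `clean` are hypothetical, and the two strings are copied from the output above) is to collapse each run of whitespace into one space:

```r
# Two sentences copied from the split output, embedded newlines included.
demo <- c("The wind-down was completed in\nMay 1999",
          "In addition,on September 30,1998,the Company relocated its\ncorporate headquarters from\nWayne,Pennsylvania to Orlando,Florida")

# Replace every run of whitespace (spaces, tabs, newlines) with one space,
# then trim leading and trailing whitespace.
clean <- trimws(gsub("[[:space:]]+", " ", demo))
clean
```

This only changes how the sentences are displayed; pattern matching with grepl works the same on either form.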

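# Keep the sentences that mention "headquarter" and also match at least one
# of the patterns "record" or "relocat" (the latter two ignoring case)
texts[grepl("headquarter",texts) &
        rowSums(t(outer(c("record","relocat"),texts,Vectorize(grepl),ignore.case = TRUE))) > 0]
# [1] "In addition,on September 30,1998,the Company relocated its\ncorporate headquarters from\nWayne,Pennsylvania to Orlando,Florida"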
# [2] "As a result of the wind-down of the UK operation and\nthe relocation of the corporate headquarters,the Company recorded charges of\napproximately $3.5 million during fiscal year 1999"
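
The filter above accepts a sentence that mentions "headquarter" together with either "record" or "relocat". If the goal is strictly the one stated in the question, that is, every listed word must appear in the same sentence, a simpler variant may be enough. The following is a minimal sketch, not the answer's method: the helper name `sentences_with_all` and the variable `sample_text` are assumptions made for illustration, and the sample is a shortened excerpt of the filing text.

```r
# Hypothetical helper: return the sentences that match every pattern
# in `patterns`, ignoring case.
sentences_with_all <- function(text, patterns) {
  # Split at each period followed by whitespace, as in the answer above.
  sentences <- strsplit(text, "\\.[[:space:]]+")[[1]]
  # One logical vector per pattern, then combine them element-wise with AND.
  hits <- lapply(patterns, grepl, x = sentences, ignore.case = TRUE)
  sentences[Reduce(`&`, hits)]
}

# Shortened excerpt of the filing text, for illustration only.
sample_text <- paste(
  "The wind-down was completed in May 1999.",
  "In addition,on September 30,1998,the Company relocated its corporate headquarters from Wayne,Pennsylvania to Orlando,Florida.",
  "These charges primarily relate to employee termination benefits and lease termination costs."
)

result <- sentences_with_all(sample_text, c("relocat", "headquarter"))
result
# [1] "In addition,on September 30,1998,the Company relocated its corporate headquarters from Wayne,Pennsylvania to Orlando,Florida"
```

Because the patterns are passed as a vector, adding a third required word only means extending that vector. Like the answer's split, this sketch treats every period followed by whitespace as a sentence boundary, so abbreviations such as "Inc. " would be split as well.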