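What is the best way to subset a text like this by keywords?

Problem description

I have a data frame containing N long texts. I want to extract subsets of these texts based on a few key phrases, as efficiently as possible.

Let me give an example. df is a data frame containing a single long text, as follows: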

df = data.frame("Based on our regular economic and monetary analyses,we decided to keep the key ECB interest rates unchanged. We continue to expect them to remain at present or lower levels for an extended period of time,and well past the horizon of our net asset purchases. Regarding non-standard monetary policy measures,we confirm that the monthly asset purchases of €80 billion are intended to run until the end of march 2017,or beyond,if necessary,and in any case until the Governing Council sees a sustained adjustment in the path of inflation consistent with its inflation aim. Today,we assessed the economic and monetary data which had become available since our last meeting and discussed the new ECB staff macroeconomic projections. Overall,while the available evidence so far suggests resilience of the euro area economy to the continuing global economic and political uncertainty,our baseline scenario remains subject to downside risks. Our comprehensive policy measures continue to ensure supportive financing conditions and underpin the momentum of the euro area economic recovery. As a result,we continue to expect real GDP to grow at a moderate but steady pace and euro area inflation to rise gradually over the coming months,in line with the path already implied in our June 2016 staff projections. The Governing Council will continue to monitor economic and financial market developments very closely. We will preserve the very substantial amount of monetary support that is embedded in our staff projections and that is necessary to secure a return of inflation to levels below,but close to,2% over the medium term. If warranted,we will act by using all the instruments available within our mandate. Meanwhile,the Governing Council tasked the relevant committees to evaluate the options that ensure a smooth implementation of our purchase programme. Let me Now explain our assessment in greater detail,starting with the economic analysis. 
Euro area real GDP increased by 0.3%,quarter on quarter,in the second quarter of 2016,after 0.5% in the first quarter. Incoming data point to ongoing growth in the third quarter of 2016,at around the same rate as in the second quarter. Looking ahead,we continue to expect the economic recovery to proceed at a moderate but steady pace. Domestic demand remains supported by the pass-through of our monetary policy measures to the real economy. Favourable financing conditions and improvements in the demand outlook and in corporate profitability continue to promote a recovery in investment. Sustained employment gains,which are also benefiting from past structural reforms,and still relatively low oil prices provide additional support for households’ real disposable income and thus for private consumption. In addition,the fiscal stance in the euro area is expected to be mildly expansionary in 2016 and to turn broadly neutral in 2017 and 2018. However,the economic recovery in the euro area is expected to be dampened by still subdued foreign demand,partly related to the uncertainties following the UK referendum outcome,the necessary balance sheet adjustments in a number of sectors and a sluggish pace of implementation of structural reforms. The risks to the euro area growth outlook remain tilted to the downside and relate mainly to the external environment. This assessment is broadly reflected in the September 2016 ECB staff macroeconomic projections for the euro area,which foresee annual real GDP increasing by 1.7% in 2016,by 1.6% in 2017 and by 1.6% in 2018. Compared with the June 2016 Eurosystem staff macroeconomic projections,the outlook for real GDP growth has been revised downwards slightly. According to Eurostat’s flash estimate,euro area annual HICP inflation in August 2016 was 0.2%,unchanged from July. While annual energy inflation continued to rise,services and non-energy industrial goods inflation turned out to be slightly lower than in July. 
Looking ahead,on the basis of current oil futures prices,inflation rates are likely to remain low over the next few months before starting to pick up towards the end of 2016,in large part owing to base effects in the annual rate of change of energy prices. Supported by our monetary policy measures and the expected economic recovery,inflation rates should increase further in 2017 and 2018.",stringsAsFactors = F)

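What I would like to get is a data frame or corpus that contains only a subset of this text; specifically, I would like to obtain this part: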


aim = data.frame("Today,our baseline scenario remains subject to downside risks. Let me Now explain our assessment in greater detail,the outlook for real GDP growth has been revised downwards slightly.",stringsAsFactors = F)

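The text above lies between certain key phrases, for example: between "Today,we assessed the economic and monetary data" and "in line with the path already implied in our June 2016 staff projections."; and between "Let me Now explain our assessment in greater detail,starting with the economic analysis." and "the outlook for real GDP growth has been revised downwards slightly."

I would like a flexible and efficient way to capture the text between several pairs of phrases at once.

Can anyone help me?

Thank you very much!

Solution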

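You can call the gregexpr function twice to find the positions of the two substrings within the string, and then use the substr function to extract the part of the string between these two "edges".

In your example you put one very long string in the data.frame. To keep the demonstration simple, I use a short example and define a new function called extract_between that does the hard work for you: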
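
As a rough illustration of the two steps described above, the following sketch runs them by hand on a made-up toy string (the string and the markers START and END are assumptions for illustration, not part of the question's data). It assumes each marker occurs exactly once.

```r
# Toy string invented for this sketch; not the ECB text from the question.
s <- "alpha START middle END omega"

# gregexpr() returns a list with one element per input string;
# each element holds the start position(s) of the matches.
start_pos <- gregexpr("START", s, fixed = TRUE)[[1]]
end_pos   <- gregexpr("END", s, fixed = TRUE)[[1]]

as.integer(start_pos)  # 7
as.integer(end_pos)    # 20

# Keep everything after the left marker and before the right marker.
between <- substr(s, start_pos + nchar("START"), end_pos - 1)
between                # " middle "
```

Note that the surrounding spaces are kept, because substr only cuts at the marker boundaries; include the spaces in the markers or call trimws() on the result if they are unwanted.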

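extract_between <- function(text,left,right) {
  # Position of the first character after the left phrase
  left <- gregexpr(text,pattern=left,fixed=TRUE)[[1]] + nchar(left)
  # Position of the last character before the right phrase
  right <- gregexpr(text,pattern=right,fixed=TRUE)[[1]] - 1
  substr(text,left,right)
}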

x <- "Today,we assessed the economic and monetary data"

extract_between(x,left="Today,we ",right=" data")
#> [1] "assessed the economic and monetary"
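
Since the question asks for the text between several pairs of phrases at once, the same idea can be applied pair by pair. The sketch below is one possible way to do that, not necessarily what the answer's author had in mind: the helper extract_first_between, the toy text txt, and the pairs table are all hypothetical names and data introduced here. It uses regexpr (first match only) and returns NA when a phrase is missing.

```r
# Hypothetical helper: text between the first `left` phrase and the
# first `right` phrase that follows it; NA if either phrase is absent.
extract_first_between <- function(text, left, right) {
  l <- as.integer(regexpr(left, text, fixed = TRUE))
  if (l == -1L) return(NA_character_)
  # Search for the right phrase only in what comes after the left phrase.
  rest <- substr(text, l + nchar(left), nchar(text))
  r <- as.integer(regexpr(right, rest, fixed = TRUE))
  if (r == -1L) return(NA_character_)
  substr(rest, 1, r - 1)
}

# Toy text and phrase pairs, invented purely for illustration.
txt <- "Intro. Today,we met and agreed on A. Filler. Let me explain B in detail. End."
pairs <- data.frame(
  left  = c("Today,we ", "Let me "),
  right = c(" on A.", " in detail."),
  stringsAsFactors = FALSE
)

# One extraction per row of `pairs`.
pieces <- mapply(extract_first_between,
                 left = pairs$left, right = pairs$right,
                 MoreArgs = list(text = txt), USE.NAMES = FALSE)
pieces
#> [1] "met and agreed" "explain B"

# Join the pieces into a single string if one combined text is wanted.
combined <- paste(pieces, collapse = " ")
```

For a data frame with N texts, the same call could be wrapped in lapply over the text column, giving one character vector of pieces per text.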
