Problem description
I am using rvest in R to scrape the text keywords from this article page with the following code:
#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management
# start by reading the page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")
This gives me:
> keyW
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"
After removing the word "Keywords" and everything before it from the string with this line of code:
keyW <- gsub(".*Keywords","",keyW)
the new keyW is:
[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"
However, the output I want is a character vector like this:
[1] "Physics curriculum" "Turkish education system" "finnish education system" "PISA" "physics achievement"
How should I solve this? I think it comes down to:
- how to scrape the keywords from the website correctly
- how to split the string correctly
Thanks
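Solution
If you select the span tags inside the keyword container, each keyword comes back as its own element, so you get the expected output directly.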
library(rvest)
page %>% html_nodes("div.Keywords span") %>% html_text()
#[1] "Physics curriculum" "Turkish education system" "finnish education system"
#[4] "PISA" "physics achievement"
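To see why selecting the span tags avoids the string-splitting problem, here is a minimal offline sketch. The HTML string below is a made-up simplification of the keyword block (the real page's markup may differ), and the names demo_html, page_demo, whole and keywords are only for illustration.

library(rvest)

# Hypothetical, simplified markup; an assumption, not the actual page source
demo_html <- '<div class="Keywords u-font-serif"><h2>Keywords</h2><div><span>Physics curriculum</span></div><div><span>PISA</span></div></div>'
page_demo <- read_html(demo_html)

# html_text() on the container joins the text of all descendants into one string
whole <- page_demo %>% html_nodes("div.Keywords") %>% html_text()

# Selecting the spans returns one node per keyword, so html_text() gives one string each
keywords <- page_demo %>% html_nodes("div.Keywords span") %>% html_text()

In this sketch whole is a single run-together string, while keywords has one entry per span, so no gsub() or splitting step is needed afterwards.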