Is there a better way to scrape a Wikipedia page in R?

Question

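I am working with a dataset of US states, and I am trying to scrape the Wikipedia page "List of United States governors" so that I can tell Democratic states from Republican states.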

My code so far:

library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(htmltab)
library(lubridate)

corona_usa_simple <- readr::read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/us_simplified.csv")

corona_us_states <- corona_usa_simple %>%
  select(-FIPS, -Admin2, -`Country/Region`) %>%
  rename(State = `Province/State`)

wiki_govenors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors") %>% rename(State=`Democratic(24)  Republican(26) >> State`)

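Before merging the datasets, I want to rename the first column so that it is called "State" in both of them. But for some reason I get the error message "Can't rename columns that don't exist." Is there a better way to scrape the wiki page so that not every column name starts with "Democratic(24) Republican(26)"?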

Answer

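You can specify the header row in the call to htmltab() with the header argument. This names the columns correctly, but it leaves "Democratic(24) Republican(26)" in the first row of the data. To remove that row, use slice(-1) from dplyr: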

wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors", header = 2) %>% slice(-1)
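
A side note on the original error: rename() fails when the quoted name does not match the scraped column name character for character, and the whitespace in a name like "Democratic(24)  Republican(26) >> State" is easy to get wrong. Below is a minimal sketch of one way around that. It uses a toy data frame (`scraped`, with made-up contents) in place of the real htmltab() result, prints the exact names, and then renames by column position instead of by name.

```r
library(dplyr)

# Toy stand-in for the scraped table (hypothetical, two rows only).
scraped <- data.frame(
  a = c("Alabama", "Alaska"),
  b = c("Kay Ivey", "Mike Dunleavy"),
  stringsAsFactors = FALSE
)

# Mimic the long "header >> subheader" names from the question.
names(scraped) <- c(
  "Democratic(24)  Republican(26) >> State",
  "Democratic(24)  Republican(26) >> Governor"
)

# Step 1: look at the exact names before trying to match them by hand.
print(names(scraped))

# Step 2: rename by position, which does not depend on the exact spelling.
renamed <- scraped %>% rename(State = 1, Governor = 2)

print(names(renamed))
```

Renaming by position only works as long as the column order of the scraped table stays the same, so it is worth checking names() again if the Wikipedia page changes.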

Resulting data from the htmltab() call with header = 2 followed by slice(-1):

head(wiki_governors)

       State       Governor Party    Party.1                       Born
1    Alabama       Kay Ivey      Republican October 15,1944 (age 75)
2     Alaska  Mike Dunleavy      Republican      May 5,1961 (age 59)
3    Arizona     Doug Ducey      Republican    April 9,1964 (age 56)
4   Arkansas Asa Hutchinson      Republican December 3,1950 (age 69)
5 California   Gavin Newsom      Democratic October 10,1967 (age 52)
6   Colorado    Jared Polis      Democratic     May 12,1975 (age 45)
                                                                                                                                     Prior public experience
1                                                                                                                             Lieutenant Governor,Treasurer
2                                                                                                                                              Alaska Senate
3                                                                                                                                                  Treasurer
4 Under Secretary of Homeland Security for Border & Transportation Security,Administrator of the Drug Enforcement Administration,U.S. House,U.S. Attorney
5                                                                                                                Lieutenant Governor,Mayor of San Francisco
6                                                                                                              U.S. House,Colorado State Board of Education
      Inauguration        End of term Past governors
1   April 10,2017               2023           List
2 December 3,2018               2022           List
3  January 5,2015 2023 (term limits)           List
4 January 13,2015 2023 (term limits)           List
5  January 7,2019               2023           List
6  January 8,2019               2023           List
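
With a clean State column, the merge mentioned in the question can be sketched as a left join. The example below uses two small made-up data frames (`cases` and `governors`) in place of the real datasets, and it assumes that the party name ends up in the Party.1 column, as the output above suggests. The Confirmed column and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for the scraped governors table (values taken from the output above).
governors <- data.frame(
  State   = c("Alabama", "California"),
  Party.1 = c("Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the per-state dataset (hypothetical numbers).
cases <- data.frame(
  State     = c("Alabama", "California", "Alabama"),
  Confirmed = c(10, 20, 15),
  stringsAsFactors = FALSE
)

# Keep only the key and the party, give the party column a clearer name,
# then attach it to every row of the per-state data.
merged <- cases %>%
  left_join(governors %>% select(State, Party = Party.1), by = "State")

print(merged)
```

A left join keeps every row of the per-state data. Rows whose State value has no match in the governors table (for example territories, if they are not in the scraped table) get NA in the Party column.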
