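Problem description
I am working with a dataset of US states and am trying to scrape the Wikipedia page "List of United States governors" so that I can tell Democratic states from Republican states.
My code so far: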
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(htmltab)
library(lubridate)
corona_usa_simple <- readr::read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/us_simplified.csv")
corona_us_states <- corona_usa_simple %>%
  select(-FIPS, -Admin2, -`Country/Region`) %>%
  rename(State = `Province/State`)
# The rename() call below is the one that raises the error described next.
wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors") %>%
  rename(State = `Democratic(24) Republican(26) >> State`)
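Before merging the datasets, I want to rename the first column so that it is called "State" in both of them. For some reason I get the error message "Can't rename columns that don't exist." Is there a better way to scrape the wiki page so that not every column name starts with "Democratic(24) Republican(26)"?
Solution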
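You can specify the header argument in the htmltab() call. This names the columns correctly, but leaves "Democratic(24) Republican(26)" in the first row of the data. To remove that row, use slice(-1) from dplyr.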
wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors", header = 2) %>%
  slice(-1)
Resulting data:
head(wiki_governors)
State Governor Party Party.1 Born
1 Alabama Kay Ivey  Republican October 15,1944 (age 75)
2 Alaska Mike Dunleavy  Republican May 5,1961 (age 59)
3 Arizona Doug Ducey  Republican April 9,1964 (age 56)
4 Arkansas Asa Hutchinson  Republican December 3,1950 (age 69)
5 California Gavin Newsom  Democratic October 10,1967 (age 52)
6 Colorado Jared Polis  Democratic May 12,1975 (age 45)
Prior public experience
1 Lieutenant Governor,Treasurer
2 Alaska Senate
3 Treasurer
4 Under Secretary of Homeland Security for Border & Transportation Security,Administrator of the Drug Enforcement Administration,U.S. House,U.S. Attorney
5 Lieutenant Governor,Mayor of San Francisco
6 U.S. House,Colorado State Board of Education
Inauguration End of term Past governors
1 April 10,2017 2023 List
2 December 3,2018 2022 List
3 January 5,2015 2023 (term limits) List
4 January 13,2015 2023 (term limits) List
5 January 7,2019 2023 List
6 January 8,2019 2023 List
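With the table cleaned up, the merge mentioned in the question could look roughly like the following. This is only a minimal sketch on small stand-in data frames rather than the live scrape: the column names State, Party and Party.1 are taken from the output above, while toy_governors, toy_states and the Confirmed values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the scraped table: the first row mimics the
# leaked "Democratic(24) Republican(26)" header row.
toy_governors <- tibble(
  State   = c("Democratic(24) Republican(26)", "Alabama", "California"),
  Party   = c("Democratic(24) Republican(26)", "", ""),
  Party.1 = c("Democratic(24) Republican(26)", "Republican", "Democratic")
)

# Hypothetical stand-in for corona_us_states; the numbers are made up.
toy_states <- tibble(
  State     = c("Alabama", "California", "Guam"),
  Confirmed = c(10, 20, 5)
)

parties <- toy_governors %>%
  slice(-1) %>%                    # drop the leaked header row
  select(State, Party = Party.1)   # keep only the populated party column

# left_join() keeps every row of toy_states; rows without a match get NA.
merged <- toy_states %>%
  left_join(parties, by = "State")

print(merged)
```

A left join is used here so that rows in the states data without a matching governor entry are kept with a missing Party instead of being dropped; inner_join() would be the alternative if only matched states are wanted.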