Reading multiple tables from one TSV file into an R data frame

Question

I want to read data from GitHub in R. Here is my code.

library(tidyverse)
cluster_tables <- read_tsv("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/all_tables.tsv",skip_empty_rows = T)

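It reads only the first column and does not show the others. How can I get this dataset into a single data frame in R? Also, is there a way to create a column holding the name of each table on this page?

Answers

Read the data with skip = 4, so that parsing starts at the header row of the first table: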

cluster_tables <- readr::read_tsv("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/all_tables.tsv",skip = 4,skip_empty_rows = TRUE)
head(cluster_tables)

#   X1             first_seq  num_seqs last_seq  
#  <chr>          <chr>      <chr>    <chr>     
#1 Netherlands    2020-06-20 1615     2021-01-21
#2 Spain          2020-06-20 2003     2021-01-12
#3 United Kingdom 2020-07-07 69421    2021-01-23
#4 Belgium        2020-07-17 384      2021-01-20
#5 Switzerland    2020-07-22 1706     2021-01-19
#6 Ireland        2020-07-23 603      2021-01-22
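To see why skipping lines helps, it can be useful to count the tab-separated fields on each line. The sketch below uses a made-up sample that mimics the layout suggested by the outputs in this thread: a `## name` title line, a header line whose first field is empty, then the rows. The sample itself, and the second table name `TableB` with its row, are assumptions for illustration and not real contents of the file.

```r
# Made-up sample mimicking the assumed file layout (not the real data):
# a "## name" title line, a header with an empty first field, then rows.
sample_lines <- c(
  "## 20A.EU1",
  "\tfirst_seq\tnum_seqs\tlast_seq",
  "Netherlands\t2020-06-20\t1615\t2021-01-21",
  "Spain\t2020-06-20\t2003\t2021-01-12",
  "",
  "## TableB",                              # hypothetical second table
  "\tfirst_seq\tnum_seqs\tlast_seq",
  "Belgium\t2020-07-17\t384\t2021-01-20"
)

# Number of tab-separated fields on each line
n_fields <- lengths(strsplit(sample_lines, "\t", fixed = TRUE))
n_fields

# Title lines have a single field, headers and rows have four
which(n_fields == 1)
```

A title line has only one field, which is presumably why a reader that takes its column layout from the first line ends up with one column. Note also that skip only drops lines at the top of the file, so the title and header lines of later tables are still read as data rows. That would explain why every column above is parsed as character.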

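Since the page contains several tables, we can read all of them into one data frame automatically with a bit of manipulation:

Read the data with readLines.
Remove all empty lines.
Start a new list element every time a '##' line is encountered.
For each list element, take out the first value (the table name) and add it as a new column.
Combine the list of data frames into one big data frame (result).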
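The grouping step is the least obvious one, so here is a minimal offline sketch of it on a made-up sample. The sample lines and the table name `TableB` are assumptions for illustration, not real contents of the file.

```r
# Made-up sample mimicking the assumed layout (not the real data)
lines <- c(
  "## 20A.EU1",
  "\tfirst_seq\tnum_seqs\tlast_seq",
  "Netherlands\t2020-06-20\t1615\t2021-01-21",
  "Spain\t2020-06-20\t2003\t2021-01-12",
  "",
  "## TableB",                              # hypothetical second table
  "\tfirst_seq\tnum_seqs\tlast_seq",
  "Belgium\t2020-07-17\t384\t2021-01-20"
)

# Drop empty lines
lines <- lines[lines != ""]

# grepl() flags the title lines; cumsum() turns the flags into a
# running group id that increases by one at every title line
group_id <- cumsum(grepl("^##", lines))
group_id

# split() then yields one character vector per table
chunks <- split(lines, group_id)
length(chunks)
chunks[[2]][1]
```

Each element of `chunks` starts with its title line, followed by the header and the rows, which is the shape the per-table parsing step expects.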

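# Read the raw lines and drop the empty ones
tmp <- readLines('https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/all_tables.tsv')
tmp <- tmp[tmp != '']

# Start a new group at every '##' title line, then parse each group
result <- do.call(rbind, lapply(split(tmp, cumsum(grepl('^##', tmp))), function(x) {
  name <- sub('^##\\s+', '', x[1])  # table name from the title line
  x <- x[-1]                        # remaining lines: header and rows
  transform(read.csv(text = paste0(x, collapse = '\n'), sep = '\t'), name = name)
}))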

head(result)
#                 X  first_seq num_seqs   last_seq    name
#1.1    Netherlands 2020-06-20     1615 2021-01-21 20A.EU1
#1.2          Spain 2020-06-20     2003 2021-01-12 20A.EU1
#1.3 United Kingdom 2020-07-07    69421 2021-01-23 20A.EU1
#1.4        Belgium 2020-07-17      384 2021-01-20 20A.EU1
#1.5    Switzerland 2020-07-22     1706 2021-01-19 20A.EU1
#1.6        Ireland 2020-07-23      603 2021-01-22 20A.EU1
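The combined data frame keeps compound row names such as 1.1, and the unnamed first column shows up as X. A possible cleanup is sketched below on a small stand-in for `result`. The stand-in `result_sample` and the new column name `country` are assumptions for illustration.

```r
# Small stand-in with the same shape as the head() output above
result_sample <- data.frame(
  X         = c("Netherlands", "Spain"),
  first_seq = c("2020-06-20", "2020-06-20"),
  num_seqs  = c(1615L, 2003L),
  last_seq  = c("2021-01-21", "2021-01-12"),
  name      = "20A.EU1",
  row.names = c("1.1", "1.2"),
  stringsAsFactors = FALSE
)

# Replace the compound row names with plain 1, 2, ...
rownames(result_sample) <- NULL

# Give the unnamed first column a meaningful name (hypothetical choice)
names(result_sample)[names(result_sample) == "X"] <- "country"

# Parse the date columns, which are read as text
result_sample$first_seq <- as.Date(result_sample$first_seq)
result_sample$last_seq  <- as.Date(result_sample$last_seq)

str(result_sample)
```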

Taking inspiration from Ronak Shah's answer above, I tried a tidyverse version:

library(tidyverse)    

cluster_tables <- readLines('https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/all_tables.tsv')

# Assign the result back, otherwise head() below shows the raw lines
cluster_tables <- cluster_tables %>%
  as_tibble() %>% 
  separate(value,into = c("countries","first_seq","num_seqs","last_seq"),sep = "\t") %>% 
  filter(countries != "") %>% 
  mutate(variants = if_else(str_detect(countries,"## "),countries,NA_character_)) %>% 
  fill(variants,.direction = "down") %>% 
  filter(!is.na(first_seq))

head(cluster_tables)
# A tibble: 6 x 5
  countries      first_seq  num_seqs last_seq   variants  
  <chr>          <chr>      <chr>    <chr>      <chr>     
1 Netherlands    2020-06-20 1615     2021-01-21 ## 20A.EU1
2 Spain          2020-06-20 2003     2021-01-12 ## 20A.EU1
3 United Kingdom 2020-07-07 69421    2021-01-23 ## 20A.EU1
4 Belgium        2020-07-17 384      2021-01-20 ## 20A.EU1
5 Switzerland    2020-07-22 1706     2021-01-19 ## 20A.EU1
6 Ireland        2020-07-23 603      2021-01-22 ## 20A.EU1
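In the output above the variants column still carries the `## ` prefix and every column is character. A minimal sketch of a follow-up cleanup with dplyr and stringr is shown below. The tibble `parsed` is a small stand-in built from the rows printed above, and the names `parsed` and `cleaned` are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Small stand-in with the same columns as the tibble printed above
parsed <- tibble(
  countries = c("Netherlands", "Spain"),
  first_seq = c("2020-06-20", "2020-06-20"),
  num_seqs  = c("1615", "2003"),
  last_seq  = c("2021-01-21", "2021-01-12"),
  variants  = c("## 20A.EU1", "## 20A.EU1")
)

cleaned <- parsed %>%
  mutate(
    variants  = str_remove(variants, "^##\\s*"),  # drop the "## " prefix
    num_seqs  = as.integer(num_seqs),             # counts as integers
    first_seq = as.Date(first_seq),               # ISO dates as Date
    last_seq  = as.Date(last_seq)
  )

cleaned
```

The same `mutate()` call could be appended to the end of the pipeline above, so that the cleanup happens in one pass.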
