Problem description
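This should be a relatively simple matter of pulling items from a list into a data frame, but I am running into some problems.
I have the following list (I am only showing part of it; the full list is much longer):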
str(raw_jobs_list)
List of 2
$ :List of 4
..$ id : chr "3594134"
..$ score : int 1
..$ fields:List of 16
.. ..$ date :List of 3
.. .. ..$ changed: chr "2020-04-18T00:35:00+00:00"
.. .. ..$ created: chr "2020-04-07T11:15:37+00:00"
.. .. ..$ closing: chr "2020-04-17T00:00:00+00:00"
.. ..$ country :List of 1
.. .. ..$ :List of 6
.. .. .. ..$ href : chr "https://api.reliefweb.int/v1/countries/149"
.. .. .. ..$ name : chr "Mali"
.. .. .. ..$ location :List of 2
.. .. .. .. ..$ lon: num -1.25
.. .. .. .. ..$ lat: num 17.4
.. .. .. ..$ id : int 149
.. .. .. ..$ shortname: chr "Mali"
.. .. .. ..$ iso3 : chr "mli"
.. ..$ title : chr "REGIONAL MANAGER West Africa"
I tried pulling them out with:
jobs_data_df <- list.stack(list.select(raw_jobs_list,fields$title,fields$country$name,fields$date$created))
where raw_jobs_list is the list, but I get these NAs and do not know how to get past them.
glimpse(jobs_data_df)
Rows: 2
Columns: 3
$ V1 <chr> "REGIONAL MANAGER West Africa","Support Relief Group Public Health Advisor (Multiple Positions)"
$ V2 <lgl> NA,NA
$ V3 <chr> "2020-04-07T11:15:37+00:00","2020-05-04T15:20:37+00:00"
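I am probably overlooking something obvious, since I have rarely worked with lists before. Any ideas?
Thanks a lot!
P.S. In case you are interested, I am working with this API, and this is how I got this far.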
jobs <- GET(url = "https://api.reliefweb.int/v1/jobs?appname=apidoc&preset=analysis&profile=full&limit=2")
raw_jobs_list <- content(jobs)$data
The portion shown above is a subset of the whole data; here is a reduced version of the list that keeps only a few fields:
dput(lapply(raw_jobs_list,function(x) c(x[c("id","score")],list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))]))))
list(list(id = "3594134",score = 1L,fields = list(date = list(
changed = "2020-04-18T00:35:00+00:00",created = "2020-04-07T11:15:37+00:00",closing = "2020-04-17T00:00:00+00:00"),country = list(list(
href = "https://api.reliefweb.int/v1/countries/149",name = "Mali",location = list(lon = -1.25,lat = 17.35),id = 149L,shortname = "Mali",iso3 = "mli")),title = "REGIONAL MANAGER West Africa")),list(id = "3594129",fields = list(date = list(
changed = "2020-05-19T00:04:01+00:00",created = "2020-05-04T15:20:37+00:00",closing = "2020-05-18T00:00:00+00:00"),title = "Support Relief Group Public Health Advisor (Multiple Positions)")))
Solution
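If you look at just one element at a time, I think as.data.frame does a fairly good job. I will demonstrate with the abbreviated data (which I edited into your question); the first element looks like this: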
raw_jobs_sublist <- lapply(raw_jobs_list,function(x) c(x[c("id","score")],list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))
as.data.frame(raw_jobs_sublist[[1]])
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
Shown a different way (just for variety), it is
str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of 13 variables:
# $ id : chr "3594134"
# $ score : int 1
# $ fields.date.changed : chr "2020-04-18T00:35:00+00:00"
# $ fields.date.created : chr "2020-04-07T11:15:37+00:00"
# $ fields.date.closing : chr "2020-04-17T00:00:00+00:00"
# $ fields.country.href : chr "https://api.reliefweb.int/v1/countries/149"
# $ fields.country.name : chr "Mali"
# $ fields.country.location.lon: num -1.25
# $ fields.country.location.lat: num 17.4
# $ fields.country.id : int 149
# $ fields.country.shortname : chr "Mali"
# $ fields.country.iso3 : chr "mli"
# $ fields.title : chr "REGIONAL MANAGER West Africa"
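To make the flattening concrete, here is a minimal sketch on a hand-made element. The values are copied from the question, but the object names `job` and `flat` are mine and only illustrative. It also shows that `country` is an unnamed list of length one, so `fields$country$name` finds nothing, which is presumably why V2 came back as NA in the question.

```r
# Hand-made element mirroring the structure in the question
# (the names `job` and `flat` are illustrative, not from the original post)
job <- list(
  id = "3594134",
  fields = list(
    title = "REGIONAL MANAGER West Africa",
    country = list(list(name = "Mali", iso3 = "mli"))
  )
)

# `country` is an unnamed list of length one, so `$name` on it is NULL
is.null(job$fields$country$name)   # TRUE

# The name has to be reached through the first (unnamed) element
job$fields$country[[1]]$name       # "Mali"

# as.data.frame() flattens the nesting into one row and
# builds the column names from the nested list names
flat <- as.data.frame(job, stringsAsFactors = FALSE)
names(flat)
```

If you only ever need a handful of fields, indexing with `[[1]]` as above may be enough; the approach below is aimed at keeping everything.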
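To do this conversion for all elements, we need to keep two things in mind:
- not every element has every field, so whatever method we use needs to "fill in the blanks";
- we do not want to combine them iteratively, one at a time; let's combine them all in a single step.
Here is a first stab: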
dplyr::bind_rows(lapply(raw_jobs_sublist,as.data.frame))
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
# 2 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Support Relief Group Public Health Advisor (Multiple Positions)
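This also works with data.table::rbindlist, provided you pass fill = TRUE. It does not work with do.call(rbind.data.frame, ...), because that is less tolerant of missing names. (That can be worked around without much trouble, but these two options occasionally have other benefits as well.)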
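For completeness, one way the base R limitation could be worked around is to pad every frame to a common set of columns before binding. This is only a sketch under my own assumptions: the two small frames are made-up stand-ins for the per-element data frames, and `frames`, `all_cols`, `padded` and `combined` are illustrative names.

```r
# Two one-row frames with different columns (made-up stand-ins)
frames <- list(
  data.frame(id = "3594134", country = "Mali",
             title = "REGIONAL MANAGER West Africa",
             stringsAsFactors = FALSE),
  data.frame(id = "3594129",
             title = "Support Relief Group Public Health Advisor (Multiple Positions)",
             stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(frames, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(frames, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
  d[all_cols]
})

# Now every frame has the same names, so rbind no longer complains
combined <- do.call(rbind, padded)
combined
```

This is more typing than bind_rows or rbindlist, which is one reason I would still reach for those two first.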
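Note: if you do this on the original data, R's default mechanism for printing a data.frame will flood your console with text, which can be annoying. If you already use dplyr or data.table in your work, both provide print formats that truncate long strings, which makes them easier on the console. For example, showing the whole thing: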
tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list,as.data.frame)))
# # A tibble: 2 x 42
# id score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
# <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <chr>
# 1 3594~ 1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali -1.25 17.4 149 Mali mli Donor Relations~ 20966 Bamako https://api.rel~ ICCO COOPERATION 45059 Non-governmenta~ 274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~ 1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA> <NA> NA NA NA <NA> <NA> Program/Project~ 6867 <NA> https://api.rel~ US Agency for I~ 1751 Government 271 USAID http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>,fields.type.id <int>,fields.experience.name <chr>,fields.experience.id <int>,fields.url <chr>,fields.url_alias <chr>,fields.how_to_apply <chr>,fields.id <int>,fields.status <chr>,fields.body.html <chr>,fields.how_to_apply.html <chr>,href <chr>,fields.source.longname <chr>,fields.source.spanish_name <chr>,
# #   fields.theme.name <chr>,fields.theme.id <int>,fields.theme.name.1 <chr>,fields.theme.id.1 <int>
data.table::rbindlist(lapply(raw_jobs_list,as.data.frame),fill = TRUE)
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.career_categories.name fields.career_categories.id fields.name
# <char> <int> <char> <char> <char> <char> <char> <num> <num> <int> <char> <char> <char> <int> <char>
# 1: 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri... Mali -1.25 17.35 149 Mali mli Donor Relations/Grants Management 20966 Bamako
# 2: 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Program/Project Management 6867 <NA>
# 27 variables not shown: [fields.source.href <char>,fields.source.name <char>,fields.source.id <int>,fields.source.type.name <char>,fields.source.type.id <int>,fields.source.shortname <char>,fields.source.homepage <char>,fields.title <char>,fields.body <char>,fields.type.name <char>,...]
For data.table, I have set a few options to make this easier. Notably, I am currently using:
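options(
  datatable.prettyprint.char = 36,     # truncate long strings when printing
  datatable.print.topn = 10,           # rows shown from the top and bottom of long tables
  datatable.print.class = TRUE,        # show each column's class under its name
  datatable.print.trunc.cols = TRUE    # only print the columns that fit the console width
)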
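At this point, you have a data.frame that should contain all the data (with NA for the elements that lack a field). From here, if you do not like the nested naming convention (e.g., fields.date.changed), the columns can easily be renamed using patterns or conventional methods.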
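As one possible pattern-based rename, the sketch below strips the leading `fields.` prefix and turns the remaining dots into underscores. The small `df` is a made-up stand-in for the combined result, and the target naming style is my own choice rather than anything the data requires.

```r
# Made-up stand-in for the combined data.frame (only a few columns)
df <- data.frame(
  id = "3594134",
  fields.date.changed = "2020-04-18T00:35:00+00:00",
  fields.country.name = "Mali",
  fields.title = "REGIONAL MANAGER West Africa",
  stringsAsFactors = FALSE
)

# Drop the leading "fields." prefix where present
names(df) <- sub("^fields\\.", "", names(df))

# Replace the remaining dots with underscores (fixed = TRUE: literal dot, not regex)
names(df) <- gsub(".", "_", names(df), fixed = TRUE)

names(df)
# "id" "date_changed" "country_name" "title"
```

On the full data it is worth checking for duplicates afterwards (for example with `anyDuplicated(names(df))`), since stripping a prefix can make two columns collide.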