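Problem description
Please find my data below. I have two problems.
I am trying to merge the yy$n_otte values into the missing h$n_otte values. My approach is a dplyr::left_join between h and yy, matched on study, os.neck, n_sygdom, and age. I need to match on all of these variables because h and yy are built from two large spreadsheets.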
> head(h)
study os.neck age n_sygdom n_otte
1 B 49.00 53 0 N0
2 B 1.00 83 0 N0
3 A 76.44 63 2 <NA>
4 B 11.00 45 0 N0
5 A 9.21 37 15 <NA>
6 B 1.00 60 1 N1
and
> head(yy)
study os.neck n_sygdom age n_otte
1 A 42.12 0 63 N0
2 A 30.72 0 61 N0
3 A 136.20 0 48 N0
4 A 23.40 0 63 N0
5 A 5.16 3 67 N3b
6 A 33.96 0 58 N0
Problem 1: Why does as.integer() change my values?
> str(yy)
'data.frame': 643 obs. of 5 variables:
$ study : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ os.neck : num 42.12 30.72 136.2 23.4 5.16 ...
$ n_sygdom: Factor w/ 22 levels "0","1","10","11",..: 1 1 1 1 13 1 11 11 2 1 ...
$ age : num 63 61 48 63 67 58 23 52 53 62 ...
$ n_otte : Factor w/ 6 levels "N0","N1","N2a",..: 1 1 1 1 6 1 6 4 3 1 ...
I am trying
yy <- yy %>% mutate(n_sygdom = as.integer(n_sygdom))
but the values in yy$n_sygdom change.
> head(yy)
study os.neck n_sygdom age n_otte
1 A 42.12 1 63 N0
2 A 30.72 1 61 N0
3 A 136.20 1 48 N0
4 A 23.40 1 63 N0
5 A 5.16 13 67 N3b
6 A 33.96 1 58 N0
Question
Why does yy$n_sygdom change? I want yy$n_sygdom stored as an integer, but keeping the original integer values.
Problem 2: left_join does not produce the expected output
Obviously, Problem 1 needs to be solved first, since
a <- left_join(h,yy,by=c("study","os.neck","age","n_sygdom"))
yields
Can't join on 'n_sygdom' x 'n_sygdom' because of incompatible types (factor / integer)
However, the problem I have also shows up here (joining without n_sygdom):
a <- left_join(h,yy,by=c("study","os.neck","age"))
> head(a)
study os.neck age n_sygdom.x n_otte.x n_sygdom.y n_otte.y
1 B 49.00 53 0 N0 <NA> <NA>
2 B 1.00 83 0 N0 <NA> <NA>
3 A 76.44 63 2 <NA> <NA> <NA>
4 B 11.00 45 0 N0 <NA> <NA>
5 A 9.21 37 15 <NA> 15 N3b
6 B 1.00 60 1 N1 <NA> <NA>
Expected output
> head(a)
study os.neck age n_sygdom n_otte
1 B 49.00 53 0 N0
2 B 1.00 83 0 N0
3 A 76.44 63 2 <NA>
4 B 11.00 45 0 N0
5 A 9.21 37 15 N3b
6 B 1.00 60 1 N1
My data
h <- structure(list(study = c("B","B","A","C","B"),os.neck = c(49,1,76.44,11,9.21,2.07,4.08,17,41,38,84.96,5.64,93.86,11.52,5.29,61,10.95,3.68,24,63,21,68,6.12,7,48,11.38,73.68,27.53,12,19,17.98,55,77.77,39,4,13,57.56,24.59,46.55,83.02,14,42,49.58,33.58,33,29.96,10.41,67,8,94.72,2,7.03,46.36,23.76,57.48,14.49,14.69,39.62,5,35.78,75,80.82,54.24,49.12,87,50.96,2.4,10,7.2,34.56,104.08,28,29,5.04,54.96,49,4.27,47.93,60,47,3,32,23,13.97,32),age = c(53,83,45,37,52,64,53,78,43,72,65,59,58,51,62,66,56,69,71,79,54,57,50,77,74,85,70,80,34,81,46,49),n_sygdom = c(0L,0L,2L,15L,1L,8L,6L,3L,5L,20L,4L,9L,23L,10L,0L),n_otte = structure(c(1L,NA,1L),.Label = c("N0","N2b","N2c","N3b"
),class = "factor")),row.names = c(NA,-100L),class = "data.frame")
Data to pull values from
yy <- structure(list(study = structure(c(1L,.Label = "A",class = "factor"),os.neck = c(24.84,24.84,9.76,98.28,19.08,111.48,41.52,47.28,35.24,6.38,39.78,35.52,70.08,12.49,19.33,3.02,40.77,32.71,40.08,59.4,52.18,48.33,1.38,26.89,59.18,6.24,80.65,5.13,49.84,9.48,3.25,46.42,25.15,10.8,17.1,27.6,4.68,12.3,52.96,49.97,10.98,44.64,9.5,20.19,11.97,22.88,60.59,85.15,55.04,28.2,33.96,2.76,4.77,9.96,33.4,27.29,37.2,36.36,90.28,53.65,32.09,68.28,7.63,22.32,43.2,9.36,5.88,14.79,48.1,45.24,110.01,42.12,0.3,0.56,11.88,46.26,59.15,87.22,11.93,88.8,29.19,14.07,11.21,16.08,20.58,3.48,73.74,45.72),n_sygdom = structure(c(2L,16L,11L,13L,18L,17L,.Label = c("0","12","13","14","15","17","18","2","20","3","35","39","4","5","6","7","8","9","number"
),age = c(44,44,30,35,76,26,73,82,86,84,27,34),n_otte = structure(c(3L,"N3b"),class = "data.frame")
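Solution
Solution to Problem 1:
To convert a factor to the numbers its labels represent, convert it to character first. Factors are stored internally as integer codes, so converting one directly to a number returns those internal codes rather than the labels. In yy$n_sygdom the levels are sorted as text ("0","1","10","11",...), which is why "0" became 1 and "3" became 13.
This example may make it clear: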
as.integer(factor(c(2,10,3,0)))
[1] 2 4 3 1
as.integer(as.character(factor(c(2,10,3,0))))
[1] 2 10 3 0
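The same effect can be reproduced with character labels, which is closer to how n_sygdom is stored in yy. The following is a minimal sketch on a made-up vector (f is a hypothetical name, not part of the original data). It shows the text-sorted levels, the internal codes, and two ways of recovering the labels as integers:

```r
# Hypothetical factor whose labels are digits stored as text
f <- factor(c("0", "3", "10", "1"))

lv    <- levels(f)                    # levels are sorted as text, not as numbers
codes <- as.integer(f)                # internal codes: positions within levels(f)

via_char   <- as.integer(as.character(f))  # labels -> character -> integer
via_levels <- as.integer(levels(f))[f]     # convert the levels once, then index by code

print(lv)
print(codes)
print(via_char)
print(via_levels)
```

Both conversions return the original labels as integers. The levels(f)[f] form converts each level only once, which may matter for long vectors with few levels.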
For your example,
library(dplyr)
yy <- yy %>% mutate(n_sygdom = as.integer(as.character(n_sygdom)))
Solution to Problem 2:
Now you can join h and yy and use coalesce to take the first non-NA value between n_otte.x and n_otte.y.
left_join(h,yy,by=c("study","os.neck","age","n_sygdom")) %>%
mutate(n_otte = coalesce(n_otte.x,n_otte.y)) %>%
select(-n_otte.x,-n_otte.y)
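To see the whole pipeline run end to end, here is a small sketch on toy data. The data frames h_toy and yy_toy are hypothetical stand-ins with a few rows modelled on the head() output above, not the real data. n_otte is kept as character here as a simplification, so that coalesce does not have to reconcile two factors with different level sets.

```r
library(dplyr)

# Hypothetical stand-in for h: two rows are missing n_otte
h_toy <- data.frame(
  study    = c("B", "A", "A"),
  os.neck  = c(49, 76.44, 9.21),
  age      = c(53, 63, 37),
  n_sygdom = c(0L, 2L, 15L),
  n_otte   = c("N0", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for yy: n_sygdom is a factor, as in str(yy)
yy_toy <- data.frame(
  study    = "A",
  os.neck  = 9.21,
  age      = 37,
  n_sygdom = factor("15"),
  n_otte   = "N3b",
  stringsAsFactors = FALSE
)

# Step 1: turn the factor labels into integers so the key types match
yy_toy <- yy_toy %>%
  mutate(n_sygdom = as.integer(as.character(n_sygdom)))

# Step 2: join on all four keys, then keep the first non-NA n_otte
filled <- left_join(h_toy, yy_toy,
                    by = c("study", "os.neck", "age", "n_sygdom")) %>%
  mutate(n_otte = coalesce(n_otte.x, n_otte.y)) %>%
  select(-n_otte.x, -n_otte.y)

print(filled)
```

Rows of h_toy without a match in yy_toy keep their original n_otte, including NA. Note that os.neck is a double, so the join relies on the values being exactly equal in both tables.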