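Problem description
I have a data file with several thousand rows that contains some blanks, which I want to fill with a value. I need to replace each empty cell with the value above it. To make my data easier to understand, here is an example: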
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1,2,3,4,1,2,1,2,3,4,5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable,Value,Description)
> df
   Variable Value Description
1       AGE     1       18-24
2               2       25-34
3               3       35-44
4               4         45+
5       SEX     1      Female
6               2        Male
7   SEGMENT     1           A
8               2           B
9               3           C
10              4           D
11              5           E
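You can see the blanks in the first column above. I need to replace these empty cells with the relevant value from above, so that the new variable in the data frame looks like this: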
> df
   Variable Value Description Variable_NEW
1       AGE     1       18-24          AGE
2               2       25-34          AGE
3               3       35-44          AGE
4               4         45+          AGE
5       SEX     1      Female          SEX
6               2        Male          SEX
7   SEGMENT     1           A      SEGMENT
8               2           B      SEGMENT
9               3           C      SEGMENT
10              4           D      SEGMENT
11              5           E      SEGMENT
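Thinking out loud: I assume that to achieve this I would need to create a new variable with a loop, using logic something like this:
IF Variable[n]="" THEN Variable_New[n] = Variable[n-1], ELSE Variable_New[n] = Variable[n]
I am familiar with loops, but I don't know how to write this kind of thing in R with a lag / n-1 function. There are probably many ways to do this, but a loop would be preferred. Any help would be greatly appreciated. Thanks.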
Solution
Here is a loop approach:
# Data
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1,2,3,4,1,2,1,2,3,4,5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable,Value,Description,stringsAsFactors = FALSE)
# Create new column as a copy of the original
df$NewVar <- df$Variable
# Loop from the second row on: a blank cell takes the (already filled) value of the row above
for (i in seq_len(nrow(df))[-1]) {
  if (df$NewVar[i] == "") {
    df$NewVar[i] <- df$NewVar[i - 1]
  }
}
Output:
   Variable Value Description  NewVar
1       AGE     1       18-24     AGE
2               2       25-34     AGE
3               3       35-44     AGE
4               4         45+     AGE
5       SEX     1      Female     SEX
6               2        Male     SEX
7   SEGMENT     1           A SEGMENT
8               2           B SEGMENT
9               3           C SEGMENT
10              4           D SEGMENT
11              5           E SEGMENT
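For comparison, the same fill can be done in base R without an explicit loop. This is only a sketch, not part of the answer above: it assumes the first element is never blank, and the name `is_label` is made up for illustration.

```r
# Same sample column as above (blank strings mark the cells to fill)
Variable <- c("AGE", "", "", "", "SEX", "", "SEGMENT", "", "", "", "")

# TRUE where a new label starts, FALSE where the cell is blank
is_label <- Variable != ""

# cumsum() counts how many labels have appeared up to each row, so indexing
# the labels by that count carries each label down over its blanks.
# Assumption: Variable[1] is not blank; a leading blank would give index 0,
# which R silently drops, and the result would come out too short.
NewVar <- Variable[is_label][cumsum(is_label)]
print(NewVar)
```

The loop is easier to read; the indexing version mainly pays off on long columns, where row-by-row assignment in R gets slow.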
You don't need to write a loop; existing functions can do this for you.
You can replace the blank values with NA and then use fill:
library(dplyr)
df %>%
mutate(Variable_NEW = replace(Variable,Variable == "",NA)) %>%
tidyr::fill(Variable_NEW)
#   Variable Value Description Variable_NEW
#1       AGE     1       18-24          AGE
#2               2       25-34          AGE
#3               3       35-44          AGE
#4               4         45+          AGE
#5       SEX     1      Female          SEX
#6               2        Male          SEX
#7   SEGMENT     1           A      SEGMENT
#8               2           B      SEGMENT
#9               3           C      SEGMENT
#10              4           D      SEGMENT
#11              5           E      SEGMENT
You can write your own function with a loop, or use the na.locf function from the zoo package to fill in the missing (NA) values. Example:
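fillin <- function(x) {
  # Carry the previous value forward into each NA or blank entry
  for (i in seq_along(x)[-1]) {
    if (x[i] %in% c(NA, "")) {
      x[i] <- x[i - 1]
    }
  }
  x
}
Variable <- c("AGE","","","","SEX","","SEGMENT","","","","")
Value <- c(1,2,3,4,1,2,1,2,3,4,5)
Description <- c("18-24","25-34","35-44","45+","Female","Male","A","B","C","D","E")
df <- data.frame(Variable,Value,Description)
df$Variable_fillin <- fillin(df$Variable)
library(zoo)
# na.locf only fills NA, so turn the blanks into NA first
df$Variable[df$Variable == ""] <- NA
df$Variable_nalocf <- na.locf(df$Variable)
df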
#>    Variable Value Description Variable_fillin Variable_nalocf
#> 1       AGE     1       18-24             AGE             AGE
#> 2      <NA>     2       25-34             AGE             AGE
#> 3      <NA>     3       35-44             AGE             AGE
#> 4      <NA>     4         45+             AGE             AGE
#> 5       SEX     1      Female             SEX             SEX
#> 6      <NA>     2        Male             SEX             SEX
#> 7   SEGMENT     1           A         SEGMENT         SEGMENT
#> 8      <NA>     2           B         SEGMENT         SEGMENT
#> 9      <NA>     3           C         SEGMENT         SEGMENT
#> 10     <NA>     4           D         SEGMENT         SEGMENT
#> 11     <NA>     5           E         SEGMENT         SEGMENT
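One thing to watch with na.locf: by default (na.rm = TRUE) it drops leading NAs, so the result can be shorter than the input, and assigning it back to a data frame column then fails. The sample data starts with "AGE", so it is not affected. The sketch below uses a made-up vector `x` to show the difference.

```r
library(zoo)

# Hypothetical vector whose first value is missing
x <- c(NA, "AGE", NA, "SEX", NA)

# Default (na.rm = TRUE): the leading NA is removed, so the result is shorter
short <- na.locf(x)
print(length(short))

# na.rm = FALSE keeps the leading NA and preserves the original length
full <- na.locf(x, na.rm = FALSE)
print(length(full))
print(full)
```

If the real file might begin with a blank row, passing na.rm = FALSE is the safer choice when writing the result into a new column.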
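This replaces "" with NA (missing) in the character columns, then fills down the variable named Variable:
library(dplyr)
df %>%
  # Restricted to character columns: recent dplyr versions reject comparing a numeric column with ""
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  tidyr::fill(Variable, .direction = "down")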
Using data.table and a for loop:
library(data.table)
DT <- as.data.table(df)
DT[,Variable_new := Variable[1]]
for (i in 2:nrow(DT)) {
DT[i,Variable_new := fifelse(DT[i,Variable] == '',DT[i-1,Variable_new],DT[i,Variable])]
}
> DT
    Variable Value Description Variable_new
 1:      AGE     1       18-24          AGE
 2:              2       25-34          AGE
 3:              3       35-44          AGE
 4:              4         45+          AGE
 5:      SEX     1      Female          SEX
 6:              2        Male          SEX
 7:  SEGMENT     1           A      SEGMENT
 8:              2           B      SEGMENT
 9:              3           C      SEGMENT
10:              4           D      SEGMENT
11:              5           E      SEGMENT
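If a loop is not strictly required, data.table can also do this with a grouped assignment. This is a sketch of an alternative, not the method above: it treats each non-blank Variable as the start of a new group and copies the group's first value to all its rows.

```r
library(data.table)

# Two columns of the sample data are enough to show the idea
DT <- data.table(
  Variable = c("AGE", "", "", "", "SEX", "", "SEGMENT", "", "", "", ""),
  Value    = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5)
)

# cumsum(Variable != "") increases by one at every non-blank cell, so each
# label and the blanks below it share one group id. Within a group the first
# value is the label, which := recycles over all rows of the group.
DT[, Variable_new := Variable[1L], by = cumsum(Variable != "")]
print(DT)
```

Blank rows before the first label would end up in group 0 and stay blank, since there is nothing above them to copy.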