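Problem description
I am trying to use the dplyr verb mutate to create a variable that indicates which quantile of another variable each observation falls into.
So far I have written a function that identifies the quantile and returns it as a factor, and it does work.
For example: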
# 1. Fake data:
library(dplyr)
data <- data.frame(
  id = 1:20, score = round(rnorm(20, 30, 20)))
# 2. Creating variable 'Quantile_5'
data <- data %>%
  mutate(Quantile_5 = ????)
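However, this is not a good solution if, for example, I want to create a variable "Quantile_100" as a factor indicating where each observation falls on a scale from 1 to 100 (in the context of a larger dataset). Is there a simpler way to create these quantile variables?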
Solution
I hope this is what you are looking for:
library(dplyr)
data <- data.frame(
  id = 1:20, score = round(rnorm(20, 30, 20)))
data %>%
  mutate(quantile100 = findInterval(score,
                                    quantile(score, probs = seq(0, 1, 0.01)),
                                    rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
   id score quantile100
1   1    59          95
2   2    47          90
3   3    83         100
4   4    33          53
5   5     7          11
6   6    26          43
7   7    16          16
8   8    18          27
9   9    33          53
10 10    47          90
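I chose to close the rightmost bin so that the largest category does not exceed 100. We can also verify this with your own example, which leads to the same result: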
df %>%
  mutate(quantile5 = findInterval(score,
                                  quantile(score, probs = seq(0, 1, 0.2)),
                                  rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
   id score quantile5
1   1    55         5
2   2    56         5
3   3    26         3
4   4    42         3
5   5    41         3
6   6    26         3
7   7    57         5
8   8    12         1
9   9    21         2
10 10    25         2
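To make the effect of rightmost.closed concrete, here is a minimal base R sketch. The breakpoints and values are toy numbers chosen for illustration, not data from this answer.

```r
# Toy breakpoints and values (illustrative only).
breaks <- c(0, 10, 20, 30)
x <- c(0, 5, 10, 29, 30)

# Default: every interval is [lower, upper), so a value equal to the
# last breakpoint is counted as lying beyond the final bin.
open_bins <- findInterval(x, breaks)

# rightmost.closed = TRUE turns the last interval into [20, 30],
# so the maximum stays in bin 3 instead of spilling into a 4th bin.
closed_bins <- findInterval(x, breaks, rightmost.closed = TRUE)

print(open_bins)    # 1 1 2 3 4
print(closed_bins)  # 1 1 2 3 3
```

With quantile breakpoints the same thing happens at the maximum score: without rightmost.closed = TRUE it would land in bin 101 rather than bin 100.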
Data
df <- structure(list(id = 1:20, score = c(55L, 56L, 26L, 42L, 41L, 26L, 57L, 12L, 21L, 25L, 37L, 18L, 54L, 47L, 52L, -4L, 53L, 51L, -7L, -2L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"))
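Since the question asks for something that scales from 5 groups to 100, the findInterval approach can be wrapped in a small function that takes the number of groups as an argument. This is only a sketch: quantile_bin is a hypothetical name, not a dplyr or base R function, and the vector 1:20 is toy input.

```r
# Hypothetical helper (not part of any package): assign each value of x
# to one of n quantile groups, numbered 1..n.
quantile_bin <- function(x, n) {
  # n + 1 breakpoints from the minimum (0%) to the maximum (100%)
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  # close the last interval so the maximum falls in group n, not n + 1
  findInterval(x, breaks, rightmost.closed = TRUE)
}

toy_score <- 1:20                      # toy data, assumed for illustration
quintile <- quantile_bin(toy_score, 5)
decile <- quantile_bin(toy_score, 10)

print(table(quintile))  # four observations in each of the 5 groups
print(table(decile))    # two observations in each of the 10 groups
```

Inside a pipeline this could be used as mutate(Quantile_100 = quantile_bin(score, 100)). With heavily tied data some breakpoints coincide, so some group numbers may never be assigned.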
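Here are two options using cut:
1.
library(dplyr)
# Splits the range of score into 100 equal-width bins (not equal-count quantiles).
data %>% mutate(quantile100 = cut(score, 100, labels = FALSE))
2.
# This is similar to @Anoushiravan R's `findInterval` approach.
# include.lowest = TRUE keeps the minimum score from becoming NA.
data %>%
  mutate(quantile100 = cut(score, unique(quantile(score, seq(0, 1, 0.01))), labels = FALSE, include.lowest = TRUE))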
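The cut version is described as similar to findInterval rather than identical because the two functions close their intervals on different sides by default: cut uses (lower, upper], findInterval uses [lower, upper). The sketch below, on simulated data with an arbitrary seed, suggests that cut gives the same codes as findInterval once right = FALSE and include.lowest = TRUE are set, provided the breakpoints are distinct.

```r
set.seed(42)                  # arbitrary seed, only for reproducibility
score <- rnorm(200, 30, 20)   # unrounded, so quantile breakpoints are distinct

breaks <- quantile(score, probs = seq(0, 1, 0.01))

# findInterval: left-closed intervals, last one closed on both sides
by_find <- findInterval(score, breaks, rightmost.closed = TRUE)

# cut: switch to left-closed intervals and include the maximum
by_cut <- cut(score, breaks, labels = FALSE,
              right = FALSE, include.lowest = TRUE)

print(all(by_find == by_cut))  # TRUE
print(range(by_find))          # 1 100
```

With rounded scores, ties can produce repeated breakpoints. cut then needs unique(), which shifts the numbering of the bins, so the two results may no longer match exactly.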