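Problem description
I'm new to R, so please bear with me. I'm looking at incarceration data and have a variable, conviction, which is a messy string that looks like this: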
[1] "Ct. 1: Conspiracy to distribute"
[2] "Aggravated Assault"
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"
[4] "Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"
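Ideally, I'd like to do two things. First, I'd like to parse each Ct. into its own column. For the first three rows, the data would look like this: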
convictions conviction_1 conviction_2
[1,] "Ct. 1: Conspiracy to distribute" "Conspiracy to distribute" NA
[2,] "Aggravated Assault" "Aggravated Assault" NA
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"
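But things get messy when I reach the fourth entry, because I want to parse the first part of the string (Ct. 1-6: Human Trafficking) into 6 columns, and then parse Cts. 7,8 Unlawful contact into another 2 columns.
Second, I'd like to generate a variable, convictions_total, that holds the highest number found after Ct. in the conviction string. For the three example entries I've included here, convictions_total would look like: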
[1] 1 2 36
Here is the code I used to parse a more straightforward string variable, but I'm not sure how to adapt it for this one:
cols <- data.frame(str_split_fixed(data$convictions,",",Inf))
colnames(cols) <- paste0("conviction_",rep(1:length(cols)))
data <- cbind(data,cols)
Thanks in advance!
Solution
The following works for your example without much regex, mostly digit extraction and other string detection:
library(stringr)
library(magrittr)
library(purrr)
library(plyr)
convictions_total <- sapply(stringr::str_extract_all(convictions,"\\d+"),function(x) max(as.numeric(x),1))
convictions_split <- strsplit(convictions,";")
reps <- lapply(convictions_split,FUN = function(x) {
  sapply(x,FUN = function(i) {
    # keep only the digits, dashes and commas of this count
    num <- paste(stringr::str_extract_all(i,"[\\d+\\-,]")[[1]],collapse = "")
    if (stringr::str_detect(num,"-")) {
      # "-" indicates a range: take largest value
      max(as.numeric(unlist(stringr::str_extract_all(num,"\\d+"))))
    } else if (stringr::str_detect(num,",")) {
      # "," indicates a sequence: get length of sequence
      stringr::str_count(num,",") + 1
    } else {
      # otherwise return 1
      1
    }
  })
})
convictions_str <- lapply(convictions_split,function(x) gsub(".*\\d:?\\s(.*)$","\\1",x))
df <- purrr::map2(convictions_str,reps,rep) %>%
plyr::ldply(rbind) %>%
cbind(convictions_total,.) %>%
data.frame() %>%
dplyr::rename_with(~ gsub("X","conviction_",.x),starts_with("X"))
Output
convictions_total conviction_1 conviction_2 conviction_3
1 1 Conspiracy to distribute <NA> <NA>
2 1 Aggravated Assault <NA> <NA>
3 2 Possession of prohibited object criminal forfeiture <NA>
4 36 Human Trafficking Human Trafficking Human Trafficking
conviction_4 conviction_5 conviction_6 conviction_7 conviction_8
1 <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA>
4 Human Trafficking Human Trafficking Human Trafficking Unlawful contact Unlawful contact
conviction_9 conviction_10
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 Involuntary Servitude Smuggling
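For readers who would rather not load purrr and plyr, the repeat-and-pad step can be sketched in base R. This is only an illustration on toy inputs: strs and n are made-up stand-ins for convictions_str and reps, and Map plus a length<- padding trick is swapped in for purrr::map2 and plyr::ldply(rbind).

```r
# Toy inputs mirroring the structure of convictions_str and reps (illustrative values).
strs <- list(c("Human Trafficking", "Smuggling"), "Aggravated Assault")
n    <- list(c(2, 1), 1)

# Repeat each description by its count, element by element.
repeated <- Map(rep, strs, n)

# Pad every vector with NA up to the longest length, then bind the rows together.
width  <- max(lengths(repeated))
padded <- do.call(rbind, lapply(repeated, function(v) {
  length(v) <- width  # extending a vector fills the new slots with NA
  v
}))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
names(out) <- paste0("conviction_", seq_len(width))
out
```

The first row ends up with "Human Trafficking" twice followed by "Smuggling", and the second row is padded with NA after "Aggravated Assault".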
Data
convictions <- c("Ct. 1: Conspiracy to distribute","Aggravated Assault","Ct. 1: Possession of prohibited object; Ct.: 2 criminal forfeiture","Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling")
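How it works
- convictions_total is easy to obtain: stringr::str_extract_all pulls every number out of each element of convictions, which returns a list of vectors. sapply then takes the maximum of each vector in the list and returns a vector.
- reps is a list whose elements correspond to the elements of convictions. Each element stores a numeric vector giving the number of times each conviction count is repeated.
The code first splits convictions into a list of vectors, and from each piece it extracts the following: digits (\\d+), dashes (\\-) and commas (,). The logic then works by searching these extracted strings:
- First, if it finds a "-" in the conviction count, that indicates a range, and it again takes the maximum. For example, "Ct. 1-6: Human Trafficking" will return 6.
- Next, if it does not find a "-" but does find a ",", that indicates a count separator. So it counts the comma separators and adds one. For example, "Cts. 7,8 Unlawful contact" will return 2.
- Everything else is assumed to be repeated only once, because it is not a sequential list or a range.
reps
[[1]]
Ct. 1: Conspiracy to distribute
1
[[2]]
Aggravated Assault
1
[[3]]
Ct. 1: Possession of prohibited object    Ct.: 2 criminal forfeiture
1    1
[[4]]
Ct. 1-6: Human Trafficking    Cts. 7,8 Unlawful contact    Ct. 11: Involuntary Servitude
6    2    1
Ct. 36: Smuggling
1
- convictions_str just extracts the actual conviction information. For example, the code will extract "Conspiracy to distribute" from "Ct. 1: Conspiracy to distribute", and so on for all of the convictions.
convictions_str
[[1]]
[1] "Conspiracy to distribute"
[[2]]
[1] "Aggravated Assault"
[[3]]
[1] "Possession of prohibited object" "criminal forfeiture"
[[4]]
[1] "Human Trafficking" "Unlawful contact" "Involuntary Servitude"
[4] "Smuggling"
At this point convictions_str and reps have a related structure:
- convictions_str[[1]][1] should be repeated reps[[1]][1] times
- convictions_str[[1]][2] should be repeated reps[[1]][2] times
- etc.
Taking advantage of this structure, the purrr::map2 function is used with rep to repeat the elements of convictions_str by the values stored in reps, and it outputs a list. The plyr::ldply line pads this list with NA, because not everyone has the same number of convictions. cbind adds the convictions_total column, and dplyr::rename_with changes the column names.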
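One caveat about the range rule: taking the largest number equals the number of counts only when the range starts at 1, so a piece such as "Cts. 2-7: Wire Fraud" would be repeated 7 times instead of 6. The sketch below shows one possible range-aware variant in base R. count_reps is a hypothetical helper that is not part of the answer above, and it assumes every count label sits at the start of the piece in the form "Ct." or "Cts." followed by numbers.

```r
# count_reps() is a hypothetical helper: how many counts does one piece cover?
count_reps <- function(piece) {
  # Keep only the leading label, e.g. "Cts. 7,8" or "Ct. 1-6"; later numbers are ignored.
  label <- regmatches(piece, regexpr("^[[:space:]]*Cts?\\.:?[[:space:]]*[0-9,-]+", piece))
  if (length(label) == 0) return(1)  # no label at all, e.g. "Aggravated Assault"

  # Each token is either a single number ("7") or a range ("2-7").
  tokens <- regmatches(label, gregexpr("[0-9]+(-[0-9]+)?", label))[[1]]
  sizes <- vapply(strsplit(tokens, "-", fixed = TRUE), function(b) {
    b <- as.numeric(b)
    if (length(b) == 2) b[2] - b[1] + 1 else 1  # size of the range, or one count
  }, numeric(1))

  max(1, sum(sizes))
}

sapply(c("Ct. 1-6: Human Trafficking", " Cts. 7,8 Unlawful contact",
         " Cts. 2-7: Wire Fraud", "Aggravated Assault"), count_reps)
```

Restricting the match to the leading label is a design choice: it keeps trailing numbers in the description, such as "Smuggling 50 grams", from being mistaken for counts.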
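After two days down a rabbit hole, I arrived at a tidyverse version of @LMc's code that ended up working better for me, because loading plyr was interfering with other code I had written: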
library(dplyr)
library(tidyr)
library(stringr)
test_data <-
tibble(id = 1:5,convictions = c("Ct. 1: Conspiracy to distribute","Aggravated Assault","Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture","Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams","Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money Laundering"))
test_data <- test_data %>%
mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it
test_data <- test_data %>%
separate_rows(c2,sep = ";") %>%
mutate(c2 = str_remove(c2,"Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
group_by(id) %>%
mutate(conviction_number = paste0("c_",row_number())) %>%
pivot_wider(values_from = c2,names_from = conviction_number)
test_data <- test_data %>%
mutate(c2 = convictions) #again,just preserving the original variable
test_data <- test_data %>%
separate_rows(c2,sep = ";") %>%
mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2,"((?<=\\-)\\d+)")),str_extract(c2,"\\d+"),str_extract(c2,"((?<=\\-)\\d+)")))) %>%
mutate(total_counts = ifelse(is.na(total_counts),1,total_counts)) %>%
group_by(id) %>%
slice_max(total_counts)
This produces the following data frame:
id convictions c_1 c_2 c_3 c_4 c2 total_counts
<int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Ct. 1: Conspiracy to distribute Conspiracy to dis~ NA NA NA "Ct. 1: Conspirac~ 1
2 2 Aggravated Assault Aggravated Assault NA NA NA "Aggravated Assau~ 1
3 3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~ NA NA " Ct. 2: criminal~ 2
4 4 Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct.~ Human Trafficking " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~ 36
5 5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money ~ Conspiracy " Wire Fraud" " Money Laund~ NA " Cts. 8-28: Mon~ 28
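The first chunk of code parses the counts out into separate rows and then pivots them back into c_ columns. The second chunk performs the same parsing, but then looks through each entry to parse the numbers rather than the words.
\\d+ looks for any number, but it turned out that I had data that looked like Cts. 2-7, where I wanted the value 7, not 2.
((?<=\\-)\\d+) looks for a hyphen and then parses the number that follows it. If there is no hyphen, the code falls back to \\d+.
Finally, slice_max collapses the data to 1 entry per ID based on the maximum value of total_counts.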
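The hyphen-first extraction can be tried on its own with a few sample pieces. The sketch below uses base R (regexpr with perl = TRUE) in place of stringr::str_extract so that it runs without extra packages. first_match is a hypothetical helper introduced only for this illustration, and pieces is a made-up sample.

```r
pieces <- c("Ct. 1: Conspiracy", " Cts. 2-7: Wire Fraud",
            " Cts. 8-28: Money Laundering", "Aggravated Assault")

# first_match() is a hypothetical helper: first regex match per string, NA when there is none.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

after_hyphen <- first_match(pieces, "(?<=-)\\d+")  # number right after a hyphen
any_number   <- first_match(pieces, "\\d+")        # fallback: first number in the piece

total_counts <- as.numeric(ifelse(is.na(after_hyphen), any_number, after_hyphen))
total_counts[is.na(total_counts)] <- 1             # no number at all counts as 1
total_counts
```

For these four pieces the result is 1, 7, 28 and 1, so the maximum is 28, which matches total_counts for id 5 in the table above.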