Problem description
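I wrote a function that anonymizes names in a data frame given a key. Once it has anonymized a lot of names it slows to a crawl, and I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is one tweet with 32 columns of data. The names must be anonymized wherever they appear, so I don't want to restrict the function to only a few of those 32 columns.
key is a data frame containing 211121 pairs of real and fake names; both the real and the fake names are unique within the data frame. After about 100k names have been anonymized, the function slows down considerably.
The function looks like this: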
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
  df  # return the modified data frame; a function ending in a for loop returns NULL
}
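Is there something obvious here that would cause the slowdown? I have no experience at all with optimizing code for speed.
Edit 1:
Here are a few rows of the data frame to be anonymized.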
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","non","iPhone,Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",1917,8,9,0.143476044852192,0.162056634159209,0.000172947386274259,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol",8366,392,661,"Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",1,6,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol",1184,87,70,"Halifax","Shari"
Here are a few rows of the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
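Edit 2:
I have reduced the DF to just the two columns that need to be anonymized, which made things faster, but it still fails after processing about 155k names.
As requested in the comments, here is the dput() output for the first three rows of the DF to be anonymized.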
structure(list(
utilisateur = c("___Yeliab","__courtlezz","__courtlezz"),texte = c("@EmilyIsPro ik lol","@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol","@NikkiErica21 lol yes _Ã\231։")
),row.names = c(NA,3L),class = "data.frame")
Here is the dput() of the first three rows of the key.
structure(list(
realNames = c("________","____________aho","___________ass"),fakeNames = c("Abhinav_Chang","Caleb_Dunn-Sparks","Taryn_Hunzicker")
),row.names = c(NA,3L),class = "data.frame")
Solution
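It is more efficient to work on the data as a vector rather than as a data.frame. I ran into some encoding issues, so I used iconv to convert the text to UTF-8; if the names contain non-ASCII characters, some additional handling will be needed.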
key1 <- data.frame(
realNames = c("________","____________aho","___________ass","___Yeliab","__courtlezz","NikkiErica21","EmilyIsPro","aho"),fakeNames = c("Abhinav_Chang","Caleb_Dunn-Sparks","Taryn_Hunzicker","A_A","B_B","C_C","D_D","E_E"),stringsAsFactors = FALSE
)
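pseudonymize1 <- function(df, key) {
  # Flatten the data frame to a character vector, remembering its shape
  mat <- as.matrix(df)
  dims <- attr(mat, which = "dim")
  cnam <- colnames(df)
  # Convert to UTF-8 (this assumes the input is latin1-encoded)
  vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
  # Apply each real/fake pair as a literal (non-regex) replacement
  for (name in split(key, f = seq_len(nrow(key)))) {
    vec <- gsub(
      vec, pattern = name$realNames, replacement = name$fakeNames, fixed = TRUE)
  }
  # Restore the original shape and column names
  mat <- vec
  attr(mat, which = "dim") <- dims
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  colnames(df) <- cnam
  df
}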
pseudonymize1(df1,key1)
# utilisateur texte
# 1 A_A @D_D ik lol
# 2 B_B @C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3 B_B @C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢
library(microbenchmark)
microbenchmark(
pseudonymize(df1,key1),pseudonymize1(df1,key1)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# pseudonymize(df1,key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371 100 b
# pseudonymize1(df1,key1) 287.683 306.1905 333.678 314.950 339.8705 497.301 100 a
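One concern I would have with 155k names is that, when they are searched for as regular expressions (or as fixed strings, as above), you will find names contained inside other names. This could be a real name inside another real name (for example, Emily inside EmilyIsPro), or a real name inside a fake name that was substituted earlier. You will need to test for this, and consider using random hashes instead of name-like pseudonyms.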
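The name-inside-a-name problem is easy to reproduce on toy data. The sketch below is only an illustration: `texts`, `toy_key`, `replace_all` and `make_tokens` are hypothetical names that do not come from the question. It shows the collision, one possible mitigation (replacing the longest real names first), and one possible way to generate opaque pseudonyms in place of name-like ones.

```r
# Hypothetical toy data, not taken from the question
texts <- c("@EmilyIsPro ik lol", "thanks Emily")
toy_key <- data.frame(
  realNames = c("Emily", "EmilyIsPro"),
  fakeNames = c("X_X", "Y_Y"),
  stringsAsFactors = FALSE
)

# Apply each key row in turn as a literal (fixed) replacement
replace_all <- function(x, key) {
  for (i in seq_len(nrow(key))) {
    x <- gsub(key$realNames[i], key$fakeNames[i], x, fixed = TRUE)
  }
  x
}

# Key order as given: "Emily" is replaced first, so "EmilyIsPro"
# is mangled before its own key row gets a chance to match
naive <- replace_all(texts, toy_key)
print(naive)

# Possible mitigation: process the longest real names first
ordered_key <- toy_key[order(nchar(toy_key$realNames), decreasing = TRUE), ]
ordered <- replace_all(texts, ordered_key)
print(ordered)

# Possible stand-in for name-like pseudonyms: unique opaque tokens.
# sample.int(n) is a random permutation of 1..n, so tokens never repeat.
make_tokens <- function(n) sprintf("user_%06d", sample.int(n))
tokens <- make_tokens(5)
print(tokens)
```

Ordering by length only addresses real names nested inside other real names. A real name that happens to occur inside an already substituted pseudonym can still be hit by a later pass, which is the case opaque tokens are meant to make unlikely; it would still need to be checked against the actual key.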