Problem description
Below are sample "Referer URL" values:
https://www.google.com/ | query_string=utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiocHGGw6JEiJaf5zMhRxFk-AOtixMOd_1szoBoCUEMQAvD_BwE | ip_address=123.21.62.57 | user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36
https://www.google.com/ | query_string=gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE | ip_address=187.11.116.117 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
Other URLs, with no parameters, are:
https://m.facebook.com/
instagram.com
https://l.facebook.com
/https://www.google.com/
http://m.facebook.com
I am using the code below to split the URL parameters above and create a new column for each parameter:
Mydata$ref_url<-trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,1])
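Mydata$query_string<-gsub("query_string=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,2]))
Mydata$ip_address<-gsub("ip_address=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,3]))
Mydata$user_agent<-gsub("user_agent=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,4]))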
Error: Assigned data `trimws(...)` must be compatible with existing data.
x Existing data has 2645 rows.
x Assigned data has 1096 rows.
i Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In matrix(unlist(strsplit(as.character(Mydata$"Referer URL"),"|",:
data length [4382] is not a sub-multiple or multiple of the number of rows [1096]
How can I fix this?
Solution
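It may help to first see why `matrix()` fails. The sketch below uses a small hypothetical vector `x` (shortened values, not the real data) in place of `Mydata$'Referer URL'`. Rows without a `|` split into one piece instead of four, so the flattened vector is no longer a multiple of 4. In the post, 4382 pieces came from 2645 rows, which is why `matrix(..., ncol = 4)` produced 1096 rows and a warning.

```r
# Hypothetical stand-in for Mydata$'Referer URL' (values shortened for illustration)
x <- c(
  "https://www.google.com/ | query_string=gclid=abc | ip_address=187.11.116.117 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)

# Split each value on the literal pipe and count the pieces per row
pieces <- strsplit(x, "|", fixed = TRUE)
n_pieces <- lengths(pieces)
print(n_pieces)  # 4 1 1

# unlist() flattens all pieces into one vector, so rows with a single
# piece shift later values into the wrong matrix columns
total <- length(unlist(pieces))
print(total)            # 6
print(total %% 4 == 0)  # FALSE: cannot be reshaped cleanly into 4 columns
```

Any row where `n_pieces` is not 4 needs padding with `NA` before the pieces can be placed in columns.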
If all the parameters are guaranteed to appear in the same order, the following tidyverse code gives the desired output:
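library(tidyverse)
# ref: data frame whose column V1 holds the raw "Referer URL" strings
# Split on " | " into four columns; rows with fewer pieces are padded with NA
ref %>% separate(V1, paste0("V", 2:5), sep = " \\| ", fill = "right") -> separated
# Name the parameter columns after the key preceding "=" in the first row
names(separated) <- c("url", gsub("=.+", "", unlist(separated[1, 2:4])))
# Strip the leading "key=" from every parameter column
separated %>% mutate(across(-url, ~ sub("^[^=]+=", "", .x)))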
#> url query_string ip_address user_agent
#> 1 https://www.google.com/ utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiocHGGw6JEiJaf5zMhRxFk-AOtixMOd_1szoBoCUEMQAvD_BwE 123.21.62.57 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
#> 2 https://www.Type2online.com/ null 113.193.43.211 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36
#> 3 https://www.google.com/ gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE 187.11.116.117 Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
#> 4 https://m.facebook.com/ <NA> <NA> <NA>
#> 5 instagram.com <NA> <NA> <NA>
#> 6 https://l.facebook.com <NA> <NA> <NA>
#> 7 /https://www.google.com/ <NA> <NA> <NA>
#> 8 http://m.facebook.com <NA> <NA> <NA>
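If the parameter order cannot be guaranteed, one possible alternative is to extract each field by its key rather than by position. This is a base R sketch, not part of the approach above. It assumes fields are separated by " | " and that values never contain " | " themselves. The vector `x` and the helper `extract_field` are hypothetical names introduced for illustration.

```r
# Hypothetical sample; the real data would be Mydata$'Referer URL'.
# The second row deliberately lists its parameters in a different order.
x <- c(
  "https://www.google.com/ | query_string=gclid=abc | ip_address=187.11.116.117 | user_agent=Mozilla/5.0 (Linux)",
  "https://www.Type2online.com/ | ip_address=113.193.43.211 | query_string=null | user_agent=Mozilla/5.0 (Windows NT 10.0)",
  "instagram.com"
)

# Hypothetical helper: return the text after "key=" up to the next " | "
# (or the end of the string), or NA when the key is absent
extract_field <- function(x, key) {
  pattern <- paste0("^.*\\| ", key, "=(.*?)( \\| .*)?$")
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

parsed <- data.frame(
  url          = trimws(sub("\\|.*$", "", x)),  # everything before the first pipe
  query_string = extract_field(x, "query_string"),
  ip_address   = extract_field(x, "ip_address"),
  user_agent   = extract_field(x, "user_agent"),
  stringsAsFactors = FALSE
)
print(parsed)
```

Because each column is computed from the full vector, the result always has one row per input value, so assigning the columns back to `Mydata` does not run into the row-count mismatch from the question.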