How do I keep all the data after using tidy functions in R?

Problem description

Some sample "Referer URL" values are shown below:

https://www.google.com/ | query_string=utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiocHGGw6JEiJaf5zMhRxFk-AOtixMOd_1szoBoCUEMQAvD_BwE | ip_address=123.21.62.57 | user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36
https://www.google.com/ | query_string=gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE | ip_address=187.11.116.117 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36

Other URLs with no parameters are:
https://m.facebook.com/
instagram.com
https://l.facebook.com
/https://www.google.com/
http://m.facebook.com


I am using the code below to split the URL parameters above and create a new column for each parameter:

Mydata$ref_url<-trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,1])

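Mydata$query_string<-gsub("query_string=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,2]))

Mydata$ip_address<-gsub("ip_address=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,3]))

Mydata$user_agent<-gsub("user_agent=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4,byrow = TRUE)[,4]))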

Each of these statements produces the following error:

    Error: Assigned data `trimws(...)` must be compatible with existing data.
    x Existing data has 2645 rows.
    x Assigned data has 1096 rows.
    i Only vectors of size 1 are recycled.
    Run `rlang::last_error()` to see where the error occurred.
    In addition: Warning message:
    In matrix(unlist(strsplit(as.character(Mydata$"Referer URL"),"|",:
      data length [4382] is not a sub-multiple or multiple of the number of rows [1096]

How can I fix this?

Solution
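
Before the fix, it may help to see where the row mismatch likely comes from. The sketch below uses a small hypothetical vector `urls` (a stand-in, not the asker's data) to show that strsplit() returns four pieces for a full row but only one for a bare URL, so unlist() yields fewer than four values per row and matrix(..., ncol = 4) can no longer line up with the input. The figures in the error message are consistent with this: 579 rows with four pieces plus 2066 rows with one piece give 4382 values over 2645 rows, and 4382 values in four columns make 1096 rows.

```r
# Hypothetical stand-in for Mydata$'Referer URL': one full row, two bare URLs
urls <- c(
  "https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)

# Same split as in the question
pieces <- strsplit(urls, "|", fixed = TRUE)

# Number of pieces per input row: 4 1 1
n_pieces <- lengths(pieces)

# Total number of values that matrix() receives: 6, not 3 * 4 = 12
total <- length(unlist(pieces))
```

Running `table(lengths(strsplit(as.character(Mydata$'Referer URL'), "|", fixed = TRUE)))` on the real data should show how many rows fall into each group.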

If all parameters are guaranteed to appear in the same order, the following tidyverse code gives the desired output:

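library(tidyverse)
# ref is assumed to be a data frame whose column V1 holds the raw "Referer URL" strings
# Split on " | " into four columns; fill = "right" pads rows without parameters with NA
ref %>% separate(V1, paste0("V", 2:5), sep = " \\| ", fill = "right") -> separated
# Name the parameter columns after the text before the first "=" in the first row
names(separated) <- c("url", gsub("=.+", "", separated[1, 2:4]))
# Strip the leading "name=" prefix; values without "=" (such as the URLs here) are left unchanged
separated %>% mutate_all(~ sub(".+?=", "", .))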
#>                            url                                                                                                                                          query_string     ip_address                                                                                                                    user_agent
#> 1      https://www.google.com/ utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiOcHGGw6JEiJaf5zMhRxFk-AOtiXMOd_1szoBoCUEMQAvD_BwE   123.21.62.57                                            Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
#> 2 https://www.Type2online.com/                                                                                                                                                  null 113.193.43.211           Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/85.0.4183.102 Safari/537.36
#> 3      https://www.google.com/                                                     gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE 187.11.116.117 Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
#> 4      https://m.facebook.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 5                instagram.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 6       https://l.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 7     /https://www.google.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 8        http://m.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
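
For readers who prefer to stay close to the code in the question, here is a minimal base R sketch of the same idea. It is not part of the original answer: `urls` is a hypothetical sample with a shortened user agent, and the code likewise assumes the parameters always appear in the same order. The key step is `p[1:4]`, which pads rows that have fewer than four pieces with NA so that every input row produces exactly one output row.

```r
# Hypothetical sample standing in for as.character(Mydata$'Referer URL')
urls <- c(
  "https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)

# Split on the literal " | " separator; rows without parameters yield one piece
pieces <- strsplit(urls, " | ", fixed = TRUE)

# Indexing past the end of a vector returns NA, so p[1:4] always has length 4
padded <- t(vapply(pieces, function(p) p[1:4], character(4)))

# Strip the "name=" prefixes; sub() leaves NA values as NA
result <- data.frame(
  ref_url      = padded[, 1],
  query_string = sub("^query_string=", "", padded[, 2]),
  ip_address   = sub("^ip_address=", "", padded[, 3]),
  user_agent   = sub("^user_agent=", "", padded[, 4]),
  stringsAsFactors = FALSE
)
```

Because `result` has one row per element of `urls`, its columns could be assigned back to the original data frame without triggering the row-count error.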