How to read an odd file type with multiple delimiters into R?

Problem description

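My source files come from an old testing machine that spits out "*.ctf" files. When I open a file with readLines(), I get one long character vector. Each section of the vector starts with a "[header_name]" line, and the lines inside a section (between two headers) hold 4 columns separated by tabs ("\t").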

Ideally, I would like to split each section into its own 4-column list/data frame.

Here is a sample of the vector after reading the file into R with readLines() (note that I skip from line 5 to line 21):

vector
    [1] "[HEADER]" "Created by Sigma-1 ICON Version 4.5.3; copyright 2005,GEOTAC"
    [3] "Project:\tACC#1210004 \tLoad Frame Name:\tLoad Frame" "Date:\t1/1/2002 \tTime:\t12:39:01 AM "                                      
    [5] "Boring:\tBoring2\tSample:\tSample7"
    
    ...
    
    [21] "" "[STEP 1]\t850\t0"                                                          
    [23] "Time\tExternal Load Cell\tDCDT\tPlaten Position" "1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 "
    [25] "1/1/2002 12:41:07 AM\t-3.22715023139608E-04 \t 3.45440429846349 \t 3157103 " "1/1/2002 12:41:22 AM\t-3.2964900303341E-04 \t 3.4553244755898 \t 3158611 " 

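Ideally, reading the file would produce several lists, each named after its [header] and split into 4 columns on "\t", with the first row of 4 fields serving as the column headers. For example, [STEP 1] looks like this in Excel, and a data frame like this would be perfect.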

[Image: snippet from Excel]

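I hoped something like read.table could handle this with a tab delimiter, but it throws an error because several tables with different layouts are stacked on top of each other.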
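To make the cause of that error concrete, here is a minimal sketch that counts the tab-separated fields on a few lines taken from the sample. The name `ctf_lines` is a hypothetical stand-in for the vector returned by readLines(); the point is only that the field count changes from line to line, which a single read.table call cannot accommodate without extra arguments.

```r
# A few lines copied from the sample in the question.
# `ctf_lines` is a hypothetical name for the readLines() result.
ctf_lines <- c(
  "[HEADER]",
  "Boring:\tBoring2\tSample:\tSample7",
  "[STEP 1]\t850\t0",
  "Time          \tExternal Load Cell\tDCDT\tPlaten Position"
)

# Number of tab-separated fields on each line
fields_per_line <- lengths(strsplit(ctf_lines, "\t", fixed = TRUE))
fields_per_line

# More than one distinct count means the lines do not form one rectangular table
length(unique(fields_per_line)) > 1
```

Section markers and table rows carry different numbers of fields, so the sections have to be separated before they are parsed.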

Edit in response to comments

    > dput(head(vector,40))
c("[HEADER]","Created by Sigma-1 ICON Version 4.5.3; copyright 2005,GEOTAC","Project:\tACC#1210004 \tLoad Frame Name:\tLoad Frame","Date:\t1/1/2002 \tTime:\t12:39:01 AM ","Boring:\tBoring2\tSample:\tSample7","Specimen:\tSpecimen1\tDepth (ft):\t 21 ","Diameter (inch):\t 2.5025 \tHeight (inch):\t 1.00825 ","Comments:\tTare J 217.028 paper .311 .463  wet weight 379.024g","","[SENSORS]","Name\tExternal Load Cell\tDCDT\tLoad Frame Encoder","ID\t227396\tLP183\tN/A","Module\tLoad Frame Adio\tLoad Frame Adio\tN/A","Channel\t 1 \t 2 \tN/A","Unit\tlbs\tinch\tinch","Cal. Factor\t-796107.1205 \t 3.02704684 \t 3940000 ","Excitation\t 9.98139953613281 \t 9.98139953613281 \tN/A","Zero\t 3.38862647549831E-05 \t 3.10994816131097 \tN/A","Min. Reading\t-1000 \t-0.05 \t0.0","Max. Reading\t 2000 \t 3 \t 1.5 ","","[STEP 1]\t850\t0","Time          \tExternal Load Cell\tDCDT\tPlaten Position","1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 ","1/1/2002 12:41:07 AM\t-3.22715023139608E-04 \t 3.45440429846349 \t 3157103 ","1/1/2002 12:41:22 AM\t-3.2964900303341E-04 \t 3.4553244755898 \t 3158611 ","1/1/2002 12:41:38 AM\t-3.35823094719672E-04 \t 3.45592288755324 \t 3159627 ","1/1/2002 12:41:53 AM\t-3.34113346252707E-04 \t 3.45707221846715 \t 3160244 ","1/1/2002 12:42:24 AM\t-3.25707082956796E-04 \t 3.45724794261514 \t 3160806 ","1/1/2002 12:42:54 AM\t-3.34350811317563E-04 \t 3.45749134430662 \t 3161526 ","1/1/2002 12:43:24 AM\t-3.32652936103841E-04 \t 3.4578036108669 \t 3161849 ","1/1/2002 12:43:54 AM\t-3.31216272461461E-04 \t 3.45799833222009 \t 3162033 ","1/1/2002 12:44:54 AM\t-3.2508967378817E-04 \t 3.45834978051607 \t 3162380 ","1/1/2002 12:45:54 AM\t-3.28473550962372E-04 \t 3.45827497902064 \t 3162464 ","1/1/2002 12:46:54 AM\t-3.32878527915454E-04 \t 3.4585171933868 \t 3162704 ","1/1/2002 12:47:54 AM\t-3.23534277613362E-04 \t 3.45914291383269 \t 3161933 ","1/1/2002 12:49:54 AM\t-3.38494576699304E-04 \t 3.45977932020651 \t 3162452 ","1/1/2002 12:50:56 AM\t-3.31038173662819E-04 \t 3.45979950473702 \t 3159002 ","","[STEP 2]\t1700\t0")

Solution

If your data follows the pattern [STEP1][COLUMNNAMES][DATA][STEP2][COLUMNNAMES][DATA]..., I think this will work.

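# Each data block starts at its column-name row, assumed to begin with "Time"
start <- grep('^Time', vector)
# Each block ends two lines above the next "[STEP" marker (a blank line sits in between)
end <- grep('\\[STEP', vector)[-1] - 2
# Parse every block as tab-separated text and stack the pieces into one data frame
result <- do.call(rbind, Map(function(x, y)
             read.csv(text = paste0(vector[x:y], collapse = '\n'), sep = '\t'), start, end))
result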

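The logic here is that we assume the first column name is 'Time', so each block of data starts at that line and runs until the next STEP marker is found.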
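The same start/end logic can be sketched as a variant that keeps every step in its own named data frame instead of stacking them, and that treats the end of the vector as the end of the last step (the last step has no later "[STEP" marker to measure from). This is only a sketch under assumptions: `ctf_lines` is a hypothetical name for the readLines() result, `read_step` is a hypothetical helper, the sample is abbreviated from the question, and every step is assumed to consist of a marker line, one column-name row, then data rows.

```r
# Abbreviated sample from the question; `ctf_lines` is a hypothetical name.
ctf_lines <- c(
  "[SENSORS]",
  "Name\tExternal Load Cell\tDCDT\tLoad Frame Encoder",
  "Unit\tlbs\tinch\tinch",
  "",
  "[STEP 1]\t850\t0",
  "Time          \tExternal Load Cell\tDCDT\tPlaten Position",
  "1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 ",
  "1/1/2002 12:41:07 AM\t-3.22715023139608E-04 \t 3.45440429846349 \t 3157103 ",
  "",
  "[STEP 2]\t1700\t0"
)

# Lines holding a [STEP n] marker
step_rows <- grep("^\\[STEP", ctf_lines)
# Each step runs up to the line before the next marker; the last one runs to the end
step_ends <- c(step_rows[-1] - 1L, length(ctf_lines))

# Hypothetical helper: parse the lines of one step into a data frame
read_step <- function(first, last) {
  body <- ctf_lines[first:last][-1]      # drop the marker line itself
  body <- body[nzchar(trimws(body))]     # drop blank separator lines
  if (length(body) < 2L) return(NULL)    # no column names plus data: nothing to parse
  read.delim(text = body, strip.white = TRUE, check.names = FALSE,
             stringsAsFactors = FALSE)
}

steps <- Map(read_step, step_rows, step_ends)
names(steps) <- sub("\t.*$", "", ctf_lines[step_rows])  # "[STEP 1]", "[STEP 2]", ...
steps <- Filter(Negate(is.null), steps)                 # keep only steps that hold data

str(steps)
```

Keeping the steps separate avoids rbind failing when two steps record different columns; they can still be combined afterwards with do.call(rbind, steps) if their columns match.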

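I keep running into similar problems with social media monitoring output. Here I assume rl_text is the vector of lines you pasted in the dput output. I loaded the tidyverse and splitstackshape packages.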

library(tidyverse)
library(splitstackshape)

df_raw <- 
  rl_text %>% 
    as_tibble() %>% 
    rowid_to_column(var = "line_id") %>% 
    # Split each line on tabs into the columns value_1, value_2, ...
    splitstackshape::cSplit("value", sep = "\t", direction = "wide") %>% 
    mutate_if(.predicate = is.factor, as.character) %>% 
    # A header is a first field that begins and ends with square brackets []
    mutate(is_header = grepl("^\\[.*\\]$", value_1),
           header = ifelse(is_header, value_1, NA)) %>% 
    # Carry each header down to the rows that follow it
    fill(header)

ls_headers <- unique(df_raw$header)

unnest_dfs <- function(headers, df = df_raw) {
  ## Function to get a df out of those rows under a common header.
  df_filtered <- filter(df, header == headers) %>% select(-c(line_id, is_header, header))
  df_filtered
}

list_with.dfs <- map(ls_headers, unnest_dfs)

[[1]]
# A tibble: 9 x 4
  value_1                                                    value_2                                          value_3         value_4   
  <chr>                                                      <chr>                                            <chr>           <chr>     
1 [HEADER]                                                   NA                                               NA              NA        
2 Created by Sigma-1 ICON Version 4.5.3; Copyright 2005,GE~ NA                                               NA              NA        
3 Project:                                                   ACC#1210004                                      Load Frame Nam~ Load Frame
...        
[[2]]
# A tibble: 12 x 4
   value_1      value_2              value_3          value_4           
   <chr>        <chr>                <chr>            <chr>             
 1 [SENSORS]    NA                   NA               NA                
 2 Name         External Load Cell   DCDT             Load Frame Encoder
 3 ID           227396               LP183            N/A               
 ...
[[3]]
# A tibble: 18 x 4
   value_1              value_2               value_3          value_4        
   <chr>                <chr>                 <chr>            <chr>          
 1 [STEP 1]             850                   0                NA             
 2 Time                 External Load Cell    DCDT             Platen Position
 3 1/1/2002 12:40:52 AM -2.31623424260761E-04 3.45233241577262 3150948        
 ...           
[[4]]
# A tibble: 1 x 4
  value_1  value_2 value_3 value_4
  <chr>    <chr>   <chr>   <chr>  
1 [STEP 2] 1700    0       NA
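For readers who prefer to avoid extra packages, the same idea (flag the header lines, carry each header forward, then split) can be sketched in base R. This is an illustration under assumptions rather than a drop-in replacement: `ctf_lines` is a hypothetical name for the readLines() result, the sample is abbreviated from the question, the file is assumed to start with a header line, and no row is assumed to have more than 4 fields.

```r
# Abbreviated sample from the question; `ctf_lines` is a hypothetical name.
ctf_lines <- c(
  "[HEADER]",
  "Boring:\tBoring2\tSample:\tSample7",
  "",
  "[SENSORS]",
  "Name\tExternal Load Cell\tDCDT\tLoad Frame Encoder",
  "Unit\tlbs\tinch\tinch",
  "",
  "[STEP 1]\t850\t0",
  "Time          \tExternal Load Cell\tDCDT\tPlaten Position",
  "1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 "
)

is_header  <- grepl("^\\[", ctf_lines)   # section markers start with "["
section_id <- cumsum(is_header)          # carries the latest header forward
sections   <- split(ctf_lines, section_id)
names(sections) <- sub("\t.*$", "", ctf_lines[is_header])

# Turn each section into a 4-column character data frame.
# The marker line is dropped, so values after it (e.g. 850 and 0) are not kept.
tables <- lapply(sections, function(x) {
  body <- x[-1]
  body <- body[nzchar(body)]             # drop blank separator lines
  if (!length(body)) return(NULL)        # header-only section
  read.delim(text = body, header = FALSE, fill = TRUE, quote = "",
             strip.white = TRUE, colClasses = "character",
             col.names = paste0("value_", 1:4))
})

names(tables)
tables[["[STEP 1]"]]
```

All columns are deliberately left as character, matching the output above; the first row of a step can then be promoted to column names and the rest converted with type.convert() if typed columns are needed.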