Stata: Importing a .txt file with inconsistent delimiters

Problem

I have a .txt file with rather odd delimiters. The data look like this:

|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,||
|ABC5|,|Name2,extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,||
|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|

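The trouble is that some variables have no pipes around them. The sixth variable here, for example, is just a number without pipes, and some variables lack pipes only when they are empty, like the fourth variable here, which is either |y| or ,,. Some values also contain commas, so I cannot simply split on commas. So there are basically two problems:

  1. The delimiter is a comma, but commas also appear inside string values
  2. Some variables are enclosed in pipes, some are not, and some are enclosed only when they are not empty

I am looking for a way to handle this in Stata. Does anyone know how?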

Solution

If the full dataset is messier than this example, I really don't want to know. But this seems to make some sense.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str100 whatever
"|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,||"
"|ABC5|,|Name2,extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,||"
"|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|"
end

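* Give empty fields explicit pipes so every string field ends in "|,"
gen work = whatever
replace work = subinstr(work, ",,", ",||,", .)

* Fields 1-5 are piped: peel each one off, up to and including the first "|,"
forval j = 1/5 {
    gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
    replace work = subinstr(work, work`j', "", 1)
}

* Field 6 is an unpiped number: peel off up to and including the first comma
gen work6 = substr(work, 1, strpos(work, ","))
replace work = subinstr(work, work6, "", 1)

* Fields 7-8 are piped again
forval j = 7/8 {
    gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
    replace work = subinstr(work, work`j', "", 1)
}

* Whatever remains is field 9
gen work9 = work
drop work

* Strip the pipes, then any trailing comma
forval j = 1/9 {
    replace work`j' = trim(subinstr(work`j', "|", "", .))
    replace work`j' = substr(work`j', 1, length(work`j') - 1) if substr(work`j', -1, 1) == ","
}

list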

    +-----------------------------------------------------------------------------------+
  1. |                                                                          whatever |
     |    |ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,|| |
     |-----------------------------------------------------------------------------------|
     | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
     |  ABC4  |            Name1  | NameRaw1  |      y  |    XY1  |   10000.0  |         |
     |-----------------------------------------------------------------------------------|
     |                              work8              |              work9              |
     |              FOURTH QUARTER REPORT              |                                 |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  2. |                                                                          whatever |
     |  |ABC5|,|Name2,extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,|| |
     |-----------------------------------------------------------------------------------|
     | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
     |  ABC5  | Name2,extraname  | NameRaw2  |         |    XY2  |  266539.0  |  pac    |
     |-----------------------------------------------------------------------------------|
     |                              work8              |              work9              |
     |                    MID-YEAR REPORT              |                                 |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  3. |                                                                          whatever |
     |      |ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ| |
     |-----------------------------------------------------------------------------------|
     | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
     |  ABC6  |            Name3  | NameRaw3  |      y  |   X,Y3  |   60000.0  |  name   |
     |-----------------------------------------------------------------------------------|
     |                              work8              |              work9              |
     |                    YEAR-END REPORT              |                XYZ              |
     +-----------------------------------------------------------------------------------+
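
All nine work variables come out as strings, including the amounts in work6. If those amounts are needed as numbers, something like the following might do. It is only a minimal sketch: the small dataset below is a stand-in that holds just the three work6 values from the listing, not the full result of the code above.

```stata
* Stand-in for the work6 variable produced above (string amounts)
clear
input str10 work6
"10000.0"
"266539.0"
"60000.0"
end

* Convert the string amounts to a numeric variable in place
destring work6, replace

* Sanity checks: the variable is now numeric and nothing became missing
confirm numeric variable work6
assert !missing(work6)
```

Without further options, destring leaves a variable untouched if any value contains nonnumeric characters, so the check afterwards also indicates whether the split landed on the right field in every row.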