Problem description
200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804
200416657210345 1665721 20040907 20090203 20070331 20080719
200416657210347 1665721 20040914 20091026 20070213 20080114 20090302
200416657210352 1665721 20041111 20100315 20070123 20071205 20081202
I am trying to read a .txt file using read.fwf:
gripalisti <- read.fwf(file = "gripalisti.txt",
                       widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                       header = FALSE,
                       # stringsAsFactors = FALSE,
                       col.names = c("einst", "bu", "faeding", "forgun",
                                     "burdur1", "burdur2", "burdur3", "burdur4"))
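This works, and the column widths are correct. However, "einst" and "bu" should be integer values, and the rest should be dates.
After the import, all values in the first column (the ID variable) look like this: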
2.003140e+14
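I have been looking for a way to change the imported columns to integer (or character?) values, but I have not found anything that does not result in an error. One example that I tried after googling:
gripalisti <- read.fwf(file = "gripalisti.txt",
                       widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                       header = FALSE,
                       # stringsAsFactors = FALSE,
                       col.names = c("einst", "bu", "faeding", "forgun",
                                     "burdur1", "burdur2", "burdur3", "burdur4"),
                       colclasses = c("integer", "integer", "Date", "Date",
                                      "Date", "Date", "Date", "Date"))
This results in the error:
Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names,  :
  unused argument (colclasses = c("integer", "integer", "Date", "Date", "Date", "Date", "Date", "Date"))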
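The dataset has more than 100,000 rows and many missing values, so other ways of importing it have not worked for me. The dataset is not tab-delimited.
Edit:
Thanks for the help, I changed it to: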
colClasses = c("character",
It looks fine now.
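Solution
As suggested in the comments:
- it is colClasses=, not colclasses= (a typo; R argument names are case-sensitive);
- the first field cannot be stored as "integer", because its 15-digit values exceed R's 32-bit integer maximum of 2147483647; it must be "numeric" or "character";
- (additionally) these dates are not in the default %Y-%m-%d format, so you need to convert them after reading the data in.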
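The second point can be checked directly in the console. This is only a small sketch using the first ID from the sample data; the variable names `id`, `id_int` and `id_num` are made up for the illustration.

```r
# Largest value R's 32-bit "integer" type can hold.
.Machine$integer.max

id <- "200416657210340"   # first ID from the sample data

# Too large for "integer": the conversion yields NA (with a warning,
# suppressed here to keep the output clean).
id_int <- suppressWarnings(as.integer(id))
is.na(id_int)

# "numeric" (double) represents whole numbers exactly up to 2^53,
# so a 15-digit ID survives the conversion.
id_num <- as.numeric(id)
id_num < 2^53
format(id_num, scientific = FALSE)
```

Keeping the ID as "character" avoids the question entirely, which is usually the safer choice for identifiers that are never used in arithmetic.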
Setup:
writeLines("200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804\n200416657210345 1665721 20040907 20090203 20070331 20080719 \n200416657210347 1665721 20040914 20091026 20070213 20080114 20090302 \n200416657210352 1665721 20041111 20100315 20070123 20071205 20081202",con = "gripalisti.txt")
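Run:
dat <- read.fwf("gripalisti.txt",
                widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                header = FALSE,
                col.names = c("einst", "bu", "faeding", "forgun",
                              "burdur1", "burdur2", "burdur3", "burdur4"),
                # ID as character, "bu" as integer, dates as character for now
                colClasses = c("character", "integer", "character", "character",
                               "character", "character", "character", "character"))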
str(dat)
# 'data.frame': 4 obs. of 8 variables:
# $ einst : chr "200416657210340" "200416657210345" "200416657210347" "200416657210352"
# $ bu : int 1665721 1665721 1665721 1665721
# $ faeding: chr " 20040608" " 20040907" " 20040914" " 20041111"
# $ forgun : chr " 20090930" " 20090203" " 20091026" " 20100315"
# $ burdur1: chr " 20060910" " 20070331" " 20070213" " 20070123"
# $ burdur2: chr " 20070910" " 20080719" " 20080114" " 20071205"
# $ burdur3: chr " 20080827" " " " 20090302" " "
# $ burdur4: chr " 20090804" " " " " " 20081202"
dat[,3:8] <- lapply(dat[,3:8],as.Date,format = "%Y%m%d")
dat
# einst bu faeding forgun burdur1 burdur2 burdur3 burdur4
# 1 200416657210340 1665721 2004-06-08 2009-09-30 2006-09-10 2007-09-10 2008-08-27 2009-08-04
# 2 200416657210345 1665721 2004-09-07 2009-02-03 2007-03-31 2008-07-19 <NA> <NA>
# 3 200416657210347 1665721 2004-09-14 2009-10-26 2007-02-13 2008-01-14 2009-03-02 <NA>
# 4 200416657210352 1665721 2004-11-11 2010-03-15 2007-01-23 2007-12-05 <NA> 2008-12-02
str(dat)
# 'data.frame': 4 obs. of 8 variables:
# $ einst : chr "200416657210340" "200416657210345" "200416657210347" "200416657210352"
# $ bu : int 1665721 1665721 1665721 1665721
# $ faeding: Date,format: "2004-06-08" "2004-09-07" "2004-09-14" "2004-11-11"
# $ forgun : Date,format: "2009-09-30" "2009-02-03" "2009-10-26" "2010-03-15"
# $ burdur1: Date,format: "2006-09-10" "2007-03-31" "2007-02-13" "2007-01-23"
# $ burdur2: Date,format: "2007-09-10" "2008-07-19" "2008-01-14" "2007-12-05"
# $ burdur3: Date,format: "2008-08-27" NA "2009-03-02" NA
# $ burdur4: Date,format: "2009-08-04" NA NA "2008-12-02"
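The date fields above still carry the padding from the fixed-width layout, and the missing ones are blank. If you would rather not rely on as.Date tolerating that padding, one option is to trim the strings first. The sketch below uses a made-up vector `raw` in place of a real column, so treat it as an illustration rather than part of the answer's code.

```r
# Hypothetical stand-in for one date column as read with colClasses = "character":
# a padded value, a blank (missing) field, and an empty string.
raw <- c(" 20040608", "         ", "")

# Strip the padding, then parse; strings that do not match the format become NA.
parsed <- as.Date(trimws(raw), format = "%Y%m%d")
parsed
class(parsed)
```

The same call can be applied to all date columns at once with lapply, in the same way as in the answer.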
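The numbers in the first column are very large, so if the column is imported as a number R displays it in scientific (exponential) notation by default. One way around this is to set scipen before reading the file. Use the following code:
options(scipen = 999)
I think this should solve your problem.
That is the code I ran; you will of course still need to work on the date columns. For that you can use a simple command such as as.Date(gripalisti$burdur1, format = "%Y%m%d")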
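To see what scipen changes, here is a small sketch on a single value. It only affects how the number is printed, not how it is stored; the names `x`, `default_print` and `fixed_print` are made up for the illustration, and the defaults are set explicitly so the result does not depend on your session.

```r
# Remember the current settings and start from R's defaults.
old <- options(scipen = 0, digits = 7)

x <- 200416657210340           # first ID from the sample data, as a number

default_print <- format(x)     # scientific notation under the defaults
options(scipen = 999)          # strongly penalise scientific notation
fixed_print <- format(x)       # now printed in full

options(old)                   # restore the previous settings

default_print
fixed_print
```

Note that scipen is a session-wide display option, so reading the ID column as "character" remains the more targeted fix.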