R: read.fwf reads integers as numeric

Problem description

I have a .txt file and am using RStudio:

200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804
200416657210345 1665721 20040907 20090203 20070331 20080719                  
200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         
200416657210352 1665721 20041111 20100315 20070123 20071205          20081202

I am trying to read the .txt file with read.fwf:

gripalisti <- read.fwf(file = "gripalisti.txt",
                       widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                       header = FALSE,
                       #stringsAsFactors = FALSE,
                       col.names = c("einst", "bu", "faeding", "forgun",
                                     "burdur1", "burdur2", "burdur3", "burdur4"))

This works, and the column widths are correct. However, "einst" and "bu" should be integer values, and the rest should be dates.

After the import, all values in the first column (the ID variable) look like this:

2.003140e+14

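I have been looking for a way to change the imported columns to integer (or character?) values, but I have not found anything that does not result in an error. Here is one example that I tried after googling: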

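gripalisti <- read.fwf(file = "gripalisti.txt",
                       widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                       header = FALSE,
                       col.names = c("einst", "bu", "faeding", "forgun",
                                     "burdur1", "burdur2", "burdur3", "burdur4"),
                       colclasses = c("integer", "integer", "Date", "Date",
                                      "Date", "Date", "Date", "Date"))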

This results in the error:

Error in read.table(file = FILE,header = header,sep = sep,row.names = row.names,: 
  unused argument (colclasses = c("integer","Date"))

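The dataset has many missing values and more than 100,000 rows, so other ways of importing it have not worked for me. The dataset is not tab-delimited.

Sorry if this is obvious, I am a very new R user.

Edit:

Thanks for the help, I changed it to: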

 colClasses = c("character",

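Now it looks good.

Solution

As suggested in the comments:

  1. The argument is colClasses=, not colclasses=; that was a typo, and R argument names are case-sensitive.
  2. The first field cannot be stored as "integer", because a 15-digit ID is larger than the biggest value a 32-bit R integer can hold (2147483647). It must be "numeric" or "character".
  3. (Also) These dates are not in the default %Y-%m-%d format, so you need to convert them after reading in the data.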
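To illustrate point 2, here is a minimal sketch using one ID from the sample data. The variable names (`id`, `id_int`, `id_num`) are made up for this example and are not part of the original question.

```r
# R's "integer" type is a 32-bit signed integer.
.Machine$integer.max                     # 2147483647

id <- "200416657210340"                  # one ID from the sample data

# Coercing to integer overflows: R returns NA (with a warning, suppressed here).
id_int <- suppressWarnings(as.integer(id))
is.na(id_int)

# A double ("numeric") stores whole numbers exactly up to 2^53,
# so a 15-digit ID survives, although it prints in scientific notation by default.
id_num <- as.numeric(id)
id_num < 2^53
format(id_num, scientific = FALSE)

# "character" keeps the ID verbatim, which is usually what you want
# for an identifier that is never used in arithmetic.
```

Since an ID is a label rather than a quantity, reading it as "character" is probably the safer of the two options.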

Setup:

writeLines("200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804\n200416657210345 1665721 20040907 20090203 20070331 20080719                  \n200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         \n200416657210352 1665721 20041111 20100315 20070123 20071205          20081202",con = "gripalisti.txt")

Run:

dat <- read.fwf("gripalisti.txt",
                widths = c(15, 8, 9, 9, 9, 9, 9, 9),
                header = FALSE,
                col.names = c("einst", "bu", "faeding", "forgun",
                              "burdur1", "burdur2", "burdur3", "burdur4"),
                # ID as character, bu as integer, the six date fields as character for now
                colClasses = c("character", "integer", rep("character", 6)))
str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: chr  " 20040608" " 20040907" " 20040914" " 20041111"
#  $ forgun : chr  " 20090930" " 20090203" " 20091026" " 20100315"
#  $ burdur1: chr  " 20060910" " 20070331" " 20070213" " 20070123"
#  $ burdur2: chr  " 20070910" " 20080719" " 20080114" " 20071205"
#  $ burdur3: chr  " 20080827" "         " " 20090302" "         "
#  $ burdur4: chr  " 20090804" "         " "         " " 20081202"

dat[,3:8] <- lapply(dat[,3:8],as.Date,format = "%Y%m%d")
dat
#             einst      bu    faeding     forgun    burdur1    burdur2    burdur3    burdur4
# 1 200416657210340 1665721 2004-06-08 2009-09-30 2006-09-10 2007-09-10 2008-08-27 2009-08-04
# 2 200416657210345 1665721 2004-09-07 2009-02-03 2007-03-31 2008-07-19       <NA>       <NA>
# 3 200416657210347 1665721 2004-09-14 2009-10-26 2007-02-13 2008-01-14 2009-03-02       <NA>
# 4 200416657210352 1665721 2004-11-11 2010-03-15 2007-01-23 2007-12-05       <NA> 2008-12-02

str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: Date,format: "2004-06-08" "2004-09-07" "2004-09-14" "2004-11-11"
#  $ forgun : Date,format: "2009-09-30" "2009-02-03" "2009-10-26" "2010-03-15"
#  $ burdur1: Date,format: "2006-09-10" "2007-03-31" "2007-02-13" "2007-01-23"
#  $ burdur2: Date,format: "2007-09-10" "2008-07-19" "2008-01-14" "2007-12-05"
#  $ burdur3: Date,format: "2008-08-27" NA "2009-03-02" NA
#  $ burdur4: Date,format: "2009-08-04" NA NA "2008-12-02"

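The numbers in the first column are very large, so when they are imported as numeric, R displays them in scientific notation by default. One way around this is to set scipen before reading the file, using the following code:

options(scipen = 999)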


I think this should solve your problem.
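As a rough illustration of what scipen changes, the sketch below formats one ID from the sample data under the default settings and again with the penalty raised. The variable names are invented for this example. Note that scipen only affects how the number is printed, not how it is stored.

```r
x <- 200416657210340                     # one ID from the sample data, as a double

# Pin the defaults so the example does not depend on the session's settings.
old <- options(scipen = 0, digits = 7)
sci <- format(x)                         # scientific notation under the defaults

# A large scipen penalises scientific notation, so fixed notation wins.
options(scipen = 999)
fixed <- format(x)

# Restore the previous settings.
options(old)

# The same effect for a single call, without touching global options.
per_call <- format(x, scientific = FALSE)

sci
fixed
per_call
```

Setting the option globally is convenient in an interactive session; the per-call form may be preferable inside scripts that should not change the user's settings.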

The date columns will still need some work, of course. For that, you can use a simple command such as as.Date(gripalisti$burdur1, format = "%Y%m%d").
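A small sketch of that conversion on a few values shaped like the sample data, including a blank field. The vector `raw` is made up for illustration, and the `trimws()` call is an extra precaution added here to strip the padding that fixed-width fields carry; it is not part of the original answer.

```r
# Values as they might come out of a fixed-width date column:
# padded with a leading space, and all blanks where the date is missing.
raw <- c(" 20060910", " 20070331", "         ")

# Strip the padding, then parse with the explicit yyyymmdd format.
dates <- as.Date(trimws(raw), format = "%Y%m%d")

dates
class(dates)
```

Blank fields cannot be parsed and come back as NA, which matches how missing dates appear in the converted data frame shown earlier.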
