How do I handle single apostrophes/quotes in read.table?

Problem description

I have the following data frame:

1            1                                        What percent of the world\xd5s population is between 15 and 64 years old?
2            2                                               What percent of the world\xd5s airports are in the United States? 
3            3                                            The area of the USA is what percent of the area of the Pacific Ocean?
4            4                                                      What percent of the earth\xd5s surface is covered by water?
5            5 What percent of the goods exported worldwide are mineral fuels (including oil,coal,gas,and refined products)?
6            6                    What percent of the world\xd5s countries have a higher fertility rate than the United States?
7            7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
8            8                                    What percent of the worldwide income does the richest 10% of households earn?
9            9      What percent of the worldwide gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)?
10          10                                      What percent of the worldwide labor force works in the agricultural sector?
11          11                                             What percent of the worldwide land mass is not used for agriculture?
12          12                           What percent of the world\xd5s population speaks Mandarin Chinese as a first language?
13          13                What percentage of the world\xd5s countries have a higher life expectancy than the United States?
14          14                             What percent of the world\xd5s population aged 15 years or older can read and write?
15          15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
16          16                                                    Saudi Arabia consumes what percentage of the oil it produces?
17          17                   What percent of the world\xd5s population lives in either China,India,or the European Union?
18          18                                                          What percent of the world\xd5s population is Christian?
19          19                                                               What percent of the world\xd5s roads are in India?
20          20                         What percent of the world\xd5s telephone lines are in China,USA,or the European Union?

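Each question should have an apostrophe in possessive words such as world's and earth's, and as you can see, they are not being read in the way I want. I tried expressions like DF <- read.table("mydata.csv",header=TRUE,sep="\t",quote="") to no avail. Surprisingly, it has been very hard to find an answer to this.

Solution

If you can't fix this by choosing a better way to read the file, you can patch it up afterwards with a regular expression; for example: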

x <- "What percent of the world\xd5s population"
gsub("\\\xd5","'",x)
[1] "What percent of the world's population"

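You appear to have some other unfortunate quote conversions as well; these can be handled with alternation using | (interestingly, though, not with regex shorthands such as \\d for digits):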

x <- c("What percent of the world\xd5s population","gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)")
gsub("\\\xd5|\\\xd2|\\\xd3","'",x)  # the replacement argument was missing
[1] "What percent of the world's population"                                
[2] "gross domestic product (GDP) is re-invested ('gross fixed investment')"
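
On recent versions of R running in a UTF-8 locale, the regex functions may refuse strings that contain bytes which are not valid UTF-8. The following is only a sketch of a byte-wise variant of the replacement above; the helper name fix_quotes is hypothetical, and it assumes the strings really contain the raw bytes 0xD5, 0xD2 and 0xD3.

```r
# Hypothetical helper: replace the stray quote bytes with ASCII quotes.
# fixed = TRUE treats the pattern as a literal string, and useBytes = TRUE
# makes the matching byte-wise, so the input does not need to be valid UTF-8.
fix_quotes <- function(x) {
  x <- gsub("\xd5", "'",  x, fixed = TRUE, useBytes = TRUE)  # apostrophe
  x <- gsub("\xd2", "\"", x, fixed = TRUE, useBytes = TRUE)  # opening double quote
  x <- gsub("\xd3", "\"", x, fixed = TRUE, useBytes = TRUE)  # closing double quote
  x
}

x <- c("What percent of the world\xd5s population",
       "re-invested (\xd2gross fixed investment\xd3)")
fixed_x <- fix_quotes(x)
print(fixed_x)
```

Unlike the one-line alternation, this maps the double-quote bytes to " rather than to ', which may or may not be what you want.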

Alternatively, you can read the table with readLines and exploit the fact that the first two columns together always occupy 14 characters.

r <- trimws(readLines(file("mydata.csv")))

res <- data.frame(do.call(rbind,strsplit(substring(r,1,14),"\\s+")),X3=trimws(substring(r,15,nchar(r))))

Then clean it up.

within(res,{
  X1 <- as.numeric(X1)
  X2 <- as.numeric(X2)
  X3 <- gsub("\\\\xd5","'",X3)
  X3 <- gsub("\\\\xd2|\\\\xd3",'"',X3)
})
#    X1 X2                                                                                                               X3
# 1   1  1                                           What percent of the world's population is between 15 and 64 years old?
# 2   2  2                                                   What percent of the world's airports are in the United States?
# 3   3  3                                            The area of the USA is what percent of the area of the Pacific Ocean?
# 4   4  4                                                         What percent of the earth's surface is covered by water?
# 5   5  5 What percent of the goods exported worldwide are mineral fuels (including oil,coal,gas,and refined products)?
# 6   6  6                       What percent of the world's countries have a higher fertility rate than the United States?
# 7   7  7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
# 8   8  8                                    What percent of the worldwide income does the richest 10% of households earn?
# 9   9  9            What percent of the worldwide gross domestic product (GDP) is re-invested ("gross fixed investment")?
# 10 10 10                                      What percent of the worldwide labor force works in the agricultural sector?
# 11 11 11                                             What percent of the worldwide land mass is not used for agriculture?
# 12 12 12                              What percent of the world's population speaks Mandarin Chinese as a first language?
# 13 13 13                   What percentage of the world's countries have a higher life expectancy than the United States?
# 14 14 14                                What percent of the world's population aged 15 years or older can read and write?
# 15 15 15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
# 16 16 16                                                    Saudi Arabia consumes what percentage of the oil it produces?
# 17 17 17                      What percent of the world's population lives in either China,India,or the European Union?
# 18 18 18                                                             What percent of the world's population is Christian?
# 19 19 19                                                                  What percent of the world's roads are in India?
# 20 20 20                            What percent of the world's telephone lines are in China,USA,or the European Union?

The string

What percent of the world\xd5s population is between 15 and 64 years old?

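is most likely the result of reading a text file that contains non-ASCII characters. Here the sequence \xd5 stands for a single byte, the curly single quote (apostrophe) in whatever encoding the file uses, and not for the four characters \ x d 5. In Mac OS Roman, for instance, 0xD5 is the right single quotation mark. Likewise, \xd2 and \xd3 stand for the left and right double quotation marks respectively. So your file has been read correctly; it is just not being printed the way you expect.

To convert \xd5 into a regular ASCII quote: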

gsub("\xd5","'",x)  # no extra backslashes needed

Similarly, to convert \xd2 and \xd3 into ASCII double quotes:

gsub("\xd2|\xd3","\"",x)

(This assumes you read the data with read.table(*,stringsAsFactors=FALSE), so that the column is character rather than factor.)
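
If the file's encoding is known, another option is to convert the text instead of patching individual bytes. The sketch below assumes the file is Mac OS Roman, which is a guess based on the byte values and not something the question confirms. The encoding name "macintosh" is platform dependent (check iconvlist()), and iconv returns NA where the conversion is not supported.

```r
# The bytes as they might arrive from the file.
x <- "world\xd5s (\xd2gross\xd3)"

# Assumption: the source encoding is Mac OS Roman, called "macintosh" here.
y <- iconv(x, from = "macintosh", to = "UTF-8")

# Optionally fold typographic quotes down to plain ASCII quotes.
y <- gsub("[\u2018\u2019]", "'", y)
y <- gsub("[\u201c\u201d]", "\"", y)
print(y)
```

Under the same assumption, passing fileEncoding = "macintosh" to read.table may let R do the conversion while reading, so that no cleanup is needed afterwards.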
