Why does coercing the factor from the factor documentation to numeric return NAs?

Question

The documentation for factor gives this code as its first example of constructing a factor:

(ff <- factor(substring("statistics",1:10,1:10),levels = letters))

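The same documentation gives this advice:

To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

But when I try both of these on their own example, I get nonsense: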

> (ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> as.numeric(levels(ff))[ff]
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion 
> as.numeric(as.character(ff))
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion

What am I misunderstanding? I don't see anything unusual about the factor ff. It certainly has underlying numbers:

> as.integer(ff)
 [1] 19 20  1 20  9 19 20  9  3 19

Its levels are character strings, but I don't find that odd either, since the levels of a factor are always character strings.

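Answer

Once you have created ff, run table(ff). It shows the frequency of every letter, including the letters that never occur, whose frequency is 0.

levels(ff) returns all of these letters as character strings, so as.numeric(levels(ff)) always returns NA, because a letter cannot be parsed as a number. The same is true of as.numeric(as.character(ff)).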
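As a minimal sketch of the point above, the following rebuilds ff from the documentation example and checks that none of its levels parses as a number. The variable names lv_num and codes are introduced here only for illustration.

```r
# Rebuild the factor from the documentation example.
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# Every level is a letter, so as.numeric() yields NA for all 26 of them.
# suppressWarnings() only hides the "NAs introduced by coercion" warning.
lv_num <- suppressWarnings(as.numeric(levels(ff)))
all(is.na(lv_num))

# The integer codes are positions in levels(ff), not "original numeric values".
codes <- as.integer(ff)
identical(levels(ff)[codes], as.character(ff))
```

Both checks should come out TRUE: the codes only say which level each element points to, and the levels themselves are not numbers.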

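My guess is that you may be confusing labels with levels. If you run labels(ff), you get the numbers 1 to 10 as quoted strings. These are just the positions of the elements, not their values. If you convert them with as.numeric(), you get 10 numbers back. Run: as.numeric(labels(ff))

I hope this clears up the confusion. If not, let me know.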
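A short sketch of the difference, assuming ff is built exactly as in the question (the name pos is illustrative only):

```r
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# ff has no names, so labels() falls back to the element positions as strings.
pos <- as.numeric(labels(ff))
identical(pos, as.numeric(seq_along(ff)))

# The integer codes are something else: the index of each element's level.
as.integer(ff)
```

So as.numeric(labels(ff)) does return numbers, but they carry no information about the letters themselves.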

Output:

R>table(ff)
ff
a b c d e f g h i j k l m n o p q r 
1 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 
s t u v w x y z 
3 3 0 0 0 0 0 0 

R>levels(ff)
 [1] "a" "b" "c" "d" "e" "f" "g" "h"
 [9] "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x"
[25] "y" "z"

R>labels(ff)
 [1] "1"  "2"  "3"  "4"  "5"  "6" 
 [7] "7"  "8"  "9"  "10"

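Edit:

OK, it seems the OP is having trouble with this passage in the documentation:

The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

What this says is that if you have a factor that was originally numeric, you should not convert it to numeric directly. For example: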

nums <- c(1,2,3,10)
new_fact <- as.factor(nums)

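Now, if we try to get the numbers back from new_fact by running as.numeric(new_fact), we get 1 2 3 4, which is wrong: those are the internal integer codes, not the original values. So all the documentation is saying is that to recover the original numbers you must use as.numeric(as.character(new_fact)) or as.numeric(levels(new_fact))[new_fact], both of which return 1 2 3 10. The advice only applies when the levels are themselves numbers written as strings, which is not the case for ff. I hope this helps.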
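The three conversions can be compared side by side. This is a small sketch that reuses nums and new_fact from the example; the names direct, via_chr and via_lev are added here for illustration.

```r
nums <- c(1, 2, 3, 10)
new_fact <- as.factor(nums)

# Direct coercion returns the internal codes, not the original values.
direct <- as.numeric(new_fact)                      # 1 2 3 4

# Going through the level strings recovers the original values.
via_chr <- as.numeric(as.character(new_fact))       # 1 2 3 10
via_lev <- as.numeric(levels(new_fact))[new_fact]   # 1 2 3 10

identical(via_chr, nums) && identical(via_lev, nums)
```

The levels(f) form converts each distinct level once and then indexes by the codes, which is why the documentation calls it slightly more efficient.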
