R中具有虚拟因变量和分类自变量的线性回归模型

问题描述

我的数据是虚拟变量（1 = 如果披露，0 = 未披露）作为因变量和分类变量（五种类型的部门）作为自变量。

有了这些数据，可以使用线性回归模型吗？

我的目标是确定哪些部门披露或不披露。

那么好用吗？例如：

summary(lm(disclosed ~ 0 + Sectors,data = df_0))

我在模型中加入“0 +”，这样它也返回第一个扇区，消除截取。如果我不添加它，我不知道为什么第一个扇区不返回给我。我很失落。谢谢！

如果我使用二项逻辑回归，则不会解释我使用它所指示的估计符号获得的显着性值。

Call:
glm(formula = disclosed ~ 0 + Sectors,family = binomial(link = "logit"),data = df_0)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.96954  -0.32029  -0.00005  -0.00005   2.48638  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
SectorsCOMMUNICATION      -0.5108     0.5164  -0.989  0.32256    
SectorsCONSIMERSTAPLES   -20.5661  6268.6324  -0.003  0.99738    
SectorsCONSUMERdisCRET    -3.0445     1.0235  -2.975  0.00293 ** 
SectorsENERGY            -20.5661  3780.1276  -0.005  0.99566    
SectorsFINANCIALS         -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE        -20.5661  5345.9077  -0.004  0.99693    
SectorsINDUSTRIALS       -20.5661  2803.4176  -0.007  0.99415    
SectorsINDUSTRIALS       -20.5661 17730.3699  -0.001  0.99907    
SectorsinformatION        -1.0986     0.8165  -1.346  0.17846    
SectorsMATERIALS         -20.5661  3780.1276  -0.005  0.99566    
SectorsREALESTATE        -20.5661  8865.1850  -0.002  0.99815    
SectorsUTILITIES         -20.5661  7238.3932  -0.003  0.99773    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.259  on 200  degrees of freedom
Residual deviance:  54.185  on 188  degrees of freedom
AIC: 78.185

Number of Fisher Scoring iterations: 19

这意味着金融和非必需消费品行业的披露最少，对吗？

另一方面，如果我应用 lm，它会返回更一致的结果。传播最多的部门是信息和通信。它们是显着的正估计值

Call:
lm(formula = disclosed ~ 0 + Sectors,data = df_0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3750 -0.0500  0.0000  0.0000  0.9546 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000    
SectorsCONSUMERdisCRET 4.545e-02  4.427e-02   1.027 0.305815    
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000    
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426    
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000    
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000    
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000    
SectorsinformatION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000    
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000    
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared:  0.2632,Adjusted R-squared:  0.2162 
F-statistic: 5.597 on 12 and 188 DF,p-value: 3.568e-08

解决方法

对于这个特定问题最好使用逻辑回归。

关于线性回归输出，对于分类输入（自变量），lm 将按字母顺序排列的第一类/类别作为 intercept 中显示的基类，并返回其他类的相对结果。

在示例中，类别 A 将被拦截，我们将获得其他类与类 A 的相对结果

例如

set.seed(100)

a <- sample(c(1,0),100,replace = TRUE)
b <- sample(c('A','B','C','D','E'),replace = TRUE)

lm(a ~ b)
Call:
lm(formula = a ~ b)

Coefficients:
(Intercept)           bB           bC           bD           bE  
   0.562500    -0.183190     0.104167    -0.107955    -0.006944

与

相同

Call:
lm(formula = a ~ 0 + b)

Coefficients:
    bA      bB      bC      bD      bE  
0.5625  0.3793  0.6667  0.4545  0.5556

c <- broom::tidy(lm(a ~ 0 + b))
c$estimate
[1] 0.5625000 0.3793103 0.6666667 0.4545455 0.5555556

d <- broom::tidy(lm(a ~ b))
d$estimate
[1]  0.562500000 -0.183189655  0.104166667 -0.107954545 -0.006944444

d$estimate[2:5] + d$estimate[1]
[1] 0.3793103 0.6666667 0.4545455 0.5555556

correlation dummy-variable linear-regression r r statistics