计算因素之间相关性的正确方法

问题描述

在python中计算FA模型因素之间相关性的正确方法是什么？我在 FA 上尝试了这种方法（在我的数据帧对象的因子分数上计算它），没有旋转或使用 varimax（想要计算斜旋转提取的因子之间的相关性，但首先决定检查正交旋转的计算方法），用于 sklearn.decomposition .FactorAnalysis 这显示了正确的结果（几乎是对角矩阵）但对于 factor_analyzer.FactorAnalyzer 矩阵与对角线有很大不同：

- 原因是什么？不要认为 factor_analyzer.FactorAnalyzer 有错误，那么计算相关性的正确方法是什么（以及为什么我应用的方法显示 sklearn.decomposition.FactorAnalysis 的正确结果）？

def get_correlations_between_factors(dataframe,fa_model):
    import pandas as pd

    fa_model.fit(dataframe)
    result = fa_model.transform(dataframe)
    return pd.DataFrame(result).corr()

解决方法

对于正交旋转（或根本没有旋转），FactorAnalyzer 不提供因子相关矩阵，因为它只是一个单位矩阵。通过倾斜旋转，您可以使用 phi_ 属性获取因子相关矩阵。

FactorAnalyzer 包的工作方式与 R 的 psych 包相同。例如，

library(tidyverse)
library(psych)

data(bfi)
bfi_subset = bfi %>% select(matches('^A[1-5]|^N'))
bfi_subset = bfi_subset %>% 
   mutate_all(~ifelse(is.na(.),median(.,na.rm = TRUE),.))

model = fa(bfi_subset,4,rotate = 'varimax')
print(model$Phi)
print(round(cor(model$scores),3))

结果如下：

NULL
       MR2    MR1    MR3    MR4
MR2  1.000 -0.044 -0.004 -0.283
MR1 -0.044  1.000  0.297 -0.014
MR3 -0.004  0.297  1.000  0.116
MR4 -0.283 -0.014  0.116  1.000

获取因子分数的相关性不一定会像您预期的那样工作。例如，

round(cor(model1$scores),3)

输出以下内容：

    MR2 MR1 MR3 MR4
MR2 1.000   -0.044  -0.004  -0.283
MR1 -0.044  1.000   0.297   -0.014
MR3 -0.004  0.297   1.000   0.116
MR4 -0.283  -0.014  0.116   1.000

如果您想向自己证明这些因素确实是正交的（不相关），您可以在 FactorAnalyzer 中执行以下操作：

fa = FactorAnalyzer(n_factors=4,rotation='varimax')
fa.fit(df_sub)
th = fa.rotation_matrix_
print(pd.DataFrame(th.T.dot(th)).round(2))

输出以下内容：

     0    1    2    3
0  1.0  0.0 -0.0 -0.0
1  0.0  1.0  0.0 -0.0
2 -0.0  0.0  1.0  0.0
3 -0.0 -0.0  0.0  1.0

这里要补充一点：当您使用 transform() 方法时，您正在计算因子分数。有不同的方法可以做到这一点。 FactorAnalyzer 实现了 Thurstone 方法，但还有其他方法可以保留底层相关性，例如“十 Berge”方法。

如果这是您希望我们在 FactorAnalyzer 中实施的内容，请随时提出问题。这是一个粗略的实现：

import warnings
import numpy as np
from sklearn.preprocessing import scale


def ten_berge(X,loadings,phi=None):
    """
    Estimate factor scores using the "ten Berge" method.

    Parameters
    ----------
    X : array-like
        The data set
    loadings : array-like
        The loadings matrix

    Reference
    ----------
    https://www.sciencedirect.com/science/article/pii/S0024379597100076
    """
    # get the number of factors from the loadings
    n_factors = loadings.shape[1]
    corr = np.corrcoef(X,rowvar=False)
    # if `phi` is None,create a diagonal matrix
    phi = np.diag(np.ones(n_factors)) if phi is None else phi
    # calculate intermediate metrics
    load = loadings.dot(matrix_sqrt(phi))
    corr_inv = inv_matrix_sqrt(corr)
    temp = corr_inv.dot(load)\
                   .dot(inv_matrix_sqrt(load.T.dot(np.linalg.inv(corr))
                                              .dot(load)))
    # calcualte weights
    weights = corr_inv.dot(temp)\
                      .dot(matrix_sqrt(phi))
    # calculate scores,given weights
    scores = scale(X).dot(weights)
    return scores


def matrix_sqrt(x):
    """
    Compute the square root of the eigen values (eVal),and then take $eVec * diag(eVals^0.5) * eVec^T$
    """
    evals,evecs = np.linalg.eig(x)
    evals[evals < 0] = np.finfo(float).eps
    sqrt_evals = np.sqrt(evals)
    return evecs.dot(np.diag(sqrt_evals)).dot(evecs.T)


def inv_matrix_sqrt(x):
    """
    Compute the inverse square root of the eigen values (eVal),and then take $eVec * diag(1 / eVals^0.5) * eVec^T$
    """
    evals,evecs = np.linalg.eig(x)
    if np.iscomplex(evals).any():
        warnings.warn('Complex eigen values detected; results are suspect.')
        return x
    evals[evals < np.finfo(float).eps] = 100 * np.finfo(float).eps
    inv_sqrt_evals =  1 / np.sqrt(evals)
    return evecs.dot(np.diag(inv_sqrt_evals)).dot(evecs.T)

df = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/psych/bfi.csv')
df = df.filter(regex='^A[1-5]|^N').copy()
df = df.fillna(df.median(0))

fa = FactorAnalyzer(n_factors=5,rotation=None).fit(df)
pd.DataFrame(ten_berge(df,fa.loadings_)).corr().round(3)

factor-analysis python scikit-learn statistics