How can I add coefficients, p-values, and the associated variable names to MLflow?

Problem description

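I am running a linear regression model and would like to add the coefficient and p-value of each variable, together with the variable name, to the metrics that MLflow records. I am new to MLflow and not very familiar with it. Below is a sample of part of my code: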

import mlflow
from pyspark.ml.regression import LinearRegression

with mlflow.start_run(run_name=p_key + '_' + str(o_key)):

    lr = LinearRegression(
        featuresCol='features', labelCol=target_var, maxIter=10,
        regParam=0.0, elasticNetParam=0.0, solver="normal"
    )

    lr_model_item = lr.fit(train_model_data)
    lr_coefficients_item = lr_model_item.coefficients
    lr_coefficients_intercept = lr_model_item.intercept

    lr_predictions_item = lr_model_item.transform(train_model_data)
    lr_predictions_item_oos = lr_model_item.transform(test_model_data)

    rsquared = lr_model_item.summary.r2

    # Log mlflow attributes for mlflow UI
    mlflow.log_metric("rsquared", rsquared)
    mlflow.log_metric("intercept", lr_coefficients_intercept)
    # Iterate over the coefficient values themselves; they are floats and
    # cannot be used as indices into the coefficient vector.
    for coef in lr_coefficients_item.toArray():
        mlflow.log_metric('coefficients', float(coef))

Is this possible? The final output should contain the intercept, the coefficients, the p-values, and the associated variable names.

Solution

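If I understand you correctly, you want to register the p-value and the coefficient of each variable separately in MLflow, keyed by variable name. The difficulty with Spark ML is that all columns are usually merged into a single "features" column before being passed to a given estimator (e.g. LinearRegression). As a result, one loses track of which name belongs to which position in that column.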

We can recover the name of each feature in the "features" column of a linear model by defining the following function [1]:

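from itertools import chain

def feature_names(model, df):
  # The "features" column carries ML attribute metadata, grouped by attribute type.
  features_dict = df.schema[model.summary.featuresCol].metadata["ml_attr"]["attrs"].values()
  # Flatten the groups and sort by each feature's index in the vector.
  return sorted([(attr["idx"], attr["name"]) for attr in chain(*features_dict)])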

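The function above returns a sorted list of tuples, where the first entry of each tuple is the index of the feature in the "features" column and the second entry is the name of the actual feature.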
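To make the lookup concrete, here is a minimal sketch that applies the same flatten-and-sort logic to a plain dictionary. The `ml_attr` contents and the feature names are made up for illustration, and the grouping by attribute type ("numeric", "binary") is an assumption about how the metadata on an assembled "features" column is typically laid out; `names_from_metadata` is a hypothetical helper, not part of Spark.

```python
from itertools import chain

# Hypothetical metadata, shaped like what is assumed to be attached to the
# "features" column: attributes grouped by type, each with an index and a name.
ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
        "binary": [{"idx": 1, "name": "gender_male"}],
    },
    "num_attrs": 3,
}

def names_from_metadata(ml_attr):
    # Flatten the per-type groups, then sort by position in the feature vector.
    groups = ml_attr["attrs"].values()
    return sorted((attr["idx"], attr["name"]) for attr in chain(*groups))

pairs = names_from_metadata(ml_attr)
print(pairs)
```

Sorting matters here because the attributes are grouped by type rather than stored in vector order, so the raw metadata alone does not tell you which name sits at which position.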

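By using the function above in your code, we can now easily match the feature names to the positions in the "features" column and register the coefficient and p-value of each feature.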

import mlflow
from pyspark.ml.regression import LinearRegression

def has_pvalue(model):
  ''' Check whether the given model exposes p-values in its summary '''
  try:
    model.summary.pValues
    return True
  except Exception:
    return False


with mlflow.start_run():
  lr = LinearRegression(
    featuresCol="features", labelCol="label", maxIter=10,
    regParam=1.0, elasticNetParam=0.0, solver="normal"
  )
  lr_model = lr.fit(train_data)

  mlflow.log_metric("rsquared", lr_model.summary.r2)
  mlflow.log_metric("intercept", lr_model.intercept)

  for index, name in feature_names(lr_model, train_data):
    mlflow.log_metric(f"Coef. {name}", lr_model.coefficients[index])
    if has_pvalue(lr_model):
      # P-values are not always available. This depends on the model configuration.
      mlflow.log_metric(f"P-val. {name}", lr_model.summary.pValues[index])
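The pairing of names, coefficients, and p-values can also be sketched without Spark or MLflow, which makes the indexing easier to check. In the sketch below, `build_metrics` is a hypothetical helper and the numbers are invented. Two assumptions are worth stating: the character set used to sanitize names reflects my understanding of what MLflow accepts in metric names, and the p-value list is one element longer than the coefficient list because, when the intercept is fitted, Spark documents the last p-value as belonging to the intercept.

```python
import re

def build_metrics(names, coefficients, p_values=None):
    """Map (index, name) pairs to a flat dict of metric names and values."""
    metrics = {}
    for index, name in names:
        # Assumption: keep only characters believed to be valid in MLflow metric names.
        safe = re.sub(r"[^A-Za-z0-9_\-. /]", "_", name)
        metrics[f"Coef. {safe}"] = float(coefficients[index])
        if p_values is not None:
            metrics[f"P-val. {safe}"] = float(p_values[index])
    return metrics

# Hypothetical values for illustration only.
names = [(0, "age"), (1, "income")]
coefficients = [0.5, -1.25]
p_values = [0.01, 0.2, 0.03]  # assumed: the last entry is the intercept's p-value

metrics = build_metrics(names, coefficients, p_values)
print(metrics)
```

A dictionary like this could then be handed to `mlflow.log_metrics` inside the run in a single call, rather than calling `mlflow.log_metric` once per value.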

[1]: Related Stack Overflow question

databricks mlflow