Azure: Do I need an Azure ML resource to use AutoML in an Azure Databricks notebook?

Problem description

If I want to train a model with AutoML in a Python Databricks notebook, do I need an Azure Machine Learning resource? It seems like an unnecessary resource, given that Databricks has its own compute.

Solution

If I understand your question correctly: yes, AutoML and the Databricks ML libraries are completely different things. Azure AutoML only runs against an Azure Machine Learning workspace, whereas the Spark MLlib examples below run entirely on your Databricks cluster's own compute, with no Azure ML resource involved.
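
For contrast, this is roughly what the Azure AutoML path looks like from a Databricks notebook, and it is this path that needs an Azure Machine Learning workspace. A minimal sketch against the Azure ML SDK v1 (not from the original answer); the workspace details, data frame, and metric are placeholders:

# Hypothetical sketch: Azure AutoML from a notebook, Azure ML SDK v1.
# This is the path that requires an Azure Machine Learning workspace.
from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig

# Placeholder workspace details -- substitute your own subscription/resource group.
ws = Workspace.get(name="my-aml-workspace",
                   subscription_id="<subscription-id>",
                   resource_group="<resource-group>")

automl_config = AutoMLConfig(
    task="regression",            # or "classification"
    training_data=train_pdf,      # placeholder training data with a "label" column
    label_column_name="label",
    primary_metric="normalized_root_mean_squared_error",
    iterations=10)

experiment = Experiment(ws, "databricks-automl-demo")
run = experiment.submit(automl_config, show_output=True)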

Generic random forest regression:

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only
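
If you want automated model and parameter search without Azure AutoML, Spark ML's own tuning utilities cover part of that ground. A minimal sketch (not from the original answer) that reuses the pipeline and RMSE evaluator defined above; the grid values are arbitrary placeholders:

# Sketch: manual hyperparameter search with Spark ML's CrossValidator --
# roughly the work that an AutoML service would automate for you.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [10, 50])
             .addGrid(rf.maxDepth, [5, 10])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,   # the RMSE evaluator defined above
                    numFolds=3)

cvModel = cv.fit(trainingData)
print(evaluator.evaluate(cvModel.transform(testData)))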

Generic random forest classification:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only
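
As a side note, recent Databricks ML runtimes also ship a built-in AutoML API that runs entirely on the Databricks cluster, so it does not need an Azure Machine Learning resource either. A minimal sketch, assuming a Databricks Runtime ML cluster where databricks.automl is available; the DataFrame, column name, and timeout are placeholders:

# Hypothetical sketch: Databricks' built-in AutoML (Databricks Runtime ML only).
# Runs on the Databricks cluster itself -- no Azure ML workspace involved.
from databricks import automl

# train_df is a Spark DataFrame with a "label" column (placeholder names).
summary = automl.classify(train_df, target_col="label", timeout_minutes=30)
print(summary.best_trial.metrics)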

See the following resource for more information.

https://spark.apache.org/docs/latest/ml-classification-regression.html
