Problem description
I am trying to figure out what this error means and how to resolve it. I am using sparklyr on Spark 3.0 for a multi-class classification problem with a random forest. Before any feature engineering, my data (roughly 1 million rows) looks like this:
# Source: spark<?> [?? x 8]
  label_detail duration orig_bytes resp_bytes proto history time_diff_from_last_connection resp_class
  <chr>           <dbl>      <int>      <int> <chr> <chr>                             <dbl> <chr>
1 okiru               0          0          0 tcp   S                              0.000250 A
2 okiru               0          0          0 tcp   S                              0.000250 B
3 okiru               0          0          0 tcp   S                              0.000250 C
4 okiru               0          0          0 tcp   S                              0.000250 A
5 okiru               0          0          0 tcp   S                              0.000250 B
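(For anyone who wants to reproduce the setup without the original data: a minimal sketch that copies a toy R data frame with the same column names and types into Spark is shown below. The values are made up; only the schema follows the preview above, and sc is assumed to be an open spark_connect() connection.)

# Hypothetical toy data mirroring the schema of the preview above
zeek_sample <- data.frame(
  label_detail = c("okiru", "okiru"),
  duration = c(0, 0),
  orig_bytes = c(0L, 0L),
  resp_bytes = c(0L, 0L),
  proto = c("tcp", "tcp"),
  history = c("S", "S"),
  time_diff_from_last_connection = c(0.00025, 0.00025),
  resp_class = c("A", "B")
)
# Copy the toy data into Spark under the name used later by ml_fit()
zeek_train <- copy_to(sc, zeek_sample, "zeek_train", overwrite = TRUE)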
Then I build an ML pipeline as follows:
pipline <- ml_pipeline(sc) %>%
  ft_string_indexer("label_detail", "label_detail_idx") %>%
  ft_string_indexer("proto", "proto_idx") %>%
  ft_string_indexer("resp_class", "resp_class_idx") %>%
  ft_one_hot_encoder(
    input_cols = c("proto_idx", "resp_class_idx"),
    output_cols = c("proto_encode", "resp_class_encode")
  ) %>%
  ft_regex_tokenizer("history", "history_token", pattern = "") %>%
  ft_count_vectorizer(input_col = "history_token", output_col = "history_vector") %>%
  ft_vector_assembler(
    input_cols = c("duration", "orig_bytes", "resp_bytes", "proto_encode",
                   "time_diff_from_last_connection", "resp_class_encode", "history_vector"),
    output_col = "features"
  ) %>%
  ml_random_forest_classifier(label_col = "label_detail_idx", features_col = "features", seed = 222)

model_rf <- ml_fit(pipline, zeek_train)
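One quick sanity check (a sketch, assuming the connection object sc from above) is to confirm which Spark and sparklyr versions the session actually reports, since the accepted argument signatures of some pipeline stages depend on them:

# Versions the session is actually running; some stage signatures
# (e.g. multi-column arguments) differ between Spark 2.x and 3.0
spark_version(sc)
packageVersion("sparklyr")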
Running ml_fit produces the following error:
> model_rf<-ml_fit(pipline,zeek_train)
Error in as.character(call[[1]]) :
cannot coerce type 'closure' to vector of type 'character'
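One way to narrow this down (a debugging sketch, not part of the original post) is to skip the pipeline object and apply the feature transformers directly to the training table one step at a time; sparklyr ft_* functions transform a tbl_spark immediately, so the first call that fails points at the offending stage:

# Apply each transformer straight to the data; the step that errors out
# identifies the problematic pipeline stage
zeek_train %>%
  ft_string_indexer("proto", "proto_idx") %>%
  ft_string_indexer("resp_class", "resp_class_idx") %>%
  ft_one_hot_encoder(
    input_cols = c("proto_idx", "resp_class_idx"),
    output_cols = c("proto_encode", "resp_class_encode")
  ) %>%
  head()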
I also get the same error with the data and example from Mastering Spark with R (https://therinspark.com/pipelines.html):
okc_train <- spark_read_parquet(sc, "data/okc-train.parquet")
okc_train <- okc_train %>%
  select(not_working, age, sex, drinks, drugs, essay1:essay9, essay_length)

pipeline <- ml_pipeline(sc) %>%
  ft_string_indexer(input_col = "sex", output_col = "sex_indexed") %>%
  ft_string_indexer(input_col = "drinks", output_col = "drinks_indexed") %>%
  ft_string_indexer(input_col = "drugs", output_col = "drugs_indexed") %>%
  ft_one_hot_encoder(
    input_cols = c("sex_indexed", "drinks_indexed", "drugs_indexed"),
    output_cols = c("sex_encoded", "drinks_encoded", "drugs_encoded")
  ) %>%
  ft_vector_assembler(
    input_cols = c("age", "sex_encoded", "drugs_encoded", "essay_length"),
    output_col = "features"
  ) %>%
  ft_standard_scaler(input_col = "features", output_col = "features_scaled", with_mean = TRUE) %>%
  ml_logistic_regression(features_col = "features_scaled", label_col = "not_working")

ml_fit(pipeline, okc_train)
Solution
Since I hit the same error with the example from the book, I ran traceback() right after the error message to get more detail. The feature in question (the multi-column input_cols/output_cols form of ft_one_hot_encoder) apparently requires Spark 3.0.
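If the connection turns out to be running a pre-3.0 Spark, a possible workaround (a sketch, not verified against the original cluster) is to build the encoder stage with ft_one_hot_encoder_estimator, which sparklyr provides for the multi-column encoder on Spark 2.3-2.4, assuming your sparklyr version still exports it:

# On Spark 2.3-2.4, ft_one_hot_encoder_estimator exposes the multi-column
# encoder; on Spark 3.0+ the original ft_one_hot_encoder call should work
pipline <- ml_pipeline(sc) %>%
  ft_string_indexer("proto", "proto_idx") %>%
  ft_string_indexer("resp_class", "resp_class_idx") %>%
  ft_one_hot_encoder_estimator(
    input_cols = c("proto_idx", "resp_class_idx"),
    output_cols = c("proto_encode", "resp_class_encode")
  )

Alternatively, reconnecting against an actual Spark 3.0 distribution lets the original ft_one_hot_encoder call work unchanged.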