使用相同任务的 svm 和 ranger 的不同运行时

问题描述

我对两个学习器的运行时间进行了基准测试，并在 {ranger} 和 {svm} 进行训练时截取了 {htop} 的两个屏幕截图，以使我的观点更加清晰。正如这篇文章的标题所述，我的问题是为什么 svm 中的训练/预测与其他学习者（在本例中为 ranger）相比如此慢？是否与学习者的底层结构有关？或者我在代码中犯了错误？要么...？任何帮助表示赞赏。

游侠训练时的htop；使用所有线程。

svm 训练时的 htop；只使用了 2 个线程。

代码：

library(mlr3verse)

library(future.apply)
#> Loading required package: future
library(future)
library(lgr)
library(data.table)

file <- 'https://drive.google.com/file/d/1Y3ao-SOFH/view?usp=sharing' 
#you can download the file using this link.
destfile = '/../Downloads/sample.RData'
download.file(file,destfile)


# pre-processing
sample$clc3_red <- as.factor(sample$clc3_red)
sample$X <- NULL
#> Warning in set(x,j = name,value = value): Column 'X' does not exist to remove
sample$confidence <- NULL
sample$ecoregion_id <- NULL
sample$gsw_occurrence_1984_2019 <- NULL

tsk_clf <- mlr3::TaskClassif$new(id = 'sample',backend = sample,target = "clc3_red")
tsk_clf$col_roles$group = 'tile_id' #spatial CV
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature,'tile_id')
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature,'x')
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature,'y')

# 2 learners for benchmarking
svm <- lrn("classif.svm",type = "C-classification",kernel = "radial",predict_type = "response")
ranger <- lrn("classif.ranger",predict_type = "response",importance = "permutation")

# ranger parallel
plan(multicore)
time <- Sys.time()
ranger$
  train(tsk_clf)$
  predict(tsk_clf)$
  score()
#> Warning: Dropped unused factor level(s) in dependent variable: 333,335,521,#> 522.
#> classif.ce 
#>     0.0116
Sys.time() - time
#> Time difference of 20.12981 secs

# svm parallel
plan(multicore)
time <- Sys.time()
svm$
  train(tsk_clf)$
  predict(tsk_clf)$
  score()
#> classif.ce 
#>     0.4361
Sys.time() - time
#> Time difference of 55.13694 secs

^{由 reprex package (v0.3.0) 于 2021 年 1 月 12 日创建}

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRblas.so
#> LAPACK: /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  Grdevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] data.table_1.12.8   lgr_0.3.4           future.apply_1.6.0 
#>  [4] future_1.18.0       mlr3verse_0.1.3     paradox_0.3.0      
#>  [7] mlr3viz_0.1.1       mlr3tuning_0.1.2    mlr3pipelines_0.1.3
#> [10] mlr3learners_0.2.0  mlr3filters_0.2.0   mlr3_0.3.0         
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5         compiler_4.0.2     pillar_1.4.6       highr_0.8         
#>  [5] class_7.3-17       mlr3misc_0.3.0     tools_4.0.2        digest_0.6.25     
#>  [9] uuid_0.1-4         lattice_0.20-41    evaluate_0.14      lifecycle_0.2.0   
#> [13] tibble_3.0.3       checkmate_2.0.0    gtable_0.3.0       pkgconfig_2.0.3   
#> [17] rlang_0.4.7        Matrix_1.2-18      parallel_4.0.2     yaml_2.2.1        
#> [21] xfun_0.15          e1071_1.7-3        ranger_0.12.1      withr_2.2.0       
#> [25] stringr_1.4.0      knitr_1.29         globals_0.12.5     vctrs_0.3.2       
#> [29] grid_4.0.2         glue_1.4.1         listenv_0.8.0      R6_2.4.1          
#> [33] rmarkdown_2.3      ggplot2_3.3.2      magrittr_1.5       mlr3measures_0.2.0
#> [37] codetools_0.2-16   backports_1.1.8    scales_1.1.1       htmltools_0.5.0   
#> [41] ellipsis_0.3.1     colorspace_1.4-1   stringi_1.4.6      munsell_0.5.0     
#> [45] Crayon_1.3.4

解决方法

完全可以预期 SVM 的运行时间比随机森林长，因为 SVM 随着观察数量和特征数量的增加而严重扩展。使 SVM 更昂贵的事情：

Multiclass：在这里，对于每个类，训练一个 SVM，因为 SVM 本身就可以解决二分类
许多分类变量会增加特征的数量，因为它们通过虚拟编码转换为数字特征

有更快的近似 SVM，但通常不值得进一步研究，因为它们的性能通常优于基于树的方法（随机森林）或增强方法 (xgboost)。

另见：Why does my SVM take so long to run? 和 How much time does take train SVM classifier?

在简单的训练/预测调用中，您没有使用任何与 mlr3 相关的并行化。您看到的 ranger 是内部烘焙并行化（e1071 没有）。请参阅 num.threads 中的 ranger::ranger() 参数。

要使用与 mlr3 相关的并行化，您需要

调整超参数
运行交叉验证（即多次训练/预测调用）
对多个学习者进行基准测试

执行直接模型拟合的单个 train() 调用不受 mlr3 的影响。通常保持这种状态很好，因为它们非常快，并且启用内置学习器并行化可能会导致基于未来的并行化出现问题。

我没有查看您的数据集，但看到单个模型拟合的拟合时间，它可能/必须非常大。默认情况下，SVM 会优化 lambda 和 C，因此您可能会在此处看到 SVM 的拟合时间相当长。

mlr mlr3 r r r-ranger svm