How do I resolve a Spark stage-failure error when copying a data frame to Spark?

Problem description

I have been struggling with this and keep getting different errors at different points in the run.

I have a file larger than 4 GB that I copied into the DBFS file store with the CLI. I want to load that csv file from the file store into Spark but don't know how, so I read the file into R and then tried copy_to to push it to Spark, and I get the error below.

Spark session

sc <- spark_connect(method = "databricks", spark_home = Sys.getenv("SPARK_HOME"), version = "2.4")

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rlang_0.4.7     sparklyr_1.3.1  forcats_0.5.0   stringr_1.4.0  
 [5] dplyr_1.0.2     purrr_0.3.4     readr_1.3.1     tidyr_1.1.2    
 [9] tibble_3.0.3    ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2         pkgload_1.0.2      jsonlite_1.7.1     modelr_0.1.6      
 [5] assertthat_0.2.1   blob_1.2.1         cellranger_1.1.0   yaml_2.2.1        
 [9] remotes_2.2.0      r2d3_0.2.3         sessioninfo_1.1.1  pillar_1.4.6      
[13] backports_1.1.9    lattice_0.20-41    glue_1.4.2         digest_0.6.25     
[17] rvest_0.3.5        colorspace_1.4-1   htmltools_0.5.0    pkgconfig_2.0.3   
[21] devtools_2.3.1     broom_0.5.6        haven_2.3.1        config_0.3        
[25] scales_1.1.0       processx_3.4.2     TeachingDemos_2.10 generics_0.0.2    
[29] usethis_1.6.0      ellipsis_0.3.1     withr_2.2.0        cli_2.0.2         
[33] magrittr_1.5       crayon_1.3.4       Rserve_1.8-7       readxl_1.3.1      
[37] memoise_1.1.0      ps_1.3.2           fs_1.4.1           fansi_0.4.1       
[41] nlme_3.1-147       xml2_1.3.2         hwriter_1.3.2      pkgbuild_1.0.6    
[45] tools_3.6.3        prettyunits_1.1.1  hms_0.5.3          lifecycle_0.2.0   
[49] munsell_0.5.0      reprex_0.3.0       callr_3.4.3        compiler_3.6.3    
[53] forge_0.2.0        grid_3.6.3         rstudioapi_0.11    htmlwidgets_1.5.1 
[57] base64enc_0.1-3    testthat_2.3.2     gtable_0.3.0       DBI_1.1.0         
[61] curl_4.3           R6_2.4.1           hwriterPlus_1.0-3  lubridate_1.7.8   
[65] rprojroot_1.3-2    desc_1.2.0         stringi_1.5.3      parallel_3.6.3    
[69] Rcpp_1.0.4.6       vctrs_0.3.4        SparkR_3.0.0       dbplyr_1.4.4      
[73] tidyselect_1.1.0  

Reading the file into R

## read the csv from the file store into an R data frame
df <- read.csv(file_location, header = TRUE, na.strings = c(" ", "", "NA"))

## copy the R data frame to the Spark cluster
df_tbl <- copy_to(sc, df = df, name = "df_tbl")

Error : org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 88:0 was 527499668 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
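
The key figures in the exception are the serialized task size (527499668 bytes, roughly 503 MB) and the limit it exceeds, spark.rpc.message.maxSize (268435456 bytes, i.e. 256 MB). The message's own first suggestion is to raise that limit. In sparklyr this can be attempted through the connection config; the value is given in MB, and 512 below is only an illustrative choice. Note, however, that with method = "databricks" the connection attaches to an already-running Spark session, so the setting may instead have to go into the cluster's own Spark configuration:

library(sparklyr)

conf <- spark_config()
# spark.rpc.message.maxSize is specified in MB; 512 is an example value,
# chosen to be larger than the ~503 MB task reported in the error above
conf$spark.rpc.message.maxSize <- 512

sc <- spark_connect(method = "databricks",
                    spark_home = Sys.getenv("SPARK_HOME"),
                    version = "2.4",
                    config = conf)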

Workaround

No confirmed fix for this issue has been posted yet.
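
A workaround that sidesteps the limit entirely is to avoid copy_to for data of this size: copy_to serializes the whole R data frame and ships it from the driver to the executors, which is exactly what overruns spark.rpc.message.maxSize. Since the csv is already in the DBFS file store, the cluster can read it directly with spark_read_csv. A minimal sketch, assuming a hypothetical DBFS path (replace it with the real location of the uploaded file):

library(sparklyr)

sc <- spark_connect(method = "databricks", spark_home = Sys.getenv("SPARK_HOME"))

# Hypothetical path; substitute the actual DBFS location of the csv
csv_path <- "dbfs:/FileStore/tables/my_file.csv"

# The executors read the file in parallel, so nothing large is serialized
# through the driver the way copy_to does
df_tbl <- spark_read_csv(
  sc,
  name         = "df_tbl",
  path         = csv_path,
  header       = TRUE,
  infer_schema = TRUE,
  null_value   = "NA"   # single null marker, unlike read.csv's na.strings vector
)

Reading on the executors also avoids pulling the full >4 GB file into the R session first, which read.csv otherwise requires.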
