How do I fix a Spark stage failure error when copying a DataFrame to Spark?

Asked by 30byixjq on 2021-05-27 in Spark

I have been struggling with this for a while, and I get a different error on each run.
I have a file larger than 4 GB that I copied into the DBFS FileStore using the CLI. I want to load that CSV file from the FileStore into Spark but don't know how to do it, so I read the file into R and then tried `copy_to` to Spark, which gave me the error below.
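
For reference, sparklyr can read a CSV that already sits in the DBFS FileStore straight into the cluster, so the data never has to pass through the R session at all. A minimal sketch, assuming a hypothetical file location (`dbfs:/FileStore/tables/my_file.csv` is a placeholder; substitute the real path):

```r
library(sparklyr)

sc <- spark_connect(method = "databricks")

# spark_read_csv() has the executors read the file in parallel,
# so nothing is serialized through the R driver.
# The path below is a placeholder -- point it at the actual file.
df_tbl <- spark_read_csv(
  sc,
  name = "df_tbl",
  path = "dbfs:/FileStore/tables/my_file.csv",
  header = TRUE,
  infer_schema = TRUE
)
```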
Spark session

```r
sc <- spark_connect(method = "databricks",
                    spark_home = Sys.getenv("SPARK_HOME"),
                    version = "2.4")
```

`sessionInfo()` output:

```
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rlang_0.4.7     sparklyr_1.3.1  forcats_0.5.0   stringr_1.4.0
 [5] dplyr_1.0.2     purrr_0.3.4     readr_1.3.1     tidyr_1.1.2
 [9] tibble_3.0.3    ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2         pkgload_1.0.2      jsonlite_1.7.1     modelr_0.1.6
 [5] assertthat_0.2.1   blob_1.2.1         cellranger_1.1.0   yaml_2.2.1
 [9] remotes_2.2.0      r2d3_0.2.3         sessioninfo_1.1.1  pillar_1.4.6
[13] backports_1.1.9    lattice_0.20-41    glue_1.4.2         digest_0.6.25
[17] rvest_0.3.5        colorspace_1.4-1   htmltools_0.5.0    pkgconfig_2.0.3
[21] devtools_2.3.1     broom_0.5.6        haven_2.3.1        config_0.3
[25] scales_1.1.0       processx_3.4.2     TeachingDemos_2.10 generics_0.0.2
[29] usethis_1.6.0      ellipsis_0.3.1     withr_2.2.0        cli_2.0.2
[33] magrittr_1.5       crayon_1.3.4       Rserve_1.8-7       readxl_1.3.1
[37] memoise_1.1.0      ps_1.3.2           fs_1.4.1           fansi_0.4.1
[41] nlme_3.1-147       xml2_1.3.2         hwriter_1.3.2      pkgbuild_1.0.6
[45] tools_3.6.3        prettyunits_1.1.1  hms_0.5.3          lifecycle_0.2.0
[49] munsell_0.5.0      reprex_0.3.0       callr_3.4.3        compiler_3.6.3
[53] forge_0.2.0        grid_3.6.3         rstudioapi_0.11    htmlwidgets_1.5.1
[57] base64enc_0.1-3    testthat_2.3.2     gtable_0.3.0       DBI_1.1.0
[61] curl_4.3           R6_2.4.1           hwriterPlus_1.0-3  lubridate_1.7.8
[65] rprojroot_1.3-2    desc_1.2.0         stringi_1.5.3      parallel_3.6.3
[69] Rcpp_1.0.4.6       vctrs_0.3.4        SparkR_3.0.0       dbplyr_1.4.4
[73] tidyselect_1.1.0
```

Reading the file into R

```r
df <- read.csv(file_location, header = TRUE, na.strings = c(" ", "", "NA"))

## copy the data frame to the Spark cluster
df_tbl <- copy_to(sc, df = df, name = "df_tbl")
```

This fails with:

```
Error : org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 88:0 was 527499668 bytes, which exceeds max allowed:
spark.rpc.message.maxSize (268435456 bytes). Consider increasing
spark.rpc.message.maxSize or using broadcast variables for large values.
```
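
The error message itself suggests raising `spark.rpc.message.maxSize`: the serialized task here was about 527 MB against a 256 MB limit. One way to try that from sparklyr is through `spark_config()` at connection time, sketched below; note that with `method = "databricks"` the connection attaches to an existing cluster, so this key may instead have to be set in the cluster's Spark configuration.

```r
library(sparklyr)

conf <- spark_config()
# Value is in MiB; 1024 comfortably exceeds the ~527 MB serialized
# task from the error. On Databricks this may need to go in the
# cluster's Spark config rather than here.
conf[["spark.rpc.message.maxSize"]] <- 1024

sc <- spark_connect(method = "databricks", config = conf)
```

Even with a larger limit, `copy_to()` still pushes the whole in-memory data frame through the driver as one serialized task, so for a file of this size reading it directly on the cluster with `spark_read_csv()` (as sketched above) sidesteps the problem rather than enlarging it.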
