I'm running into a problem when trying to write 2 datasets with `sparklyr::spark_write_csv()`. This is my configuration:
```r
library(sparklyr)
library(DBI)

# Configure cluster
config <- spark_config()
config$spark.yarn.keytab <- "mykeytab.keytab"
config$spark.yarn.principal <- "myyarnprincipal"
config$sparklyr.gateway.start.timeout <- 10
config$spark.executor.instances <- 2
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.driver.memory <- "4G"
config$spark.kryoserializer.buffer.max <- "1G"

Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf.cloudera.hdfs")
Sys.setenv(YARN_CONF_DIR = "/etc/hadoop/conf.cloudera.yarn")

# Connect to the cluster
sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
```
Once the Spark context is successfully created, I try to save the 2 datasets to HDFS with `spark_write_csv()`. As an intermediate step, I need to convert the data frames to `tbl_spark` objects. Unfortunately, I can only save the first file correctly; the second one (which is larger, about 360 MB, but definitely not big by Hadoop standards) takes a very long time and eventually crashes.
```r
# load datasets
tmp_small <- read.csv("first_one.csv", sep = "|")  # 13 MB
tmp_big <- read.csv("second_one.csv", sep = "|")   # 352 MB

tmp_small_Spark <- sdf_copy_to(sc, tmp_small, "tmp_small", memory = F, overwrite = T)
tables_preview <- dbGetQuery(sc, "SHOW TABLES")

tmp_big_Spark <- sdf_copy_to(sc, tmp_big, "tmp_big", memory = F, overwrite = T)  # fail!!
tables_preview <- dbGetQuery(sc, "SHOW TABLES")
```
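For completeness, the write step I am aiming for after the copy would look roughly like this (the HDFS output paths here are just placeholders; the failure above already happens in `sdf_copy_to()`, before I ever get to the write):

```r
# Write the Spark tables out to HDFS as pipe-delimited CSV.
# The paths are placeholders for illustration.
spark_write_csv(tmp_small_Spark, path = "hdfs:///tmp/tmp_small_csv", delimiter = "|")
spark_write_csv(tmp_big_Spark, path = "hdfs:///tmp/tmp_big_csv", delimiter = "|")
```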
It's probably a configuration problem, but I can't figure it out. This is the error:

```
|================================================================================| 100% 352 MB
Error in invoke_method.spark_shell_connection(sc, TRUE, class, method, :
  No status is returned. Spark R backend might have failed.
```
Thanks!
1 Answer
I also had problems loading larger files. Try adding this to your Spark connection config:
It's a workaround, though.
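For what it's worth, the setting that usually matters with this error is the memory given to the driver JVM that hosts the R backend: in yarn-client mode, a `spark.driver.memory` value set inside `spark_config()` can be applied too late to take effect, which is why the common advice is to use the `sparklyr.shell.*` options, which are passed to spark-submit itself. A minimal sketch, assuming driver memory (rather than the executors) is the bottleneck here:

```r
# A sketch, not the original answer's exact snippet: hand the memory settings
# to spark-submit via sparklyr.shell.* options (note the backticks, since the
# option names contain dashes), then reconnect.
config$`sparklyr.shell.driver-memory`   <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"

sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
```

If raising memory does not help, another way around `sdf_copy_to()` choking on the 352 MB table is to skip the local `read.csv()` step and load the file straight into Spark with `spark_read_csv()`, so the data never has to be serialized from the R session to the backend.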