pyspark - Unable to set "spark.driver.maxResultSize" in Spark 3.0

cnjp1d6j  asked on 2023-03-11  in Spark

I am trying to convert a Spark DataFrame to a pandas DataFrame. My driver has plenty of memory. I tried setting the spark.driver.maxResultSize value as follows:

spark = (
    SparkSession
    .builder
    .appName('test')
    .enableHiveSupport()
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)

But the job fails with the following error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of XXXX tasks (1026.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
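
For reference, the conversion step that fails is essentially the following (the table name here is just a placeholder). toPandas() collects all rows back to the driver, and the total size of those serialized results is what gets checked against spark.driver.maxResultSize:

# toPandas() pulls every partition back to the driver, so its serialized
# result size is what is compared against spark.driver.maxResultSize.
sdf = spark.table("my_table")   # "my_table" is a hypothetical table name
pdf = sdf.toPandas()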

e4yzc0pl1#

In version 3.0.0, your error is triggered by the following snippet:

/**
   * Check whether has enough quota to fetch the result with `size` bytes
   */
  def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
    totalResultSize += size
    calculatedTasks += 1
    if (maxResultSize > 0 && totalResultSize > maxResultSize) {
      val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
        s"(${Utils.bytesToString(totalResultSize)}) is bigger than ${config.MAX_RESULT_SIZE.key} " +
        s"(${Utils.bytesToString(maxResultSize)})"
      logError(msg)
      abort(msg)
      false
    } else {
      true
    }
  }

As you can see, if maxResultSize == 0 you would never hit the error you are getting. A bit further up you can see that maxResultSize comes from config.MAX_RESULT_SIZE, and in the following piece of code you see that spark.driver.maxResultSize is what ultimately defines config.MAX_RESULT_SIZE:

private[spark] val MAX_RESULT_SIZE = ConfigBuilder("spark.driver.maxResultSize")
    .doc("Size limit for results.")
    .version("1.2.0")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefaultString("1g")

Conclusion

You are trying the right thing! Setting spark.driver.maxResultSize to 0 (meaning unlimited) is also valid in Spark 3.0. However, as the error message shows, your config.MAX_RESULT_SIZE still seems to be equal to the default value of 1024 MiB.
That means your configuration is probably not getting through. I would investigate your whole setup: how do you submit your application? What is your master? Does your spark.sql.execution.arrow.pyspark.enabled setting get through?
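
One quick way to check whether the setting actually reached the driver is to read the effective configuration back from the running session; a minimal sketch, run right after getOrCreate():

# If the builder .config() call was picked up, both lines print "0";
# otherwise the driver is still running with the 1g default.
print(spark.sparkContext.getConf().get("spark.driver.maxResultSize", "not set"))
print(spark.conf.get("spark.driver.maxResultSize", "not set"))

Also keep in mind that getOrCreate() reuses an already-running SparkContext if one exists (common in notebooks), in which case core driver settings passed through the builder may not take effect. In that situation, passing the value at submit time, e.g. spark-submit --conf spark.driver.maxResultSize=0, is the more reliable route.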
