Spark physical plan: meaning of "number of input batches" in the ColumnarToRow operator

Posted by dfty9e19 on 2023-01-01 in Spark

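I am looking at a physical plan generated by running a Spark query. The query reads a parquet file and performs some aggregations. The physical plan contains an operator named ColumnarToRow, which reports a statistic called "number of input batches". I am curious how this number is determined. It seems to depend on the number of row groups in the parquet file, but not entirely.
Here is my code: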

from pyspark.sql import functions as f

# Parentheses let the method chain span several lines.
df1 = (
    spark.read.parquet('data/')
    .select('col1')
    .groupby('col1')
    .agg(f.count('col1').alias('ct'))
    .toPandas()
)

Here are the statistics of the ColumnarToRow operator:

ColumnarToRow
number of output rows: 327,069
number of input batches: 80

pyspark

Source: https://stackoverflow.com/questions/74967205/spark-physical-plan-meaning-of-number-of-input-batches-in-columnartorow-operato

1 Answer

bnl4lu3b1#

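The ColumnarToRow block is there because you are reading a parquet file. Parquet files are stored in a column-oriented fashion, which brings a lot of benefits.
In Apache Spark, however, RDDs are stored in a row-oriented fashion, which lets us efficiently perform classic operations such as map, reduce, and groupBy.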
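As a rough illustration of what "columnar to row" means, here is a minimal pure-Python sketch. It is not Spark code: the `columns` dictionary and its values are made up for the example, and it only shows the change of layout, not how Spark implements it.

```python
# Column-oriented layout: one list of values per column (hypothetical data).
columns = {
    "col1": ["a", "b", "a"],
    "col2": [1, 2, 3],
}

# Row-oriented layout: one tuple per record, built by walking the columns in lockstep.
rows = list(zip(*columns.values()))

for row in rows:
    print(row)
```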
Now, if we take a quick look at the source code that produces the numbers you are asking about (using Spark v3.3.1), we find the following in Columnar.scala:

override lazy val metrics: Map[String, SQLMetric] = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
  "numInputBatches" -> SQLMetrics.createMetric(sparkContext, "number of input batches")
)
override def doExecute(): RDD[InternalRow] = {
  val numOutputRows = longMetric("numOutputRows")
  val numInputBatches = longMetric("numInputBatches")
  // This avoids calling `output` in the RDD closure, so that we don't need to include the entire
  // plan (this) in the closure.
  val localOutput = this.output
  child.executeColumnar().mapPartitionsInternal { batches =>
    val toUnsafe = UnsafeProjection.create(localOutput, localOutput)
    batches.flatMap { batch =>
      numInputBatches += 1
      numOutputRows += batch.numRows()
      batch.rowIterator().asScala.map(toUnsafe)
    }
  }
}
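To make the control flow of the excerpt easier to follow, here is a minimal pure-Python model of it. This is a sketch and not Spark code: `Metrics`, `columnar_to_row`, and the sample `partitions` are hypothetical stand-ins. The outer loop plays the role of mapPartitionsInternal and the inner loop plays the role of batches.flatMap.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, ...]
Batch = List[Row]  # stand-in for a ColumnarBatch: a chunk of rows


class Metrics:
    """Stand-in for the two SQLMetric counters in the excerpt."""

    def __init__(self) -> None:
        self.num_input_batches = 0
        self.num_output_rows = 0


def columnar_to_row(
    partitions: Iterable[Iterable[Batch]], metrics: Metrics
) -> Iterator[Row]:
    # Outer loop: one pass per partition (mapPartitionsInternal in the excerpt).
    for batches in partitions:
        # Inner loop: one pass per batch (batches.flatMap in the excerpt).
        for batch in batches:
            metrics.num_input_batches += 1
            metrics.num_output_rows += len(batch)
            yield from batch


# Hypothetical input: 2 partitions holding 3 batches and 7 rows in total.
partitions = [
    [[(1,), (2,), (3,)], [(4,)]],  # partition 0: two batches
    [[(5,), (6,), (7,)]],          # partition 1: one batch
]
metrics = Metrics()
rows = list(columnar_to_row(partitions, metrics))
print(metrics.num_input_batches, metrics.num_output_rows)
```

In this model the batch counter ends up at the number of batches (3), which differs from the number of partitions (2), because the increment sits in the inner loop.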

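We can see that numInputBatches is incremented (numInputBatches += 1) inside batches.flatMap, which itself runs inside mapPartitionsInternal. The increment therefore happens once per columnar batch that the child operator hands over, in every partition, so numInputBatches is the total number of batches fed into ColumnarToRow and not the number of partitions of the parquet file you are reading. This would also explain why the value tracks the row groups of the file without matching them exactly: how the scan cuts the data into batches is decided by the reader, not by ColumnarToRow.
For comparison, you can get the partition count of the scan by running the following in a (py)Spark shell; it will generally differ from the batch count: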

spark.read.parquet('data/').rdd.getNumPartitions()
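The metric in the excerpt is incremented once per batch, so a back-of-the-envelope estimate of the batch count can help relate it to the row groups. The sketch below rests on two assumptions that are not stated in the thread: the vectorized parquet reader is in use with a batch capacity of 4096 rows (what I believe is the default of spark.sql.parquet.columnarReaderBatchSize; you can check it with spark.conf.get in your own session), and a batch never spans two row groups. The row group layouts are hypothetical, since the layout of the file in the question is unknown.

```python
import math
from typing import Sequence

# Assumed default of spark.sql.parquet.columnarReaderBatchSize; verify in your session.
ASSUMED_BATCH_SIZE = 4096


def estimate_input_batches(
    row_group_sizes: Sequence[int], batch_size: int = ASSUMED_BATCH_SIZE
) -> int:
    """Estimate the batch count, assuming no batch spans two row groups."""
    return sum(math.ceil(n / batch_size) for n in row_group_sizes)


total_rows = 327_069  # "number of output rows" reported in the question

# Hypothetical layout 1: all rows sit in a single row group.
single_group = estimate_input_batches([total_rows])

# Hypothetical layout 2: the same rows split across two row groups.
two_groups = estimate_input_batches([200_000, 127_069])

print(single_group, two_groups)
```

Under these assumptions the single row group layout gives 80 batches, which is consistent with the value in the question, while the two row group layout gives 81. That would fit the observation that the number depends on the row groups but not only on them.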

Hope this helps!

Answered on 2023-01-01
