How to get the size of a DataFrame before doing a broadcast join in PySpark

8wtpewkr  posted on 2021-05-29  in  Spark

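I am new to Spark and want to do a broadcast join. Before that, I am trying to get the size of the DataFrame I want to broadcast.
Is there a way to find the size of a DataFrame?
I am using Python as the programming language for Spark.
Any help is much appreciated.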

u4dcyp6a  #1

To get the size in bytes as well as the row count, use one of the following approaches (the examples are written in Scala):

Alternative 1

// ### Alternative 1
    import org.apache.spark.sql.functions.col  // needed for the col(...) filters below
    /**
      * file content
      * spark-test-data.json
      * --------------------
      * {"id":1,"name":"abc1"}
      * {"id":2,"name":"abc2"}
      * {"id":3,"name":"abc3"}
      */
    val fileName = "spark-test-data.json"
    val path = getClass.getResource("/" + fileName).getPath

    spark.catalog.createTable("df", path, "json")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |1  |abc1|
      * |2  |abc2|
      * |3  |abc3|
      * +---+----+
      */
    // Collect only statistics that do not require scanning the whole table (that is, size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+---------+-------+
      * |col_name  |data_type|comment|
      * +----------+---------+-------+
      * |Statistics|68 bytes |       |
      * +----------+---------+-------+
      */
    // A full ANALYZE scans the table, so the statistics also include the row count.
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+----------------+-------+
      * |col_name  |data_type       |comment|
      * +----------+----------------+-------+
      * |Statistics|68 bytes, 3 rows|       |
      * +----------+----------------+-------+
      */
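
Since the question asks about PySpark, the same statistics can also be read from Python. The following is a minimal sketch, assuming a table (here `df`) is already registered in the catalog as above. `parse_statistics` and `table_statistics` are hypothetical helper names, not part of Spark, and the parser assumes the `Statistics` value has the `68 bytes, 3 rows` form shown in the output above.

```python
import re


def parse_statistics(stats):
    """Parse a Statistics value such as '68 bytes, 3 rows'.

    Returns (size_in_bytes, row_count); row_count is None when only the
    size was collected (for example after ANALYZE ... NOSCAN).
    """
    size_match = re.search(r"(\d+)\s+bytes", stats)
    rows_match = re.search(r"(\d+)\s+rows", stats)
    if size_match is None:
        raise ValueError("no byte size found in: %r" % stats)
    size_in_bytes = int(size_match.group(1))
    row_count = int(rows_match.group(1)) if rows_match else None
    return size_in_bytes, row_count


def table_statistics(spark, table_name):
    """Analyze a catalog table and return (size_in_bytes, row_count).

    Assumptions: `spark` is an existing SparkSession and `table_name` is a
    trusted identifier (it is interpolated into the SQL text as-is).
    """
    spark.sql("ANALYZE TABLE {} COMPUTE STATISTICS".format(table_name))
    rows = (
        spark.sql("DESCRIBE EXTENDED {}".format(table_name))
        .filter("col_name = 'Statistics'")
        .collect()
    )
    if not rows:
        raise ValueError("no Statistics row for table %r" % table_name)
    return parse_statistics(rows[0]["data_type"])
```

With the table from the example above, `table_statistics(spark, "df")` would be called once the table exists; the parsing step can be checked on its own with `parse_statistics("68 bytes, 3 rows")`.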

Alternative 2

// ### Alternative 2

    val df = spark.range(10)
    df.createOrReplaceTempView("myView")
    // EXPLAIN COST prints the optimized logical plan together with its estimated statistics.
    spark.sql("explain cost select * from myView").show(false)

    /**
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |plan                                                                                                                                                                    |
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |== Optimized Logical Plan ==
      * Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
      *
      * == Physical Plan ==
      * *(1) Range (0, 10, step=1, splits=2)|
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */
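
In PySpark the `explain cost` query can be run the same way, and the estimate can be pulled out of the plan text. The sketch below is one possible way to do that, not an official API: `size_from_plan` and `explain_cost_size` are hypothetical helpers, and the regular expression assumes the `sizeInBytes=80.0 B` format shown above. The unit table covers both the `KB`/`MB` and `KiB`/`MiB` spellings, which differ between Spark versions as far as I know. Because the printed value is rounded, the result is approximate for large sizes.

```python
import re

# Binary multipliers for the unit suffixes that may follow the number.
_UNIT_FACTORS = {
    "B": 1,
    "KB": 1024, "KiB": 1024,
    "MB": 1024 ** 2, "MiB": 1024 ** 2,
    "GB": 1024 ** 3, "GiB": 1024 ** 3,
    "TB": 1024 ** 4, "TiB": 1024 ** 4,
    "PB": 1024 ** 5, "PiB": 1024 ** 5,
    "EiB": 1024 ** 6,
}

# Matches e.g. "sizeInBytes=80.0 B" or "sizeInBytes=1.0E+30 B".
_SIZE_PATTERN = re.compile(r"sizeInBytes=([0-9.]+(?:E[+-]?\d+)?)\s*([A-Za-z]+)")


def size_from_plan(plan_text):
    """Return the first sizeInBytes estimate found in an EXPLAIN COST plan, in bytes."""
    match = _SIZE_PATTERN.search(plan_text)
    if match is None:
        raise ValueError("no sizeInBytes found in plan")
    value, unit = match.groups()
    if unit not in _UNIT_FACTORS:
        raise ValueError("unknown size unit: %r" % unit)
    return int(float(value) * _UNIT_FACTORS[unit])


def explain_cost_size(spark, view_name):
    """Run EXPLAIN COST on a view and return the estimated size in bytes.

    Assumptions: `spark` is an existing SparkSession and `view_name` is a
    trusted identifier (it is interpolated into the SQL text as-is).
    """
    query = "EXPLAIN COST SELECT * FROM {}".format(view_name)
    plan_text = spark.sql(query).collect()[0]["plan"]
    return size_from_plan(plan_text)
```

The first match is used because the root node of the optimized logical plan is printed first, so its statistics describe the whole query.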

Alternative 3

// ### Alternative 3
    // Read the optimizer's size estimate (in bytes) for df directly from the optimized plan.
    println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
// 80
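
From PySpark, a commonly used workaround is to reach the same optimizer statistics through the DataFrame's JVM handle and then decide whether to broadcast. This is only a sketch under stated assumptions: `_jdf` is a private PySpark attribute that may change between releases, the no-argument `stats()` call is assumed to be available in the Spark version in use, and all three function names are hypothetical. The default threshold is the documented default of `spark.sql.autoBroadcastJoinThreshold` (10 MB); pass your own value if the session overrides it.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def estimated_size_bytes(df):
    """Return the optimizer's size estimate for a PySpark DataFrame, in bytes.

    Assumption: goes through the private `_jdf` handle to the JVM Dataset.
    """
    jvm_size = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
    # The JVM side returns a scala.math.BigInt; convert via its string form.
    return int(str(jvm_size))


def fits_broadcast(size_in_bytes, threshold_bytes=DEFAULT_BROADCAST_THRESHOLD):
    """True when the estimate is non-negative and within the threshold.

    A negative threshold (such as -1) therefore never allows a broadcast.
    """
    return 0 <= size_in_bytes <= threshold_bytes


def join_with_optional_broadcast(large_df, small_df, on,
                                 threshold_bytes=DEFAULT_BROADCAST_THRESHOLD):
    """Join two DataFrames, adding a broadcast hint only if small_df looks small enough."""
    # Imported lazily so the pure helpers above work without PySpark installed.
    from pyspark.sql.functions import broadcast

    if fits_broadcast(estimated_size_bytes(small_df), threshold_bytes):
        small_df = broadcast(small_df)
    return large_df.join(small_df, on=on)
```

Keep in mind that the value is an estimate from the query planner, not the in-memory size of the collected data, so it is best treated as a rough guide when deciding whether to broadcast.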
