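I'm new to Spark and want to do a broadcast join. Before that, I'm trying to get the size of the DataFrame I want to broadcast. Is there a way to find the size of a DataFrame? I'm using Python as my programming language for Spark. Any help is much appreciated.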
u4dcyp6a1#
If you want the size in bytes as well as the size in row count, follow these steps:
// ### Alternative 1
import org.apache.spark.sql.functions.col

/**
  * file content
  * spark-test-data.json
  * --------------------
  * {"id":1,"name":"abc1"}
  * {"id":2,"name":"abc2"}
  * {"id":3,"name":"abc3"}
  */
val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath

// Register the JSON file as a table named "df" and show its contents.
spark.catalog.createTable("df", path, "json")
  .show(false)

/**
  * +---+----+
  * |id |name|
  * +---+----+
  * |1  |abc1|
  * |2  |abc2|
  * |3  |abc3|
  * +---+----+
  */

// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

/**
  * +----------+---------+-------+
  * |col_name  |data_type|comment|
  * +----------+---------+-------+
  * |Statistics|68 bytes |       |
  * +----------+---------+-------+
  */

// Without NOSCAN the table is scanned, so the row count is collected as well.
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

/**
  * +----------+----------------+-------+
  * |col_name  |data_type       |comment|
  * +----------+----------------+-------+
  * |Statistics|68 bytes, 3 rows|       |
  * +----------+----------------+-------+
  */
// ### Alternative 2
val df = spark.range(10)
df.createOrReplaceTempView("myView")
spark.sql("explain cost select * from myView").show(false)

/**
  * +------------------------------------------------------------------------------------+
  * |plan                                                                                |
  * +------------------------------------------------------------------------------------+
  * |== Optimized Logical Plan ==
  * Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
  *
  * == Physical Plan ==
  * *(1) Range (0, 10, step=1, splits=2)|
  * +------------------------------------------------------------------------------------+
  */
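Since the question asks about Python, here is a rough sketch of how the `sizeInBytes` figure from the EXPLAIN COST output above could be pulled out as a number. The helper names `parse_size_in_bytes` and `explain_cost_size`, the unit table, and the regular expression are assumptions made for this illustration and are not part of Spark. The exact plan text can differ between Spark versions, so treat the parsing as best effort.

```python
import re

# Multipliers for the unit suffixes that may follow sizeInBytes. Both the
# binary spellings (KiB) and the older ones (KB) are accepted. This table is
# an assumption of the sketch, not something Spark exposes.
_UNITS = {
    "B": 1,
    "KB": 1024, "KIB": 1024,
    "MB": 1024 ** 2, "MIB": 1024 ** 2,
    "GB": 1024 ** 3, "GIB": 1024 ** 3,
    "TB": 1024 ** 4, "TIB": 1024 ** 4,
    "PB": 1024 ** 5, "PIB": 1024 ** 5,
    "EB": 1024 ** 6, "EIB": 1024 ** 6,
}

# Matches e.g. "sizeInBytes=80.0 B" or "sizeInBytes=1.5 KiB".
_SIZE_RE = re.compile(r"sizeInBytes=([0-9.]+(?:E[+-]?[0-9]+)?)\s*([A-Za-z]+)")


def parse_size_in_bytes(plan_text):
    """Return the first sizeInBytes value found in EXPLAIN COST text as an int.

    Returns None when no sizeInBytes entry is found or the unit is unknown.
    """
    match = _SIZE_RE.search(plan_text)
    if match is None:
        return None
    value, unit = match.groups()
    multiplier = _UNITS.get(unit.upper())
    if multiplier is None:
        return None
    return int(float(value) * multiplier)


def explain_cost_size(spark, view_name):
    """Run EXPLAIN COST on a registered view and parse the size estimate.

    `spark` is an existing SparkSession. `view_name` is a hypothetical view
    registered beforehand with df.createOrReplaceTempView(view_name).
    """
    rows = spark.sql(f"EXPLAIN COST SELECT * FROM {view_name}").collect()
    return parse_size_in_bytes(rows[0][0])
```

With a live session, `explain_cost_size(spark, "myView")` would run the same statement as Alternative 2. The parser itself needs no Spark and can be checked against the plan text shown above.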
// ### Alternative 3
// df is the DataFrame created in Alternative 2.
println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
// 80
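Once a byte estimate is available from any of the alternatives, it can be compared against Spark's automatic broadcast threshold. The following is a minimal Python sketch of that comparison. The function name `should_broadcast` is hypothetical. The default of 10 MB mirrors the documented default of `spark.sql.autoBroadcastJoinThreshold`, and a value of -1 disables automatic broadcasting.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold: 10 MB, in bytes.
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def should_broadcast(size_in_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Return True when an estimated size fits under the broadcast threshold.

    A negative threshold (Spark uses -1) means automatic broadcasting is
    disabled, so the helper returns False. A missing estimate (None) is
    treated the same way.
    """
    if size_in_bytes is None or threshold < 0:
        return False
    return size_in_bytes <= threshold


# The 80-byte estimate printed above is far below the default threshold.
print(should_broadcast(80))
```

In PySpark the broadcast can also be requested explicitly with `from pyspark.sql.functions import broadcast` and `large_df.join(broadcast(small_df), "id")`, where `large_df`, `small_df`, and the join column `"id"` are placeholder names. Keep in mind that the size statistics are optimizer estimates, so it may be wise to leave some headroom below the threshold.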