scala: How to efficiently get the maximum record size of a DataFrame

Posted by wz1wpwve on 2021-05-27 in Spark


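How can I find the size (in bytes) of the longest record (row) in a dataset?
I have a large DataFrame whose records (rows) vary in length, and I want to know the length of the row with the largest payload.
It has millions or even billions of rows, so I am looking for an approach that is efficient and does not hurt performance.
I have a DataFrameWriter as input.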

scala apache-spark bigdata

Source: https://stackoverflow.com/questions/63295804/how-to-get-the-max-length-of-the-record-size-of-a-dataframe-effectively

2 Answers


rqqzpn5f #1

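Perhaps this helps: `bit_length` returns the length of a string in bits, so dividing by 8 gives bytes.

```
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val df = Seq((1, 2, "hi", "hello")).toDF()

// Concatenate all columns of each row, measure the result in bits,
// take the maximum over all rows, and convert bits to bytes.
df.selectExpr("max(bit_length(concat_ws('', *)))/8 as bytes")
  .show(false)
/**
  * +-----+
  * |bytes|
  * +-----+
  * |9.0  |
  * +-----+
  */
```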
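The result above is a byte count, not a character count, and the two differ once a row contains non-ASCII text. A minimal plain-Scala sketch of that distinction, assuming strings are measured in UTF-8 (the object `ByteLengthDemo` and the helper `utf8Bytes` are hypothetical names used only for illustration):

```scala
import java.nio.charset.StandardCharsets

object ByteLengthDemo {
  // Size of a string once encoded as UTF-8, in bytes.
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  def main(args: Array[String]): Unit = {
    val ascii = "12hihello"  // the example row's columns concatenated with no separator
    val mixed = "h\u00e9llo" // contains one non-ASCII character (U+00E9)

    // For pure ASCII, characters and bytes coincide.
    println(s"ascii: ${ascii.length} chars, ${utf8Bytes(ascii)} bytes")
    // A non-ASCII character takes more than one byte in UTF-8.
    println(s"mixed: ${mixed.length} chars, ${utf8Bytes(mixed)} bytes")
  }
}
```

Note that this measures the string representation of the values, which is only an approximation of how much space a row occupies in memory or on disk.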

Answered 2021-05-27

eblbsuwk #2

Check the following code.

```
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = udf((length:Long) => FileUtils.byteCountToDisplaySize(length)) // To display human readable size.
bytes: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(LongType)))
scala> df
  .withColumn("size",length(to_json(struct($"*"))))
  .orderBy($"size".desc)
  .select(bytes($"size").as("size_in_bytes"))
  .show(10,false)
+-------------+
|size_in_bytes|
+-------------+
|49 KB        |
|49 KB        |
|49 KB        |
|48 KB        |
|48 KB        |
|48 KB        |
|43 KB        |
|43 KB        |
|43 KB        |
|42 KB        |
+-------------+
only showing top 10 rows
scala> df
  .withColumn("size",length(to_json(struct($"*"))))
  .orderBy($"size".desc)
  .select($"size".as("size_in_bytes"))
  .show(10,false)// Without UDF.
+-------------+
|size_in_bytes|
+-------------+
|50223        |
|50219        |
|50199        |
|50079        |
|50079        |
|50027        |
|44536        |
|44488        |
|44486        |
|43836        |
+-------------+
only showing top 10 rows
```
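Sorting the whole DataFrame only to read the top rows is costly at millions or billions of rows, and `length` counts characters rather than bytes. A minimal sketch of a variation, assuming the JSON-serialized size is an acceptable proxy for record size and that the Spark version in use provides the `octet_length` SQL function: compute the maximum with a single aggregation. The names `MaxRowSize` and `maxRowBytes` and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MaxRowSize {
  // Largest row size in bytes, measured on each row's JSON representation.
  // Returns None for an empty DataFrame, where max() yields null.
  def maxRowBytes(df: DataFrame): Option[Int] = {
    val row = df
      .selectExpr("max(octet_length(to_json(struct(*)))) AS max_bytes")
      .first()
    if (row.isNullAt(0)) None else Some(row.getInt(0))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-row-size")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real DataFrame.
    val df = Seq((1, "hi"), (2, "hello world")).toDF("id", "text")

    println(maxRowBytes(df))
    spark.stop()
  }
}
```

Because `max` is a plain aggregation, Spark can compute partial maxima per partition and combine them, so no global sort of the data is required.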


Answered 2021-05-27
