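How can I find the size (in bytes) of the longest record (row) in a dataset? I have a large DataFrame whose records (rows) vary in length, and I want to know the length of the row with the largest payload. It has millions or even billions of rows, so I am looking for an approach that is efficient and does not hurt performance. I have a DataFrameWriter as input.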
rqqzpn5f1#
Maybe this helps: `bit_length`

```scala
val df = Seq((1, 2, "hi", "hello")).toDF()

df.selectExpr("max(bit_length(concat_ws('', *)))/8 as bytes")
  .show(false)
/**
  * +-----+
  * |bytes|
  * +-----+
  * |9.0  |
  * +-----+
  */
```
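To make clear what the `bit_length(concat_ws('', *))/8` expression above measures, here is a minimal plain-Scala sketch (no Spark required) that computes the same quantity for a single row: the UTF-8 byte count of all fields rendered as strings and concatenated. The object `RowBytes` and the helper `rowBytes` are hypothetical names for illustration, not Spark APIs, and the sketch assumes only integer and string fields, whose string form is the same in Scala and Spark. This is the size of the string rendering, not Spark's in-memory or on-disk row size.

```scala
import java.nio.charset.StandardCharsets

object RowBytes {
  // Hypothetical helper: UTF-8 byte length of a row's fields, each rendered
  // as a string and concatenated with no separator. Nulls are skipped,
  // which is also how concat_ws treats them.
  def rowBytes(fields: Seq[Any]): Int =
    fields
      .filter(_ != null)
      .map(_.toString)
      .mkString
      .getBytes(StandardCharsets.UTF_8)
      .length

  def main(args: Array[String]): Unit = {
    println(rowBytes(Seq(1, 2, "hi", "hello"))) // 9, matching the answer's output
    println(rowBytes(Seq(1, 2, "hi", "héllo"))) // 10, because 'é' takes 2 bytes in UTF-8
  }
}
```

The second call shows why a byte count and a character count can differ once the data contains non-ASCII text.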
eblbsuwk2#
Check the following code.

```scala
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils

scala> val bytes = udf((length:Long) => FileUtils.byteCountToDisplaySize(length)) // To display human readable size.
bytes: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(LongType)))

scala> df.withColumn("size",length(to_json(struct($"*")))).orderBy($"size".desc).select(bytes($"size").as("size_in_bytes")).show(10,false)
+-------------+
|size_in_bytes|
+-------------+
|49 KB        |
|49 KB        |
|49 KB        |
|48 KB        |
|48 KB        |
|48 KB        |
|43 KB        |
|43 KB        |
|43 KB        |
|42 KB        |
+-------------+
only showing top 10 rows

scala> df.withColumn("size",length(to_json(struct($"*")))).orderBy($"size".desc).select($"size".as("size_in_bytes")).show(10,false) // Without UDF.
+-------------+
|size_in_bytes|
+-------------+
|50223        |
|50219        |
|50199        |
|50079        |
|50079        |
|50027        |
|44536        |
|44488        |
|44486        |
|43836        |
+-------------+
only showing top 10 rows
```
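Two points about the `length(to_json(struct($"*")))` approach above are worth noting. `length` on a string counts characters rather than bytes, and the JSON rendering includes field names and punctuation, so the figure is an approximation of the payload. If only the single largest value is needed, a `max` aggregation expresses that directly. The following is a minimal sketch under stated assumptions: the spark-sql library is on the classpath, the Spark version provides the SQL function `octet_length` (2.3 or later), and the local session, toy data, and the names `MaxJsonRowBytes` and `maxJsonBytes` are hypothetical and for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MaxJsonRowBytes {
  // Hypothetical helper: largest UTF-8 byte length of any row's JSON rendering.
  // octet_length counts bytes, whereas length would count characters.
  def maxJsonBytes(df: DataFrame): Long =
    df.selectExpr("cast(max(octet_length(to_json(struct(*)))) as long) as max_json_bytes")
      .first()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and toy data, purely for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-row-bytes")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2, "hi", "hello"), (3, 4, "héllo", "wörld")).toDF()

    // A single aggregated value comes back; no rows are ordered or collected.
    println(maxJsonBytes(df))

    spark.stop()
  }
}
```

The same `selectExpr` string can be pointed at `concat_ws('', *)` instead of `to_json(struct(*))` if the JSON overhead should be excluded.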