I have a Spark DataFrame with a struct field, as shown below.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val arrayStructData = Seq(
  Row("James", Row("Java", "XX", 120)),
  Row("Michael", Row("Java", "", 200)),
  Row("Robert", Row("Java", "XZ", null)),
  Row("Washington", Row("", "XX", 120))
)
val arrayStructSchema = new StructType()
  .add("name", StringType)
  .add("my_struct", new StructType()
    .add("name", StringType)
    .add("author", StringType)
    .add("pages", IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData), arrayStructSchema)
df.createOrReplaceTempView("df")
df.printSchema()
root
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- pages: integer (nullable = true)
df.show(false)
+----------+---------------+
|name |my_struct |
+----------+---------------+
|James |[Java, XX, 120]|
|Michael |[Java, , 200] |
|Robert |[Java, XZ,] |
|Washington|[, XX, 120] |
+----------+---------------+
I want to construct a column named final_list that shows which elements of the struct are present or missing. The problem is that in this example the struct has only 3 elements, but in my real data the struct has 1000 elements, and each record may or may not have a value for each of them.
This is how I construct the columns:
val cleaned_df = spark.sql(s"""select name, case when my_struct.name = "" then "" else "name" end as name_present
, case when my_struct.author = "" then "" else "author" end as author_present
, case when my_struct.pages = "" then "" else "pages" end as pages_present
from df""")
cleaned_df.createOrReplaceTempView("cleaned_df")
cleaned_df.show(false)
+----------+------------+--------------+-------------+
|name |name_present|author_present|pages_present|
+----------+------------+--------------+-------------+
|James |name |author |pages |
|Michael |name | |pages |
|Robert |name |author |pages |
|Washington| |author |pages |
+----------+------------+--------------+-------------+
So I wrote a case statement for each column to capture its presence or absence. Then I do the following concat to get the final output:
val final_df = spark.sql(s"""
select name, concat_ws("," , name_present, author_present, pages_present) as final_list
from cleaned_df
""")
final_df.show(false)
+----------+-----------------+
|name |final_list |
+----------+-----------------+
|James |name,author,pages|
|Michael |name,,pages |
|Robert |name,author,pages|
|Washington|,author,pages |
+----------+-----------------+
I can't write a giant case statement to cover a struct with 1000 elements. Is there a smarter way? Maybe a UDF?
I'm using Spark 2.4.3, and I don't know whether any higher-order function supports this. The schema of my real DataFrame looks like this:
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- element3: integer (nullable = true)
| |-- element4: string (nullable = true)
| |-- element5: double (nullable = true)
.....
.....
| |-- element1000: string (nullable = true)
2 Answers
aelbi1ox1#
You already mentioned a UDF. With a UDF you can iterate over all the fields of my_struct, collect a flag for each one, and print the concatenated result.
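A minimal sketch of the flag-collecting logic such a UDF could wrap. The helper name `presentFlags` is hypothetical, and unlike the hand-written SQL it treats both null and the empty string as "absent":

```scala
// Hypothetical helper: given (fieldName, value) pairs, emit the field
// name when the value is non-null and non-empty, else an empty string,
// then join all flags with commas.
def presentFlags(fields: Seq[(String, Any)]): String =
  fields.map {
    case (_, null)                   => ""
    case (_, s: String) if s.isEmpty => ""
    case (name, _)                   => name
  }.mkString(",")
```

To run this over the whole struct in Spark you could wrap it with `udf`, declaring the struct argument as a `Row` and zipping the field names from the struct's schema with `row.toSeq`, so no per-field code is needed.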
dly7yett2#
Without a UDF: read the struct's field names from the DataFrame's .schema, construct the expression from them, and apply that expression to the DataFrame.
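A sketch of that schema-driven approach: instead of hand-writing 1000 case branches, generate the same SQL expression from the struct's field names. The helper name `buildFinalListExpr` is hypothetical:

```scala
// Hypothetical helper: for each field of the struct, generate the same
// "case when ... = '' then '' else '<field>' end" branch the question
// wrote by hand, then wrap all branches in one concat_ws expression.
def buildFinalListExpr(structCol: String, fieldNames: Seq[String]): String = {
  val flags = fieldNames.map(f => s"case when $structCol.$f = '' then '' else '$f' end")
  s"concat_ws(',', ${flags.mkString(", ")}) as final_list"
}
```

The field names can be read from the schema, e.g. `df.schema("my_struct").dataType.asInstanceOf[StructType].fieldNames`, and the generated expression applied with `df.selectExpr("name", buildFinalListExpr("my_struct", fieldNames))`; this keeps the exact null/empty semantics of the hand-written query.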