pyspark: How do I append/merge two Spark DataFrames?

goucqfw6  posted on 2023-03-28 in Spark

I am new to Spark and I want to merge two DataFrames into a single table with some grouping logic. I tried using collect_list and collect_set, but did not get the correct output.

Parent table
+-------------------+------+--------------------+
|      individual_id|  name|                 age|
+-------------------+------+--------------------+
|1.00000000000000000|vishal|30.00000000000000000|
|1.00000000000000000|vishal|30.00000000000000000|
+-------------------+------+--------------------+

And the child table:

+-------------------+--------------+-----------------------+
|           order_id|sum_item_price|sales_order_product_dlm|
+-------------------+--------------+-----------------------+
|1.00000000000000000|       1500.00|   [{2.0000000000000...|
+-------------------+--------------+-----------------------+

If I convert the child DataFrame into a single-column DataFrame whose value is each row as a JSON string:

# in PySpark, toJSON() returns an RDD of JSON strings, so wrap it back into a DataFrame
spark.createDataFrame(child_dataframe.toJSON(), StringType()).show()
+--------------------+
|               value|
+--------------------+
|{"order_id":1.000...|
+--------------------+

Expected output: I want to merge the child JSON value column into the parent DataFrame with a group by on the individual_id column, so the output would look like this:

+-------------------+------+--------------------+--------------------+
|      individual_id|  name|                 age|               value|
+-------------------+------+--------------------+--------------------+
|1.00000000000000000|vishal|30.00000000000000000|{"order_id":1.000...|
+-------------------+------+--------------------+--------------------+

Both the child DataFrame and the parent DataFrame belong to the same schema.
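
For anyone who wants to reproduce this, here is a minimal setup sketch that builds the two sample DataFrames shown above. The exact decimal types are assumptions inferred from the displayed values, and the nested sales_order_product_dlm column is truncated in the post, so it is left out:

from decimal import Decimal
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parent table: two duplicate rows, as shown above
parent_df = spark.createDataFrame(
    [(Decimal("1"), "vishal", Decimal("30")),
     (Decimal("1"), "vishal", Decimal("30"))],
    "individual_id decimal(38,17), name string, age decimal(38,17)",
)

# Child table: sales_order_product_dlm is truncated in the post and omitted here
child_df = spark.createDataFrame(
    [(Decimal("1"), Decimal("1500.00"))],
    "order_id decimal(38,17), sum_item_price decimal(10,2)",
)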


2vuwiymt1#

Import the necessary packages, flatten the child DataFrame to a JSON string column, extract the join key, and join it to the deduplicated parent:

from pyspark.sql.types import StringType
from pyspark.sql.functions import col, get_json_object

# Flatten each child row into a single JSON string column named "value"
child_df = spark.createDataFrame(child_df.toJSON(), StringType())
# Pull order_id out of the JSON so it can serve as the join key
child_df = child_df.withColumn("order_id", get_json_object("value", "$.order_id"))

# Deduplicate the parent, join on individual_id == order_id, then drop the helper key
parent_df.drop_duplicates() \
    .join(child_df, col("individual_id") == col("order_id"), "inner") \
    .drop("order_id") \
    .show()

Output:

+-------------+------+----+--------------------+
|individual_id|  name| age|               value|
+-------------+------+----+--------------------+
|          1.0|vishal|30.0|{"order_id":1.0,"...|
+-------------+------+----+--------------------+
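
The inner join above produces one row per matching order. If an individual can have several orders and you want the group-by behavior described in the question, one option is to aggregate the JSON strings with collect_list after the join. This is a sketch under the same column names as above, not something from the original answer:

from pyspark.sql.functions import col, collect_list

# One row per individual, with all matching order JSON strings gathered into a list
merged = (
    parent_df.drop_duplicates()
    .join(child_df, col("individual_id") == col("order_id"), "inner")
    .drop("order_id")
    .groupBy("individual_id", "name", "age")
    .agg(collect_list("value").alias("value"))
)
merged.show(truncate=False)

One caveat on the join condition: get_json_object returns a string, so comparing it against the decimal individual_id relies on Spark's implicit casting. Casting the extracted order_id explicitly to the parent's type before joining makes the condition less fragile.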
