使用spark/java基于条件连接两个Dataframe

wmomyfyw  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(517)

我在spark上有3个Dataframe:dataframe1、dataframe2和dataframe3。
我想根据条件将dataframe1与其他dataframe连接起来。
我使用以下代码:

Dataset <Row> df= dataframe1.filter(when(col("diffDate").lt(3888),dataframe1.join(dataframe2,
            dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
            and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
            and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time")))).orderBy(dataframe2.col("tracking_time").desc())).
                   otherwise(dataframe1.join(dataframe3,
                   dataframe3.col("id_device").equalTo(dataframe1.col("id_device")).
                           and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                           and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time")))).orderBy(dataframe3.col("tracking_time").desc())));

但我有个例外

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset

编辑
输入Dataframe:
Dataframe1

+-----------+-------------+-------------+-------------+
| diffDate  |id_device    |id_vehicule  |tracking_time|
+-----------+-------------+-------------+-------------+
|222        |1            |5            |2020-05-30   |          
|4700       |8            |9            |2019-03-01   |
+-----------+-------------+-------------+-------------+

Dataframe2

+-----------+-------------+-------------+-------------+
|id_device  |id_vehicule  |tracking_time|longitude    |
+-----------+-------------+-------------+-------------+
|1          |5            |2020-05-12   | 33.21111    |       
|8          |9            |2019-03-01   |20.2222      |
+-----------+-------------+-------------+-------------+

Dataframe3

+-----------+-------------+-------------+-------------+
|id_device  |id_vehicule  |tracking_time|latitude     |
+-----------+-------------+-------------+-------------+
|1          |5            |2020-05-12   | 40.333      |       
|8          |9            |2019-02-28   |2.00000      |
+-----------+-------------+-------------+-------------+

当diffdate<3888时

+-----------+-------------+-------------+-------------+-----------+-------------+-------------+------------+
| diffDate  |id_device    |id_vehicule  |tracking_time|id_device  |id_vehicule  |tracking_time|longitude|
+-----------+-------------+-------------+-------------+ +-----------+-------------+-------------+-------------+
|222        |1            |5            |2020-05-30   | 1          |5            |2020-05-12   | 33.21111    |       
-----------+--------------+---------------+----------+----------+--------+-----------+--------------+-----------+

当diffdate>3888时

+-----------+-------------+-------------+-------------+-----------+-------------+-------------+------------+
| diffDate  |id_device    |id_vehicule  |tracking_time|id_device  |id_vehicule  |tracking_time|latitude|
+-----------+-------------+-------------+-------------+ +-----------+-------------+-------------+-------------+
|4700        |9            |5            |2019-03-01   | 8          |9            |2019-02-28   | 2.00000    |       
-----------+--------------+---------------+----------+----------+--------+-----------+--------------+-----------+

我需要你的帮助
谢谢您。

wydwbb8l

wydwbb8l1#

我认为你需要重新审视你的代码。
您正在尝试为 dataframe1 (当然是基于条件),这是我认为不正确的要求或误解的要求。 when(condition, then).otherwise() 函数为基础Dataframe的每一行执行,通常用于根据条件处理列。 then 以及 else/otherwise 函数中的子句只支持 literals 是dataframe基元/复杂类型和文本中的现有列。不能将Dataframe或任何输出Dataframe的操作放在那里
也许你的要求是加入 datafrmae1datafrmae2 对于其中的行 col("diffDate").lt(3888) . 要实现这一点,您可以执行以下操作-

dataframe1.join(dataframe2,
                dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
                        and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                        and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                        and(dataframe1.col("diffDate").lt(3888))
                )
                        .orderBy(dataframe2.col("tracking_time").desc())

编辑-1

dataframe1.as("a").join(dataframe2.as("b"),
                dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
                        and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                        and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                        and(dataframe1.col("diffDate").lt(3888))
        ).selectExpr("a.*", "b.longitude", "null as latitude")
                .unionByName(
                        dataframe1.as("a").join(dataframe3.as("c"),
                                dataframe3.col("id_device").equalTo(dataframe1.col("id_device")).
                                        and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                                        and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                                        and(dataframe1.col("diffDate").geq(3888))
                        ).selectExpr("a.*", "c.latitude", "null as longitude")

                )

相关问题