I have 3 DataFrames in Spark: dataframe1, dataframe2 and dataframe3.
I want to join dataframe1 with one of the other DataFrames, depending on a condition.
I use the following code:
Dataset<Row> df = dataframe1.filter(
    when(col("diffDate").lt(3888),
        dataframe1.join(dataframe2,
            dataframe2.col("id_device").equalTo(dataframe1.col("id_device"))
                .and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
                .and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))))
            .orderBy(dataframe2.col("tracking_time").desc()))
    .otherwise(
        dataframe1.join(dataframe3,
            dataframe3.col("id_device").equalTo(dataframe1.col("id_device"))
                .and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule")))
                .and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time"))))
            .orderBy(dataframe3.col("tracking_time").desc())));
But I get an exception:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset
EDIT
Input DataFrames:
Dataframe1
+-----------+-------------+-------------+-------------+
| diffDate |id_device |id_vehicule |tracking_time|
+-----------+-------------+-------------+-------------+
|222 |1 |5 |2020-05-30 |
|4700 |8 |9 |2019-03-01 |
+-----------+-------------+-------------+-------------+
Dataframe2
+-----------+-------------+-------------+-------------+
|id_device |id_vehicule |tracking_time|longitude |
+-----------+-------------+-------------+-------------+
|1 |5 |2020-05-12 | 33.21111 |
|8 |9 |2019-03-01 |20.2222 |
+-----------+-------------+-------------+-------------+
Dataframe3
+-----------+-------------+-------------+-------------+
|id_device |id_vehicule |tracking_time|latitude |
+-----------+-------------+-------------+-------------+
|1 |5 |2020-05-12 | 40.333 |
|8 |9 |2019-02-28 |2.00000 |
+-----------+-------------+-------------+-------------+
When diffDate < 3888, the expected output is:
+---------+----------+------------+-------------+----------+------------+-------------+---------+
|diffDate |id_device |id_vehicule |tracking_time|id_device |id_vehicule |tracking_time|longitude|
+---------+----------+------------+-------------+----------+------------+-------------+---------+
|222      |1         |5           |2020-05-30   |1         |5           |2020-05-12   |33.21111 |
+---------+----------+------------+-------------+----------+------------+-------------+---------+
When diffDate > 3888, the expected output is:
+---------+----------+------------+-------------+----------+------------+-------------+---------+
|diffDate |id_device |id_vehicule |tracking_time|id_device |id_vehicule |tracking_time|latitude |
+---------+----------+------------+-------------+----------+------------+-------------+---------+
|4700     |8         |9           |2019-03-01   |8         |9           |2019-02-28   |2.00000  |
+---------+----------+------------+-------------+----------+------------+-------------+---------+
I need your help. Thank you.
1 Answer
I think you need to revisit your code. You are trying to pick a join for each row of dataframe1 (based on a condition, of course), which I believe is either an incorrect requirement or a misunderstood one. The when(condition, then).otherwise(else) expression is evaluated for every row of the base DataFrame and is normally used to derive a column value from a condition. The then and else/otherwise clauses only accept literals of the DataFrame primitive/complex types and existing columns of the DataFrame; you cannot put a DataFrame, or any operation that returns a DataFrame, in there.
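For illustration, here is a minimal sketch of the usual way when/otherwise is applied, deriving a column value rather than a DataFrame (the output column name "range" and the labels "near"/"far" are made up for this example):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// when/otherwise is evaluated per row and returns a Column, not a DataFrame.
// Both branches hold literals here; existing columns would also be allowed.
Dataset<Row> tagged = dataframe1.withColumn(
        "range",
        when(col("diffDate").lt(3888), lit("near"))
                .otherwise(lit("far")));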
Perhaps what you actually want is to join dataframe1 with dataframe2 only for the rows where col("diffDate").lt(3888) holds. To achieve this, you can filter first and then join each subset separately, as sketched below.