Do you know how to write this in PySpark?
I have two PySpark DataFrames to combine. However, there is one value I want to update where the rows share the same values in 2 key columns (test_date and student_id).
PyDf1:
+-----------+-----------+-----------+------------+
|test_date |student_id |take_home |grade |
+-----------+-----------+-----------+------------+
| 2022-09-26| 655| N| A|
| 2022-09-26| 656| N| B|
| 2022-09-26| 657| N| C|
| 2022-09-26| 658| N| D|
+-----------+-----------+-----------+------------+
PyDf2:
+-----------+-----------+-----------+------------+
|test_date |student_id |take_home |grade |
+-----------+-----------+-----------+------------+
| 2022-09-27| 655| N| D|
| 2022-09-27| 656| N| C|
| 2022-09-27| 657| N| B|
| 2022-09-27| 658| N| A|
| 2022-09-26| 655| N| B| <- Duplicate test_date & student_id, different grade
+-----------+-----------+-----------+------------+
Desired output:
+-----------+-----------+-----------+------------+
|test_date |student_id |take_home |grade |
+-----------+-----------+-----------+------------+
| 2022-09-26| 655| N| B| <- Updated to B for grade
| 2022-09-26| 656| N| B|
| 2022-09-26| 657| N| C|
| 2022-09-26| 658| N| D|
| 2022-09-27| 655| N| D|
| 2022-09-27| 656| N| C|
| 2022-09-27| 657| N| B|
| 2022-09-27| 658| N| A|
+-----------+-----------+-----------+------------+
2 Answers
kqqjbcuj1#
Use a window function. The logic and code are as follows.
i1icjdpr2#
Came up with a solution:
1. Union the two tables
2. Add an index column
3. Assign row_number values with partitionBy (a window function)
4. Filter the rows and columns