我有一张table如下。
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+-----------------+-----------------------+------------------+------------------------+
|cust_pr_id|cust_pr_name |now_prcs_status |pr_join_dt|installation_due |installation_completed |seg_purchase_due |seg_purchase_completed |wire_in_line_due |wire_in_line_completed |
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+-----------------+-----------------------+------------------+------------------------+
|9822647220|Jonathan RM Berlin |installation |20200202 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|1562745600000 |1562761216526 | | |
|7166582305|Paola RM Berlin |seg purchase |20200903 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|1562745600000 |1562761216526 | | |
|9964201263|Roy RM Poland |installation |20201023 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|1562745600000 |1562761216526 | | |
|7288402221|Katerina RM Mia |wire in line |20201110 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|1562745600000 |1562761216526 | | |
|8424182826|Smidge RM Siberia |seg purchase |20200902 |2019-07-15 08:00:00.000000|2019-07-10 09:11:30.599000| | | | |
|4445859610|Donna RM Brazil |seg purchase |20200903 |2019-07-15 08:00:00.000000|2019-07-10 09:11:30.599000| | | | |
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+-----------------+-----------------------+------------------+------------------------+
根据这些数据,我想构建一个数据集,如下所示。这里,如果字段“now\u prcs\u status”的值是“installation”,那么我需要将“installation\u due”的值生成为“curr\u prcs\u due”,将“installation\u completed”的值生成为“curr\u prcs\u completed”。同样,如果“now\u prcs\u status”的值是“seg purchase”,我需要将“seg\u purchase\u due”的值生成为“curr\u prcs\u due”,将“seg\u purchase\u completed”的值生成为“curr\u prcs\u completed”。当“cust\u pr\u name”的值为“wire in line”时,我需要将其到期和完成的值分别填充为“curr\u prcs\u due”和“seg\u purchase\u completed”。
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+
|cust_pr_id|cust_pr_name |now_prcs_status |pr_join_dt|curr_prcs_due |curr_prcs_completed |
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+
|9822647220|Jonathan RM Berlin |installation |20200202 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|
|7166582305|Paola RM Berlin |seg purchase |20200903 |1562745600000 |1562761216526 |
|9964201263|Roy RM Poland |installation |20201023 |2019-07-11 08:00:00.000000|2019-07-10 12:20:30.132000|
|7288402221|Katerina RM Mia |wire in line |20201110 | | |
|8424182826|Smidge RM Siberia |seg purchase |20200902 | | |
|4445859610|Donna RM Brazil |seg purchase |20200903 | | |
+----------+---------------------+-----------------+----------+--------------------------+--------------------------+
以上为期望值。
我不想使用sql case语句,因为在我的实际数据集中cust\u pr\u name总共有105个不同的值,我不想最终编写105个case语句。
有人能帮我通过Pypark或hive实现这一点吗。。
谢谢您!
1条答案
按热度按时间d6kp6zgx1#
有些人应该做这项工作: