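I have a DataFrame with a start date, an end date, and a sales target. I added code that determines the number of quarters in each date range and then uses a UDF to split the sales target across those quarters.
df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])
df.createOrReplaceTempView("df_temp")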
+----------+----------+------------+
|start_date|  end_date|sales_target|
+----------+----------+------------+
|2020-01-01|2020-12-31|          15|
|2020-04-01|2020-12-31|          11|
|2020-07-01|2020-12-31|           3|
+----------+----------+------------+
Below is the DataFrame after computing the number of quarters and splitting the sales target with the UDF.
spark.sql("select *, round(months_between(end_date, start_date)/3) as num_qtrs from df_temp").createOrReplaceTempView("df_temp")
spark.sql("select *, st_udf(cast(sales_target as integer), cast(num_qtrs as integer)) as sales_target_n from df_temp").createOrReplaceTempView("df_temp")
+----------+----------+--------+--------------+
|start_date|  end_date|num_qtrs|sales_target_n|
+----------+----------+--------+--------------+
|2020-01-01|2020-12-31|       4|     [4,4,4,3]|
|2020-04-01|2020-12-31|       3|       [4,4,3]|
|2020-07-01|2020-12-31|       2|         [2,1]|
+----------+----------+--------+--------------+
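The question does not show the body of st_udf. Below is a minimal sketch that reproduces the arrays in the table above, assuming the UDF gives every quarter the integer share and spreads the remainder over the leading quarters. The names split_target and register_st_udf are hypothetical, and the registration step is only one possible way to expose the function to Spark SQL.

```python
def split_target(total: int, num_qtrs: int) -> list:
    """Split total into num_qtrs integer parts that differ by at most 1, larger parts first."""
    if num_qtrs <= 0:
        return []
    base, extra = divmod(total, num_qtrs)
    # The first `extra` quarters absorb the remainder, one unit each.
    return [base + 1] * extra + [base] * (num_qtrs - extra)


def register_st_udf(spark):
    """Register split_target under the SQL name used in the question (assumed wiring)."""
    # Imported here so split_target can be tried out without a Spark installation.
    from pyspark.sql.types import ArrayType, IntegerType

    return spark.udf.register("st_udf", split_target, ArrayType(IntegerType()))
```

With this split, a target of 15 over 4 quarters yields three quarters of 4 and one of 3, so the parts always add up to the original target.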
After exploding sales_target_n, I get the following result:
+----------+----------+--------+------------+--------------+----------------+
|start_date|  end_date|num_qtrs|sales_target|sales_target_n|sales_target_new|
+----------+----------+--------+------------+--------------+----------------+
|2020-01-01|2020-12-31|       4|          15|     [4,4,4,3]|               4|
|2020-01-01|2020-12-31|       4|          15|     [4,4,4,3]|               4|
|2020-01-01|2020-12-31|       4|          15|     [4,4,4,3]|               4|
|2020-01-01|2020-12-31|       4|          15|     [4,4,4,3]|               3|
|2020-04-01|2020-12-31|       3|          11|       [4,4,3]|               4|
|2020-04-01|2020-12-31|       3|          11|       [4,4,3]|               4|
|2020-04-01|2020-12-31|       3|          11|       [4,4,3]|               3|
|2020-07-01|2020-12-31|       2|           3|         [2,1]|               2|
|2020-07-01|2020-12-31|       2|           3|         [2,1]|               1|
+----------+----------+--------+------------+--------------+----------------+
I need help adding a different start/end date to each row, depending on the num_qtrs value. The DataFrame I need looks like this:
+----------+----------+--------+--------------+----------------+--------------+------------+
|start_date|  end_date|num_qtrs|sales_target_n|sales_target_new|new_start_date|new_end_date|
+----------+----------+--------+--------------+----------------+--------------+------------+
|2020-01-01|2020-12-31|       4|     [4,4,4,3]|               4|    2020-01-01|  2020-03-31|
|2020-01-01|2020-12-31|       4|     [4,4,4,3]|               4|    2020-04-01|  2020-06-30|
|2020-01-01|2020-12-31|       4|     [4,4,4,3]|               4|    2020-07-01|  2020-09-30|
|2020-01-01|2020-12-31|       4|     [4,4,4,3]|               3|    2020-10-01|  2020-12-31|
|2020-04-01|2020-12-31|       3|       [4,4,3]|               4|    2020-04-01|  2020-06-30|
|2020-04-01|2020-12-31|       3|       [4,4,3]|               4|    2020-07-01|  2020-09-30|
|2020-04-01|2020-12-31|       3|       [4,4,3]|               3|    2020-10-01|  2020-12-31|
|2020-07-01|2020-12-31|       2|         [2,1]|               2|    2020-07-01|  2020-09-30|
|2020-07-01|2020-12-31|       2|         [2,1]|               1|    2020-10-01|  2020-12-31|
+----------+----------+--------+--------------+----------------+--------------+------------+
Can someone help me with a PySpark code sample that produces the desired result above?
Update on the sequence error:
Thanks
3 Answers
5rgfhyps #1
Take the DataFrame produced after the UDF is applied as the input.
You can use row_number, add_months, and date_add to get the desired output.
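A minimal sketch of that approach, assuming the input is the exploded DataFrame from the question (one row per quarter, with start_date, end_date, and sales_target_new). The names shift_months, quarter_bounds, and add_quarter_dates are hypothetical. The plain-Python helpers spell out the date arithmetic so expected values can be checked by hand, and the PySpark function applies the same arithmetic with row_number, add_months, and date_add.

```python
import calendar
from datetime import date, timedelta


def shift_months(d: date, months: int) -> date:
    """Plain-Python month shift; the day is clamped to the target month's length."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def quarter_bounds(start: date, rn: int):
    """Expected (new_start_date, new_end_date) for the rn-th quarter, rn starting at 1."""
    new_start = shift_months(start, 3 * (rn - 1))
    new_end = shift_months(start, 3 * rn) - timedelta(days=1)
    return new_start, new_end


def add_quarter_dates(df):
    """Apply the same arithmetic in PySpark to the exploded DataFrame."""
    # Imported here so the helpers above can be used without a Spark installation.
    from pyspark.sql import Window, functions as F

    # Assumptions: (start_date, end_date) identifies one original row, and the split
    # values are non-increasing, so descending order reproduces the array order.
    w = Window.partitionBy("start_date", "end_date").orderBy(
        F.col("sales_target_new").desc()
    )
    return (
        df.withColumn("rn", F.row_number().over(w))
        .withColumn(
            "new_start_date",
            F.expr("add_months(to_date(start_date), 3 * (rn - 1))"),
        )
        .withColumn(
            "new_end_date",
            F.expr("date_add(add_months(to_date(start_date), 3 * rn), -1)"),
        )
        .drop("rn")
    )
```

The end date is computed as one day before the start of the next quarter, which avoids hard-coding month lengths. If two input rows can share the same start_date and end_date, the window would need an extra key to keep them apart.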
You can adjust the window to suit your requirements.
uxhixvfz #2
Try this:
Only start_date and end_date are needed to compute new_start_date and new_end_date.
### Load the test data provided
### Compute the new start date and end date
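One possible way to compute these dates from start_date and end_date alone is Spark SQL's sequence function with a 3-month step. The sketch below is an assumption about how such a step could look, not necessarily this answer's code. The names quarter_starts and explode_quarters are hypothetical, and the input is assumed to be the DataFrame that still holds the sales_target_n array column (before any explode).

```python
import calendar
from datetime import date


def quarter_starts(start: date, end: date) -> list:
    """Plain-Python view of stepping from start in 3-month steps while staying <= end."""
    starts, k = [], 0
    while True:
        year, month0 = divmod(start.year * 12 + (start.month - 1) + 3 * k, 12)
        day = min(start.day, calendar.monthrange(year, month0 + 1)[1])
        current = date(year, month0 + 1, day)
        if current > end:
            return starts
        starts.append(current)
        k += 1


def explode_quarters(df):
    """Build one row per quarter with new_start_date, new_end_date, sales_target_new."""
    # Imported here so quarter_starts can be tried out without a Spark installation.
    from pyspark.sql import functions as F

    return (
        df.withColumn(
            "qtr_starts",
            F.expr("sequence(to_date(start_date), to_date(end_date), interval 3 months)"),
        )
        # posexplode keeps the position, which is then used to pick the matching target.
        .select("*", F.posexplode("qtr_starts").alias("idx", "new_start_date"))
        .withColumn("new_end_date", F.expr("date_sub(add_months(new_start_date, 3), 1)"))
        .withColumn("sales_target_new", F.expr("sales_target_n[idx]"))
        .drop("qtr_starts", "idx")
    )
```

Because the position from posexplode indexes directly into sales_target_n, this variant does not depend on any row ordering. It does assume start_date is not later than end_date.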
h79rfbju #3