pyspark: split date ranges into the specific week numbers of another DataFrame in PySpark

snvhrwxg · posted 2022-11-21 in Spark

I have PySpark DataFrames like this:
df1:

+--------------------+----------+----------+----------+----------+----------+------+-----------------+--------+
|            NAME    |  X_NAME  | BEGIN    |   END    |         A|         B|     C|                D|       E|
+--------------------+----------+----------+----------+----------+----------+------+-----------------+--------+
|whatever1           |       XYZ|2021-09-27|2021-10-03|       0.0|       1.0|   0.0|              0.0|     0.0|
|whatever2           |       XYZ|2021-09-27|2021-10-03|       0.0|       1.0|   0.0|              0.0|     0.0|
|whatever3           |       XYZ|2021-10-04|2021-10-10|       0.0|       1.0|   0.0|              0.0|     0.0|
|whatever4           |       XYZ|2021-10-04|2021-10-10|       0.0|       1.0|   0.0|              0.0|     0.0|
|whatever6           |       XYZ|2021-10-18|2021-10-24|       0.0|       0.0|   1.0|              0.0|     0.0|
|whatever9           |       XYZ|2021-10-25|2021-10-31|       0.0|       1.0|   0.0|              0.0|     0.0|
...
...
...

df2:

+-------------------+-----+----+-------+
|      start_of_week|month|year|week_no|
+-------------------+-----+----+-------+
|2021-12-06 00:00:00|   12|2021|2021W49|
|2021-12-13 00:00:00|   12|2021|2021W50|
|2021-12-20 00:00:00|   12|2021|2021W51|
|2021-12-27 00:00:00|   12|2021|2021W52|
|2022-01-03 00:00:00|    1|2022| 2022W1|
|2022-01-10 00:00:00|    1|2022| 2022W2|
|2022-01-17 00:00:00|    1|2022| 2022W3|
|2022-01-24 00:00:00|    1|2022| 2022W4|
|2022-01-31 00:00:00|    2|2022| 2022W5|
|2022-02-07 00:00:00|    2|2022| 2022W6|
|2020-11-16 00:00:00|   11|2020|2020W47|
|2020-11-23 00:00:00|   11|2020|2020W48|
|2020-11-30 00:00:00|   12|2020|2020W49|
|2020-12-07 00:00:00|   12|2020|2020W50|
|2020-12-14 00:00:00|   12|2020|2020W51|
|2020-12-21 00:00:00|   12|2020|2020W52|
|2020-12-28 00:00:00|   12|2020|2020W53|
|2021-01-04 00:00:00|    1|2021| 2021W1|
|2021-01-11 00:00:00|    1|2021| 2021W2|
|2020-07-06 00:00:00|    7|2020|2020W28|
|2020-07-13 00:00:00|    7|2020|2020W29|
|2020-07-20 00:00:00|    7|2020|2020W30|
|2020-07-27 00:00:00|    7|2020|2020W31|
|2020-08-03 00:00:00|    8|2020|2020W32|
|2020-08-10 00:00:00|    8|2020|2020W33|
|2020-08-17 00:00:00|    8|2020|2020W34|
|2020-08-24 00:00:00|    8|2020|2020W35|
|2020-08-31 00:00:00|    9|2020|2020W36|
|2020-09-07 00:00:00|    9|2020|2020W37|
|2021-03-22 00:00:00|    3|2021|2021W12|
|2021-03-29 00:00:00|    4|2021|2021W13|
|2021-04-05 00:00:00|    4|2021|2021W14|
|2021-04-12 00:00:00|    4|2021|2021W15|
|2021-04-19 00:00:00|    4|2021|2021W16|
|2021-04-26 00:00:00|    4|2021|2021W17|
|2021-05-03 00:00:00|    5|2021|2021W18|
|2021-05-10 00:00:00|    5|2021|2021W19|
|2021-05-17 00:00:00|    5|2021|2021W20|
|2021-05-24 00:00:00|    5|2021|2021W21|
|2022-08-22 00:00:00|    8|2022|2022W34|
|2022-08-29 00:00:00|    9|2022|2022W35|
|2022-09-05 00:00:00|    9|2022|2022W36|
|2022-09-12 00:00:00|    9|2022|2022W37|
|2022-09-19 00:00:00|    9|2022|2022W38|
|2022-09-26 00:00:00|    9|2022|2022W39|
|2022-10-03 00:00:00|   10|2022|2022W40|
|2022-10-10 00:00:00|   10|2022|2022W41|
|2022-10-17 00:00:00|   10|2022|2022W42|
|2022-10-24 00:00:00|   10|2022|2022W43|
|2020-09-14 00:00:00|    9|2020|2020W38|
|2020-09-21 00:00:00|    9|2020|2020W39|
|2020-09-28 00:00:00|   10|2020|2020W40|
|2020-10-05 00:00:00|   10|2020|2020W41|
|2020-10-12 00:00:00|   10|2020|2020W42|
|2020-10-19 00:00:00|   10|2020|2020W43|
|2020-10-26 00:00:00|   10|2020|2020W44|
|2020-11-02 00:00:00|   11|2020|2020W45|
|2020-11-09 00:00:00|   11|2020|2020W46|
|2020-05-04 00:00:00|    5|2020|2020W19|
|2020-05-11 00:00:00|    5|2020|2020W20|
|2020-05-18 00:00:00|    5|2020|2020W21|
|2020-05-25 00:00:00|    5|2020|2020W22|
|2020-06-01 00:00:00|    6|2020|2020W23|
|2020-06-08 00:00:00|    6|2020|2020W24|
|2020-06-15 00:00:00|    6|2020|2020W25|
|2020-06-22 00:00:00|    6|2020|2020W26|
|2020-06-29 00:00:00|    7|2020|2020W27|
|2021-10-04 00:00:00|   10|2021|2021W40|
|2021-10-11 00:00:00|   10|2021|2021W41|
|2021-10-18 00:00:00|   10|2021|2021W42|
|2021-10-25 00:00:00|   10|2021|2021W43|
|2021-11-01 00:00:00|   11|2021|2021W44|
|2021-11-08 00:00:00|   11|2021|2021W45|
|2021-11-15 00:00:00|   11|2021|2021W46|
|2021-11-22 00:00:00|   11|2021|2021W47|
|2021-11-29 00:00:00|   12|2021|2021W48|
|2022-02-14 00:00:00|    2|2022| 2022W7|
|2022-02-21 00:00:00|    2|2022| 2022W8|
|2022-02-28 00:00:00|    3|2022| 2022W9|
|2022-03-07 00:00:00|    3|2022|2022W10|
|2022-03-14 00:00:00|    3|2022|2022W11|
|2022-03-21 00:00:00|    3|2022|2022W12|
|2022-03-28 00:00:00|    3|2022|2022W13|
|2022-04-04 00:00:00|    4|2022|2022W14|
|2022-04-11 00:00:00|    4|2022|2022W15|
|2022-04-18 00:00:00|    4|2022|2022W16|
|2022-04-25 00:00:00|    4|2022|2022W17|
|2022-05-02 00:00:00|    5|2022|2022W18|
|2022-05-09 00:00:00|    5|2022|2022W19|
|2022-05-16 00:00:00|    5|2022|2022W20|
|2022-05-23 00:00:00|    5|2022|2022W21|
|2022-05-30 00:00:00|    6|2022|2022W22|
|2022-06-06 00:00:00|    6|2022|2022W23|
|2022-06-13 00:00:00|    6|2022|2022W24|
|2022-06-20 00:00:00|    6|2022|2022W25|
|2022-06-27 00:00:00|    6|2022|2022W26|
|2022-07-04 00:00:00|    7|2022|2022W27|
|2022-07-11 00:00:00|    7|2022|2022W28|
|2022-07-18 00:00:00|    7|2022|2022W29|
|2022-07-25 00:00:00|    7|2022|2022W30|
|2022-08-01 00:00:00|    8|2022|2022W31|
|2022-08-08 00:00:00|    8|2022|2022W32|
|2022-08-15 00:00:00|    8|2022|2022W33|
|2021-01-18 00:00:00|    1|2021| 2021W3|
|2021-01-25 00:00:00|    1|2021| 2021W4|
|2021-02-01 00:00:00|    2|2021| 2021W5|
|2021-02-08 00:00:00|    2|2021| 2021W6|
|2021-02-15 00:00:00|    2|2021| 2021W7|
|2021-02-22 00:00:00|    2|2021| 2021W8|
|2021-03-01 00:00:00|    3|2021| 2021W9|
|2021-03-08 00:00:00|    3|2021|2021W10|
|2021-03-15 00:00:00|    3|2021|2021W11|
|2020-03-02 00:00:00|    3|2020|2020W10|
|2020-03-09 00:00:00|    3|2020|2020W11|
|2020-03-16 00:00:00|    3|2020|2020W12|
|2020-03-23 00:00:00|    3|2020|2020W13|
|2020-03-30 00:00:00|    4|2020|2020W14|
|2020-04-06 00:00:00|    4|2020|2020W15|
|2020-04-13 00:00:00|    4|2020|2020W16|
|2020-04-20 00:00:00|    4|2020|2020W17|
|2020-04-27 00:00:00|    4|2020|2020W18|
|2021-05-31 00:00:00|    6|2021|2021W22|
|2021-06-07 00:00:00|    6|2021|2021W23|
|2021-06-14 00:00:00|    6|2021|2021W24|
|2021-06-21 00:00:00|    6|2021|2021W25|
|2021-06-28 00:00:00|    7|2021|2021W26|
|2021-07-05 00:00:00|    7|2021|2021W27|
|2021-07-12 00:00:00|    7|2021|2021W28|
|2021-07-19 00:00:00|    7|2021|2021W29|
|2021-07-26 00:00:00|    7|2021|2021W30|
|2021-08-02 00:00:00|    8|2021|2021W31|
|2021-08-09 00:00:00|    8|2021|2021W32|
|2021-08-16 00:00:00|    8|2021|2021W33|
|2021-08-23 00:00:00|    8|2021|2021W34|
|2021-08-30 00:00:00|    9|2021|2021W35|
|2021-09-06 00:00:00|    9|2021|2021W36|
|2021-09-13 00:00:00|    9|2021|2021W37|
|2021-09-20 00:00:00|    9|2021|2021W38|
|2021-09-27 00:00:00|    9|2021|2021W39|
|2019-12-30 00:00:00|    1|2020| 2020W1|
|2020-01-06 00:00:00|    1|2020| 2020W2|
|2020-01-13 00:00:00|    1|2020| 2020W3|
|2020-01-20 00:00:00|    1|2020| 2020W4|
|2020-01-27 00:00:00|    1|2020| 2020W5|
|2020-02-03 00:00:00|    2|2020| 2020W6|
|2020-02-10 00:00:00|    2|2020| 2020W7|
|2020-02-17 00:00:00|    2|2020| 2020W8|
|2020-02-24 00:00:00|    2|2020| 2020W9|
+-------------------+-----+----+-------+

I would like to split these BEGIN to END ranges into smaller units, namely the week numbers from the second DataFrame. The final DataFrame should therefore have a week_no column instead of BEGIN and END. If a range covers more than one week, the record is duplicated so that it carries several week numbers.
For example:

+--------------------+----------+----------+----------+----------+----------+------+-----------------+--------+
|            NAME    |  X_NAME  | BEGIN    |   END    |         A|         B|     C|                D|       E|
+--------------------+----------+----------+----------+----------+----------+------+-----------------+--------+
|whatever345         |       XYZ|2021-12-07|2021-12-14|       0.0|       1.0|   0.0|              0.0|     0.0|

would become:

+--------------------+----------+----------+----------+----------+------+-----------------+--------+
|            NAME    |  X_NAME  | week_no  |         A|         B|     C|                D|       E|
+--------------------+----------+----------+----------+----------+------+-----------------+--------+
|whatever345         |       XYZ|   2021W49|       0.0|       1.0|   0.0|              0.0|     0.0|
|whatever345         |       XYZ|   2021W50|       0.0|       1.0|   0.0|              0.0|     0.0|

lsmd5eda 1#

I have already answered this question using the standard week-number and year functions, but your DataFrame seems to follow a different logic. I will answer it in this separate post, using your DataFrame as the reference.

The logic is to move the BEGIN date back to the previous Monday and the END date forward to the next Sunday, then build the list of all Mondays between BEGIN and END, and finally join those Mondays with start_of_week from df2:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def week_range(begin, end):
    import datetime
    import math
    # snap BEGIN back to its Monday and END forward to its Sunday
    begin_dt = datetime.datetime.strptime(begin, "%Y-%m-%d").date()
    begin_dt = begin_dt - datetime.timedelta(days=begin_dt.weekday())
    end_dt = datetime.datetime.strptime(end, "%Y-%m-%d").date()
    end_dt = end_dt + datetime.timedelta(days=6 - end_dt.weekday())
    # return one entry per Monday between the adjusted BEGIN and END
    return [(begin_dt + datetime.timedelta(days=i * 7)).strftime("%Y-%m-%d")
            for i in range(0, math.floor((end_dt - begin_dt).days / 7) + 1)]

# build the Monday list and explode it into one row per week
df = df.withColumn("start_of_week_list", week_range("BEGIN", "END"))
df = df.withColumn("start_of_week_list", F.explode("start_of_week_list"))

# normalize df2's start_of_week to a yyyy-MM-dd string and join on it
df2 = df2.withColumn("start_of_week", F.to_date("start_of_week", format="yyyy-MM-dd HH:mm:ss"))
df2 = df2.withColumn("start_of_week", F.date_format("start_of_week", format="yyyy-MM-dd"))
df = df.join(df2, df.start_of_week_list == df2.start_of_week, how="inner")
df = df.drop("BEGIN", "END", "start_of_week_list", "start_of_week", "month", "year")

[Out]:
+-----------+------+---+---+---+---+---+-------+
|NAME       |X_NAME|A  |B  |C  |D  |E  |week_no|
+-----------+------+---+---+---+---+---+-------+
|whatever9  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W43|
|whatever345|XYZ   |0.0|1.0|0.0|0.0|0.0|2021W49|
|whatever345|XYZ   |0.0|1.0|0.0|0.0|0.0|2021W50|
|whatever3  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever4  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever1  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever2  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever6  |XYZ   |0.0|0.0|1.0|0.0|0.0|2021W42|
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2020W52|
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2021W1 |
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2020W53|
+-----------+------+---+---+---+---+---+-------+
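As a side note, the same list of Mondays can also be built without a Python UDF, using Spark's built-in date_trunc and sequence functions. A minimal sketch, assuming Spark 2.4+ and starting again from the original input df (before any columns are dropped):

import pyspark.sql.functions as F

df_alt = (
    df
    # date_trunc("week", ...) snaps a timestamp back to the Monday of its week
    .withColumn("begin_monday", F.date_trunc("week", F.to_timestamp("BEGIN")))
    .withColumn("end_monday", F.date_trunc("week", F.to_timestamp("END")))
    # sequence() then yields one timestamp per Monday between the two bounds
    .withColumn("start_of_week_list",
                F.expr("sequence(begin_monday, end_monday, interval 7 days)"))
    .withColumn("start_of_week_list", F.explode("start_of_week_list"))
)
# start_of_week_list can then be joined against to_timestamp(df2.start_of_week)
# to pick up week_no, just like in the UDF version above.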

DataFrames used:

df = spark.createDataFrame(data=[
      ["whatever1","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
      ["whatever2","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
      ["whatever3","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
      ["whatever4","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
      ["whatever6","XYZ","2021-10-18","2021-10-24",0.0,0.0,1.0,0.0,0.0],
      ["whatever9","XYZ","2021-10-25","2021-10-31",0.0,1.0,0.0,0.0,0.0],
      ["whatever100","XYZ","2020-12-21","2021-01-10",0.9,1.9,0.9,0.9,0.9],
      ["whatever345","XYZ","2021-12-07","2021-12-14",0.0,1.0,0.0,0.0,0.0],
    ], schema=["NAME","X_NAME","BEGIN","END","A","B","C","D","E"])

df2 = spark.createDataFrame(data=[
  ["2021-12-06 00:00:00",12,2021,"2021W49"],
  ["2021-12-13 00:00:00",12,2021,"2021W50"],
  ["2021-12-20 00:00:00",12,2021,"2021W51"],
  ["2021-12-27 00:00:00",12,2021,"2021W52"],
  ["2022-01-03 00:00:00",1,2022, "2022W1"],
  ["2022-01-10 00:00:00",1,2022, "2022W2"],
  ["2022-01-17 00:00:00",1,2022, "2022W3"],
  ["2022-01-24 00:00:00",1,2022, "2022W4"],
  ["2022-01-31 00:00:00",2,2022, "2022W5"],
  ["2022-02-07 00:00:00",2,2022, "2022W6"],
  ["2020-11-16 00:00:00",11,2020,"2020W47"],
  ["2020-11-23 00:00:00",11,2020,"2020W48"],
  ["2020-11-30 00:00:00",12,2020,"2020W49"],
  ["2020-12-07 00:00:00",12,2020,"2020W50"],
  ["2020-12-14 00:00:00",12,2020,"2020W51"],
  ["2020-12-21 00:00:00",12,2020,"2020W52"],
  ["2020-12-28 00:00:00",12,2020,"2020W53"],
  ["2021-01-04 00:00:00",1,2021, "2021W1"],
  ["2021-01-11 00:00:00",1,2021, "2021W2"],
  ["2020-07-06 00:00:00",7,2020,"2020W28"],
  ["2020-07-13 00:00:00",7,2020,"2020W29"],
  ["2020-07-20 00:00:00",7,2020,"2020W30"],
  ["2020-07-27 00:00:00",7,2020,"2020W31"],
  ["2020-08-03 00:00:00",8,2020,"2020W32"],
  ["2020-08-10 00:00:00",8,2020,"2020W33"],
  ["2020-08-17 00:00:00",8,2020,"2020W34"],
  ["2020-08-24 00:00:00",8,2020,"2020W35"],
  ["2020-08-31 00:00:00",9,2020,"2020W36"],
  ["2020-09-07 00:00:00",9,2020,"2020W37"],
  ["2021-03-22 00:00:00",3,2021,"2021W12"],
  ["2021-03-29 00:00:00",4,2021,"2021W13"],
  ["2021-04-05 00:00:00",4,2021,"2021W14"],
  ["2021-04-12 00:00:00",4,2021,"2021W15"],
  ["2021-04-19 00:00:00",4,2021,"2021W16"],
  ["2021-04-26 00:00:00",4,2021,"2021W17"],
  ["2021-05-03 00:00:00",5,2021,"2021W18"],
  ["2021-05-10 00:00:00",5,2021,"2021W19"],
  ["2021-05-17 00:00:00",5,2021,"2021W20"],
  ["2021-05-24 00:00:00",5,2021,"2021W21"],
  ["2022-08-22 00:00:00",8,2022,"2022W34"],
  ["2022-08-29 00:00:00",9,2022,"2022W35"],
  ["2022-09-05 00:00:00",9,2022,"2022W36"],
  ["2022-09-12 00:00:00",9,2022,"2022W37"],
  ["2022-09-19 00:00:00",9,2022,"2022W38"],
  ["2022-09-26 00:00:00",9,2022,"2022W39"],
  ["2022-10-03 00:00:00",10,2022,"2022W40"],
  ["2022-10-10 00:00:00",10,2022,"2022W41"],
  ["2022-10-17 00:00:00",10,2022,"2022W42"],
  ["2022-10-24 00:00:00",10,2022,"2022W43"],
  ["2020-09-14 00:00:00",9,2020,"2020W38"],
  ["2020-09-21 00:00:00",9,2020,"2020W39"],
  ["2020-09-28 00:00:00",10,2020,"2020W40"],
  ["2020-10-05 00:00:00",10,2020,"2020W41"],
  ["2020-10-12 00:00:00",10,2020,"2020W42"],
  ["2020-10-19 00:00:00",10,2020,"2020W43"],
  ["2020-10-26 00:00:00",10,2020,"2020W44"],
  ["2020-11-02 00:00:00",11,2020,"2020W45"],
  ["2020-11-09 00:00:00",11,2020,"2020W46"],
  ["2020-05-04 00:00:00",5,2020,"2020W19"],
  ["2020-05-11 00:00:00",5,2020,"2020W20"],
  ["2020-05-18 00:00:00",5,2020,"2020W21"],
  ["2020-05-25 00:00:00",5,2020,"2020W22"],
  ["2020-06-01 00:00:00",6,2020,"2020W23"],
  ["2020-06-08 00:00:00",6,2020,"2020W24"],
  ["2020-06-15 00:00:00",6,2020,"2020W25"],
  ["2020-06-22 00:00:00",6,2020,"2020W26"],
  ["2020-06-29 00:00:00",7,2020,"2020W27"],
  ["2021-10-04 00:00:00",10,2021,"2021W40"],
  ["2021-10-11 00:00:00",10,2021,"2021W41"],
  ["2021-10-18 00:00:00",10,2021,"2021W42"],
  ["2021-10-25 00:00:00",10,2021,"2021W43"],
  ["2021-11-01 00:00:00",11,2021,"2021W44"],
  ["2021-11-08 00:00:00",11,2021,"2021W45"],
  ["2021-11-15 00:00:00",11,2021,"2021W46"],
  ["2021-11-22 00:00:00",11,2021,"2021W47"],
  ["2021-11-29 00:00:00",12,2021,"2021W48"],
  ["2022-02-14 00:00:00",2,2022, "2022W7"],
  ["2022-02-21 00:00:00",2,2022, "2022W8"],
  ["2022-02-28 00:00:00",3,2022, "2022W9"],
  ["2022-03-07 00:00:00",3,2022,"2022W10"],
  ["2022-03-14 00:00:00",3,2022,"2022W11"],
  ["2022-03-21 00:00:00",3,2022,"2022W12"],
  ["2022-03-28 00:00:00",3,2022,"2022W13"],
  ["2022-04-04 00:00:00",4,2022,"2022W14"],
  ["2022-04-11 00:00:00",4,2022,"2022W15"],
  ["2022-04-18 00:00:00",4,2022,"2022W16"],
  ["2022-04-25 00:00:00",4,2022,"2022W17"],
  ["2022-05-02 00:00:00",5,2022,"2022W18"],
  ["2022-05-09 00:00:00",5,2022,"2022W19"],
  ["2022-05-16 00:00:00",5,2022,"2022W20"],
  ["2022-05-23 00:00:00",5,2022,"2022W21"],
  ["2022-05-30 00:00:00",6,2022,"2022W22"],
  ["2022-06-06 00:00:00",6,2022,"2022W23"],
  ["2022-06-13 00:00:00",6,2022,"2022W24"],
  ["2022-06-20 00:00:00",6,2022,"2022W25"],
  ["2022-06-27 00:00:00",6,2022,"2022W26"],
  ["2022-07-04 00:00:00",7,2022,"2022W27"],
  ["2022-07-11 00:00:00",7,2022,"2022W28"],
  ["2022-07-18 00:00:00",7,2022,"2022W29"],
  ["2022-07-25 00:00:00",7,2022,"2022W30"],
  ["2022-08-01 00:00:00",8,2022,"2022W31"],
  ["2022-08-08 00:00:00",8,2022,"2022W32"],
  ["2022-08-15 00:00:00",8,2022,"2022W33"],
  ["2021-01-18 00:00:00",1,2021, "2021W3"],
  ["2021-01-25 00:00:00",1,2021, "2021W4"],
  ["2021-02-01 00:00:00",2,2021, "2021W5"],
  ["2021-02-08 00:00:00",2,2021, "2021W6"],
  ["2021-02-15 00:00:00",2,2021, "2021W7"],
  ["2021-02-22 00:00:00",2,2021, "2021W8"],
  ["2021-03-01 00:00:00",3,2021, "2021W9"],
  ["2021-03-08 00:00:00",3,2021,"2021W10"],
  ["2021-03-15 00:00:00",3,2021,"2021W11"],
  ["2020-03-02 00:00:00",3,2020,"2020W10"],
  ["2020-03-09 00:00:00",3,2020,"2020W11"],
  ["2020-03-16 00:00:00",3,2020,"2020W12"],
  ["2020-03-23 00:00:00",3,2020,"2020W13"],
  ["2020-03-30 00:00:00",4,2020,"2020W14"],
  ["2020-04-06 00:00:00",4,2020,"2020W15"],
  ["2020-04-13 00:00:00",4,2020,"2020W16"],
  ["2020-04-20 00:00:00",4,2020,"2020W17"],
  ["2020-04-27 00:00:00",4,2020,"2020W18"],
  ["2021-05-31 00:00:00",6,2021,"2021W22"],
  ["2021-06-07 00:00:00",6,2021,"2021W23"],
  ["2021-06-14 00:00:00",6,2021,"2021W24"],
  ["2021-06-21 00:00:00",6,2021,"2021W25"],
  ["2021-06-28 00:00:00",7,2021,"2021W26"],
  ["2021-07-05 00:00:00",7,2021,"2021W27"],
  ["2021-07-12 00:00:00",7,2021,"2021W28"],
  ["2021-07-19 00:00:00",7,2021,"2021W29"],
  ["2021-07-26 00:00:00",7,2021,"2021W30"],
  ["2021-08-02 00:00:00",8,2021,"2021W31"],
  ["2021-08-09 00:00:00",8,2021,"2021W32"],
  ["2021-08-16 00:00:00",8,2021,"2021W33"],
  ["2021-08-23 00:00:00",8,2021,"2021W34"],
  ["2021-08-30 00:00:00",9,2021,"2021W35"],
  ["2021-09-06 00:00:00",9,2021,"2021W36"],
  ["2021-09-13 00:00:00",9,2021,"2021W37"],
  ["2021-09-20 00:00:00",9,2021,"2021W38"],
  ["2021-09-27 00:00:00",9,2021,"2021W39"],
  ["2019-12-30 00:00:00",1,2020, "2020W1"],
  ["2020-01-06 00:00:00",1,2020, "2020W2"],
  ["2020-01-13 00:00:00",1,2020, "2020W3"],
  ["2020-01-20 00:00:00",1,2020, "2020W4"],
  ["2020-01-27 00:00:00",1,2020, "2020W5"],
  ["2020-02-03 00:00:00",2,2020, "2020W6"],
  ["2020-02-10 00:00:00",2,2020, "2020W7"],
  ["2020-02-17 00:00:00",2,2020, "2020W8"],
  ["2020-02-24 00:00:00",2,2020, "2020W9"],
], schema=["start_of_week","month","year","week_no"])

2guxujil 2#

You can use strftime from the datetime module to get the week number in the desired format.

from datetime import date

# this produces the format you want (year, 'W', zero-padded Monday-based week)
date.today().strftime("%YW%W")
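To apply this to a DataFrame column, the call can be wrapped in a UDF. A minimal sketch (the to_week_no helper below is only illustrative; note that %W is zero-padded and Monday-based, so early weeks come out as W01, W02, ... rather than W1, W2, ...):

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from datetime import datetime

# illustrative helper: format a yyyy-MM-dd string as <year>W<zero-padded week>
@F.udf(returnType=StringType())
def to_week_no(d):
    return datetime.strptime(d, "%Y-%m-%d").strftime("%YW%W")

df.withColumn("week_no", to_week_no("BEGIN")).show()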

0lvr5msh 3#

Use datetime functions to find the year and the week number, and build a list populated with the week numbers and years covered by the BEGIN to END range; then explode that list.
Full example:

df = spark.createDataFrame(data=[
      ["whatever1","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
      ["whatever2","XYZ","2021-09-27","2021-10-03",0.0,1.0,0.0,0.0,0.0],
      ["whatever3","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
      ["whatever4","XYZ","2021-10-04","2021-10-10",0.0,1.0,0.0,0.0,0.0],
      ["whatever6","XYZ","2021-10-18","2021-10-24",0.0,0.0,1.0,0.0,0.0],
      ["whatever9","XYZ","2021-10-25","2021-10-31",0.0,1.0,0.0,0.0,0.0],
      ["whatever100","XYZ","2021-12-20","2022-01-10",0.9,1.9,0.9,0.9,0.9],
    ], schema=["NAME","X_NAME","BEGIN","END","A","B","C","D","E"])

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def week_range(begin, end):
    from datetime import datetime
    begin_dt = datetime.strptime(begin, "%Y-%m-%d").date()
    end_dt = datetime.strptime(end, "%Y-%m-%d").date()
    if begin_dt.year == end_dt.year:
        # same calendar year: enumerate the ISO week numbers directly
        return [f"{begin_dt.year}W{x}"
                for x in range(begin_dt.isocalendar()[1], end_dt.isocalendar()[1] + 1)]
    elif begin_dt.year < end_dt.year:
        # range crosses a year boundary: weeks up to Dec 31, then weeks from W1 of the next year
        return [f"{begin_dt.year}W{x}"
                for x in range(begin_dt.isocalendar()[1],
                               datetime.strptime(f"{begin_dt.year}-12-31", "%Y-%m-%d").date().isocalendar()[1] + 1)] \
             + [f"{end_dt.year}W{x}" for x in range(1, end_dt.isocalendar()[1] + 1)]

df = df.withColumn("week_no", week_range("BEGIN", "END"))
df = df.withColumn("week_no", F.explode("week_no"))
df = df.drop("BEGIN", "END")

Output:

+-----------+------+---+---+---+---+---+-------+
|NAME       |X_NAME|A  |B  |C  |D  |E  |week_no|
+-----------+------+---+---+---+---+---+-------+
|whatever1  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever2  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W39|
|whatever3  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever4  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W40|
|whatever6  |XYZ   |0.0|0.0|1.0|0.0|0.0|2021W42|
|whatever9  |XYZ   |0.0|1.0|0.0|0.0|0.0|2021W43|
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2021W51|
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2021W52|
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2022W1 |
|whatever100|XYZ   |0.9|1.9|0.9|0.9|0.9|2022W2 |
+-----------+------+---+---+---+---+---+-------+

PS: with the last record I tested the corner case of a range that runs across the end of one year into the new year.
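
As a quick sanity check (not part of the answer itself), the generated week_no values can be compared against df2 from the question with an anti-join; an empty result means every generated week exists in df2:

# rows whose generated week_no has no match in df2 (empty if everything lines up)
missing = df.join(df2, on="week_no", how="left_anti")
missing.select("NAME", "week_no").show()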
