pyspark if-else和循环编码问题

hm2xizp9  于 2021-07-13  发布在  Spark
关注(0)|答案(0)|浏览(372)

我从来没有写循环或任何循环之前,这是我第一次尝试这样做,我现在卡住了。
这是我原来的table:
df1型

  1. ID supply_date supply_days_cnt food
  2. 1 2020/01/01 28 cheese
  3. 1 2020/02/01 28 cheese
  4. 1 2020/03/01 28 meat
  5. 1 2020/03/04 28 cheese
  6. 1 2020/04/01 28 meat
  7. 1 2020/09/01 28 cheese

我设法做到了如下:
df2型

  1. supply_end_Date is supply_date + supply_days_cnt
  2. next_supply_date is the supply_date from next row for the same food category
  3. gap_until_next_supply is next_supply_date - supply_end_date, it can be negative
  4. ID supply_date supply_end_date food next_supply_date gap_until_next_supply
  5. 1 2020/01/01 2020/01/29 cheese 2020/02/01 2
  6. 1 2020/02/01 2020/02/29 cheese 2020/03/04 3
  7. 1 2020/03/01 2020/3/29 meat 2020/04/01 1
  8. 1 2020/03/04 2020/04/01 cheese 2020/09/01 150
  9. 1 2020/04/01 2020/04/29 meat null null
  10. 1 2020/09/01 2020/09/29 cheese null null

以下是word的逻辑:

  1. 1. If interval between two food shippings is great than or equal to 90 days,
  2. these two food shipping belong to two different lines of shipment
  3. 2. For intervals < 90 days between current shipping and previous shipping:
  4. a. If the food shipping are same, they belong to the same line
  5. b. If the food shipping are totally different, look back 90 days:
  6. If the same current food shipping found within past 90 days, then same line
  7. Else look forward 90 days:
  8. If the same current food shipping found within next 90 days, then same line
  9. Else different lines

因此,这方面的期望输出应该是:

  1. ID supply_start_date supply_end_date food
  2. 1 2020/01/01 2020/04/29 cheese,meat
  3. 1 2020/09/01 2020/09/29 cheese

这是我的代码,还没有完成,但我已经得到错误。

  1. if food1 = food2
  2. then return agg(array_sort_udf(collect_list('food'))
  3. else if food1 != food2,
  4. if food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
  5. then return agg(array_sort_udf(collect_list('food'))
  6. else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
  7. then return agg(array_sort_udf(collect_list('food'))
  8. if food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
  9. then return agg(array_sort_udf(collect_list('food'))
  10. else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
  11. then return agg(array_sort_udf(collect_list('food'))

所以如果有6组食物分类,上面的代码需要重复6次,效率不是很高。
df1和df2都是dataframe(pyspark dataframe),我应该把它们改成dictionary或者其他什么来用于循环吗?
有没有一种方法可以直接使用pyspark代码来获得所需的输出?不使用循环?
第一次写循环,任何帮助都会非常感激。我真的很想学。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题