pyspark if-else和循环编码问题

我从来没有写循环或任何循环之前，这是我第一次尝试这样做，我现在卡住了。
这是我原来的table：
df1型

ID   supply_date     supply_days_cnt   food  
1      2020/01/01       28             cheese
1      2020/02/01       28             cheese
1      2020/03/01       28              meat
1      2020/03/04       28             cheese
1      2020/04/01       28              meat
1      2020/09/01       28             cheese

我设法做到了如下：
df2型

supply_end_Date is supply_date + supply_days_cnt
    next_supply_date is the supply_date from next row for the same food category 
    gap_until_next_supply is next_supply_date - supply_end_date, it can be negative 
ID       supply_date   supply_end_date    food   next_supply_date  gap_until_next_supply       
1         2020/01/01    2020/01/29       cheese   2020/02/01           2
1         2020/02/01    2020/02/29       cheese   2020/03/04           3
1         2020/03/01    2020/3/29        meat     2020/04/01           1
1         2020/03/04    2020/04/01       cheese   2020/09/01           150
1         2020/04/01    2020/04/29       meat    null                null
1         2020/09/01    2020/09/29       cheese   null                null

以下是word的逻辑：

1.  If interval between two food shippings is great than or equal to 90 days,
    these two food shipping belong to two different lines of shipment 
2.  For intervals < 90 days between current shipping and previous shipping:
a.  If the food shipping are same, they belong to the same line
b.  If the food shipping are totally different, look back 90 days: 
        If the same current food shipping found within past 90 days, then same line
             Else look forward 90 days:
                       If the same current food shipping found within next 90 days, then same line
                                   Else different lines

因此，这方面的期望输出应该是：

ID supply_start_date  supply_end_date  food
1   2020/01/01         2020/04/29      cheese,meat
1   2020/09/01         2020/09/29      cheese

这是我的代码，还没有完成，但我已经得到错误。

if food1 = food2
    then return agg(array_sort_udf(collect_list('food'))
  else if food1 != food2,
      if  food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
      if  food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))

所以如果有6组食物分类，上面的代码需要重复6次，效率不是很高。
df1和df2都是dataframe（pyspark dataframe），我应该把它们改成dictionary或者其他什么来用于循环吗？
有没有一种方法可以直接使用pyspark代码来获得所需的输出？不使用循环？
第一次写循环，任何帮助都会非常感激。我真的很想学。

pyspark if-else和循环编码问题

暂无答案！

相关问题

热门标签

最新问答