pyspark if-else和循环编码问题

hm2xizp9  于 2021-07-13  发布在  Spark
关注(0)|答案(0)|浏览(356)

我从来没有写循环或任何循环之前,这是我第一次尝试这样做,我现在卡住了。
这是我原来的table:
df1型

ID   supply_date     supply_days_cnt   food  
1      2020/01/01       28             cheese
1      2020/02/01       28             cheese
1      2020/03/01       28              meat
1      2020/03/04       28             cheese
1      2020/04/01       28              meat
1      2020/09/01       28             cheese

我设法做到了如下:
df2型

supply_end_Date is supply_date + supply_days_cnt
    next_supply_date is the supply_date from next row for the same food category 
    gap_until_next_supply is next_supply_date - supply_end_date, it can be negative 

ID       supply_date   supply_end_date    food   next_supply_date  gap_until_next_supply       
1         2020/01/01    2020/01/29       cheese   2020/02/01           2
1         2020/02/01    2020/02/29       cheese   2020/03/04           3
1         2020/03/01    2020/3/29        meat     2020/04/01           1
1         2020/03/04    2020/04/01       cheese   2020/09/01           150
1         2020/04/01    2020/04/29       meat    null                null
1         2020/09/01    2020/09/29       cheese   null                null

以下是word的逻辑:

1.  If interval between two food shippings is great than or equal to 90 days,
    these two food shipping belong to two different lines of shipment 

2.  For intervals < 90 days between current shipping and previous shipping:
a.  If the food shipping are same, they belong to the same line
b.  If the food shipping are totally different, look back 90 days: 
        If the same current food shipping found within past 90 days, then same line
             Else look forward 90 days:
                       If the same current food shipping found within next 90 days, then same line
                                   Else different lines

因此,这方面的期望输出应该是:

ID supply_start_date  supply_end_date  food
1   2020/01/01         2020/04/29      cheese,meat
1   2020/09/01         2020/09/29      cheese

这是我的代码,还没有完成,但我已经得到错误。

if food1 = food2
    then return agg(array_sort_udf(collect_list('food'))
  else if food1 != food2,
      if  food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
      if  food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))
        else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
          then return agg(array_sort_udf(collect_list('food'))

所以如果有6组食物分类,上面的代码需要重复6次,效率不是很高。
df1和df2都是dataframe(pyspark dataframe),我应该把它们改成dictionary或者其他什么来用于循环吗?
有没有一种方法可以直接使用pyspark代码来获得所需的输出?不使用循环?
第一次写循环,任何帮助都会非常感激。我真的很想学。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题