我从来没有写循环或任何循环之前,这是我第一次尝试这样做,我现在卡住了。
这是我原来的table:
df1型
ID supply_date supply_days_cnt food
1 2020/01/01 28 cheese
1 2020/02/01 28 cheese
1 2020/03/01 28 meat
1 2020/03/04 28 cheese
1 2020/04/01 28 meat
1 2020/09/01 28 cheese
我设法做到了如下:
df2型
supply_end_Date is supply_date + supply_days_cnt
next_supply_date is the supply_date from next row for the same food category
gap_until_next_supply is next_supply_date - supply_end_date, it can be negative
ID supply_date supply_end_date food next_supply_date gap_until_next_supply
1 2020/01/01 2020/01/29 cheese 2020/02/01 2
1 2020/02/01 2020/02/29 cheese 2020/03/04 3
1 2020/03/01 2020/3/29 meat 2020/04/01 1
1 2020/03/04 2020/04/01 cheese 2020/09/01 150
1 2020/04/01 2020/04/29 meat null null
1 2020/09/01 2020/09/29 cheese null null
以下是word的逻辑:
1. If interval between two food shippings is great than or equal to 90 days,
these two food shipping belong to two different lines of shipment
2. For intervals < 90 days between current shipping and previous shipping:
a. If the food shipping are same, they belong to the same line
b. If the food shipping are totally different, look back 90 days:
If the same current food shipping found within past 90 days, then same line
Else look forward 90 days:
If the same current food shipping found within next 90 days, then same line
Else different lines
因此,这方面的期望输出应该是:
ID supply_start_date supply_end_date food
1 2020/01/01 2020/04/29 cheese,meat
1 2020/09/01 2020/09/29 cheese
这是我的代码,还没有完成,但我已经得到错误。
if food1 = food2
then return agg(array_sort_udf(collect_list('food'))
else if food1 != food2,
if food1.min(supply_date) between food2.min(supply_date) and food2.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
else if food1.max(supply_date) between food2.min(supply_date) and food2.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
if food2.min(supply_date) between food1.min(supply_date) and food1.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
else if food2.max(supply_date) between food1.min(supply_date) and food1.max(supply_date)
then return agg(array_sort_udf(collect_list('food'))
所以如果有6组食物分类,上面的代码需要重复6次,效率不是很高。
df1和df2都是dataframe(pyspark dataframe),我应该把它们改成dictionary或者其他什么来用于循环吗?
有没有一种方法可以直接使用pyspark代码来获得所需的输出?不使用循环?
第一次写循环,任何帮助都会非常感激。我真的很想学。
暂无答案!
目前还没有任何答案,快来回答吧!