sparksql连接来自同一个表的两个结果

e5njpo68  于 2021-08-09  发布在  Java
关注(0)|答案(1)|浏览(508)

我有一张名为“已售物品”的表格,如下所示。我想使用sparksql来获取每个参与者的净销售额。

Item     Buyer   Seller   Qty
----------------------------------
A        JD      Lidl     100
B        SD      JD       500
A        Coop    JD       125
C        JD      SD       300

中间工作台

Item     Participant      Buy         Sell
--------------------------------------------
A        JD                100        125 
B        JD                0          500
C        JD                300          0
A        Coop              125          0
A        Lidl                0        100
B        SD                500          0    
C        SD                  0        300

最终结果如下所示。

Item     Participant      Net Sell
----------------------------------
A        JD                 25
B        JD                500
C        JD               -300
A        Coop             -125
A        Lidl              100
B        SD               -500  
C        SD                300

我有以下两个关于第一个表的买卖方面的问题。
购买:

SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer

卖:

SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller

我试图得到中间表,这样我就可以使用该表得到最终结果。但我似乎不能把这两个问题结合起来。如果您对如何结合以上两个查询来获得中间表有任何建议,我们将不胜感激。

gjmwrych

gjmwrych1#

拆开并重新聚集。这是最简单的 union all :

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by buyer
      ) union all
      (select seller as user, 0, sum(qty)
       from sold_items
       group by seller
      )
     ) bs
group by user;

请注意,实际上并不需要子查询中的聚合,因此这样也可以:

select user, sum(buy_qty), sum(sell_qty)
from ((select buyer as user, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select seller as user, 0, qty
       from sold_items
      )
     ) bs
group by user;

我希望多聚合版本在大型数据集上有更好的性能——尽管改进可能没有那么大。

相关问题