Correlated subquery in Spark SQL

6qqygrtg posted on 2021-05-27 in Spark


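I have the following two tables, and I need to use a correlated subquery to check whether values from one exist in the other.
The requirement: for each row in orders, check whether the corresponding custid exists in the customer table, then output a field named FLAG with the value Y if the custid exists and N if it does not.
Orders: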

orderid | custid
12345   | XYZ
34566   | XYZ
68790   | MNP
59876   | QRS
15620   | UVW

Customer:

id | custid
1  | XYZ
2  | UVW

Expected output:

orderid | custid  | FLAG
12345   | XYZ     | Y
34566   | XYZ     | Y 
68790   | MNP     | N
59876   | QRS     | N
15620   | UVW     | Y

I tried the following, but it did not work:

select 
o.orderid,
o.custid,
case when o.custid EXISTS (select 1 from customer c on c.custid = o.custid)
     then 'Y'
     else 'N'
end as flag
from orders o

Can this be solved with a correlated scalar subquery? If not, what is the best way to implement this requirement?
Please advise.
Note: I am using Spark SQL v2.4.0.
Thanks.

apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/63079534/correlated-subquery-in-spark-sql

1 answer


t1rydlwq1#

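In Spark, IN/EXISTS predicate subqueries can only be used in a filter (a WHERE clause), not inside a CASE expression in the select list.
The following left join works on a locally recreated copy of the data: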

select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
          from (select o.orderid, o.custid, c.custid existing_customer
                from orders o
                left join customer c
                 on c.custid = o.custid)
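
To illustrate the filter-only restriction, here is a minimal sketch that keeps EXISTS and NOT EXISTS inside WHERE clauses and combines the two halves with UNION ALL. It is an alternative to the left join, not the answer's method. The session setup, the view contents built with toDF, and the name flaggedByExists are assumptions added for illustration; in spark-shell the existing spark session would be reused by getOrCreate.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; getOrCreate reuses one if it already exists.
val spark = SparkSession.builder().master("local[*]").appName("exists-in-filter").getOrCreate()
import spark.implicits._

// Recreate the two tables from the question as temporary views.
Seq(
  ("12345", "XYZ"), ("34566", "XYZ"), ("68790", "MNP"),
  ("59876", "QRS"), ("15620", "UVW")
).toDF("orderid", "custid").createOrReplaceTempView("orders")

Seq((1, "XYZ"), (2, "UVW")).toDF("id", "custid").createOrReplaceTempView("customer")

// EXISTS / NOT EXISTS appear only in WHERE, which is where Spark accepts them.
// Each branch tags its rows with a constant flag; UNION ALL stitches them together.
val flaggedByExists = spark.sql("""
  select o.orderid, o.custid, 'Y' as flag
  from orders o
  where exists (select 1 from customer c where c.custid = o.custid)
  union all
  select o.orderid, o.custid, 'N' as flag
  from orders o
  where not exists (select 1 from customer c where c.custid = o.custid)
""")

flaggedByExists.show()
```

This scans orders twice, so the single left join is likely the simpler choice; the sketch is mainly useful for seeing where the predicate subquery is allowed.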

Here is how the left join query runs against the recreated data:

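// Parses pipe-delimited text (first line is the header) and registers it as a temp view.
def textToView(csv: String, viewName: String): Unit = {
  import spark.implicits._ // provides .toDS; already in scope inside spark-shell
  spark.read
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("delimiter", "|")
    .option("header", "true")
    .csv(spark.sparkContext.parallelize(csv.split("\n")).toDS)
    .createOrReplaceTempView(viewName)
}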

textToView("""id | custid
              1  | XYZ
              2  | UVW""", "customer")

textToView("""orderid | custid
              12345   | XYZ
              34566   | XYZ
              68790   | MNP
              59876   | QRS
              15620   | UVW""", "orders")

spark.sql("""
          select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
          from (select o.orderid, o.custid, c.custid existing_customer
                from orders o
                left join customer c
                 on c.custid = o.custid)""").show

+-------+------+-----------------+
|orderid|custid|existing_customer|
+-------+------+-----------------+
|  59876|   QRS|                N|
|  12345|   XYZ|                Y|
|  34566|   XYZ|                Y|
|  68790|   MNP|                N|
|  15620|   UVW|                Y|
+-------+------+-----------------+
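
On the question of whether a correlated scalar subquery can do this: the sketch below is one possible form, assuming the Spark version in use accepts a correlated scalar subquery in the select list when it is aggregated and correlated by equality. The aggregate max('Y') returns NULL when no customer row matches, and coalesce turns that into 'N'. The session setup, the view contents, and the name flaggedByScalar are assumptions for illustration, and this has not been confirmed by the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; getOrCreate reuses one if it already exists.
val spark = SparkSession.builder().master("local[*]").appName("scalar-subquery").getOrCreate()
import spark.implicits._

// Recreate the two tables from the question as temporary views.
Seq(
  ("12345", "XYZ"), ("34566", "XYZ"), ("68790", "MNP"),
  ("59876", "QRS"), ("15620", "UVW")
).toDF("orderid", "custid").createOrReplaceTempView("orders")

Seq((1, "XYZ"), (2, "UVW")).toDF("id", "custid").createOrReplaceTempView("customer")

// The subquery is aggregated (max), so it yields at most one value per order row:
// 'Y' when at least one customer matches, NULL otherwise, which coalesce maps to 'N'.
val flaggedByScalar = spark.sql("""
  select o.orderid,
         o.custid,
         coalesce((select max('Y') from customer c where c.custid = o.custid), 'N') as flag
  from orders o
""")

flaggedByScalar.show()
```

Because the subquery is aggregated, an order row is not repeated if customer happens to contain the same custid more than once, whereas a plain left join would return one row per matching customer.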

