Scala: using a Spark SQL function when selecting a column from a DataFrame

Posted by wj8zmpe1 on 2021-07-14 in Java


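I have two tables/DataFrames, A and B. A has the columns cust_id and purch_date; B has the columns cust_id and col1 (col1 is not needed).
The following example shows the contents of each table: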

Table A
cust_id  purch_date
  34564  2017-08-21
  34564  2017-08-02
  34564  2017-07-21
  23847  2017-09-13
  23423  2017-06-19

Table B
cust_id  col1
  23442     x
  12452     x
  12464     x  
  23847     x
  24354     x

I want to select cust_id and the first day of the month of purch_date for every cust_id that does not appear in B.
This can be achieved in SQL with the following command:

select a.cust_id, trunc(purch_date, 'MM') as mon
from a
left join b
on a.cust_id = b.cust_id
where b.cust_id is null
group by a.cust_id, mon;

The output is as follows:

Table A
cust_id  purch_date
  34564  2017-08-01
  34564  2017-07-01
  23423  2017-06-01

To achieve the same thing in Scala, I tried the following:

import org.apache.spark.sql.functions._

a = spark.sql("select * from db.a")
b = spark.sql("select * from db.b")

var out = a.join(b, Seq("cust_id"), "left")
           .filter("col1 is null")
           .select("cust_id", trunc("purch_date", "month"))
           .distinct()

But I get various errors, such as:

error: type mismatch; found: StringContext required: ?{def $: ?}

I am stuck here and could not find enough documentation or answers online.

scala DataFrame apache-spark sql-function

Source: https://stackoverflow.com/questions/54838024/scala-using-a-spark-sql-function-when-selecting-column-from-a-dataframe

1 Answer


4dc9hkyq1#

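select should be passed Columns, not Strings: the two cannot be mixed in one call, and trunc expects a Column as its first argument.
Input: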

df1:
+-------+----------+
|cust_id|purch_date|
+-------+----------+
|  34564|2017-08-21|
|  34564|2017-08-02|
|  34564|2017-07-21|
|  23847|2017-09-13|
|  23423|2017-06-19|
+-------+----------+    

df2:
+-------+----+
|cust_id|col1|
+-------+----+
|  23442|   X|
|  12452|   X|
|  12464|   X|
|  23847|   X|
|  24354|   X|
+-------+----+
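For anyone who wants to reproduce these inputs, here is a minimal sketch. It assumes a local SparkSession (the `local[*]` master, the app name, and the object and method names are placeholders, not from the original post) and that purch_date is stored as a date; the post does not state the column types.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}

object ExampleInputs {
  // df1 from the answer. Assumption: purch_date is a DateType column.
  def purchases(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for toDF on a local Seq
    Seq(
      (34564, "2017-08-21"),
      (34564, "2017-08-02"),
      (34564, "2017-07-21"),
      (23847, "2017-09-13"),
      (23423, "2017-06-19")
    ).toDF("cust_id", "purch_date")
      .withColumn("purch_date", to_date(col("purch_date"))) // string -> date
  }

  // df2 from the answer: the customers to exclude.
  def excluded(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((23442, "X"), (12452, "X"), (12464, "X"), (23847, "X"), (24354, "X"))
      .toDF("cust_id", "col1")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in the question `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("example-inputs").getOrCreate()
    purchases(spark).show()
    excluded(spark).show()
    spark.stop()
  }
}
```

In spark-shell the session already exists as `spark`, so only the two `Seq(...).toDF(...)` expressions are needed there.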

Change the query as follows:

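import org.apache.spark.sql.functions.trunc
import spark.implicits._ // enables the $"..." column syntax (spark-shell imports this automatically)

df1.join(df2, Seq("cust_id"), "left")
  .filter("col1 is null") // keep only customers with no match in df2
  .select($"cust_id", trunc($"purch_date", "MM"))
  .distinct()
  .show()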

Output:

+-------+---------------------+
|cust_id|trunc(purch_date, MM)|
+-------+---------------------+
|  23423|           2017-06-01|
|  34564|           2017-07-01|
|  34564|           2017-08-01|
+-------+---------------------+
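A possible variant, sketched here as an illustration rather than as part of the original answer: the `$"..."` syntax requires `import spark.implicits._`, whereas `col(...)` from `org.apache.spark.sql.functions` builds the same Column without it. The sketch also swaps the left join plus null filter for a `left_anti` join and names the truncated column `mon`, as in the question's SQL. The object name `MonthlyPurchases` and the method name `firstOfMonthNotIn` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date, trunc}

object MonthlyPurchases {
  // Rows of `a` whose cust_id has no match in `b`, with purch_date
  // truncated to the first day of its month, de-duplicated.
  def firstOfMonthNotIn(a: DataFrame, b: DataFrame): DataFrame =
    a.join(b, Seq("cust_id"), "left_anti") // anti join: keep unmatched rows of a only
      .select(col("cust_id"), trunc(col("purch_date"), "MM").as("mon"))
      .distinct()

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session for experimentation.
    val spark = SparkSession.builder().master("local[*]").appName("monthly-purchases").getOrCreate()
    import spark.implicits._ // only needed here for toDF

    val a = Seq(
      (34564, "2017-08-21"), (34564, "2017-08-02"), (34564, "2017-07-21"),
      (23847, "2017-09-13"), (23423, "2017-06-19")
    ).toDF("cust_id", "purch_date")
      .withColumn("purch_date", to_date(col("purch_date")))

    val b = Seq((23442, "X"), (12452, "X"), (12464, "X"), (23847, "X"), (24354, "X"))
      .toDF("cust_id", "col1")

    firstOfMonthNotIn(a, b).orderBy("cust_id", "mon").show()
    spark.stop()
  }
}
```

On the sample data this should return the same three rows as the answer's output. The two approaches differ in one case: the left join with `col1 is null` also keeps matched rows whose col1 happens to be null, while the anti join drops every cust_id present in B regardless of col1.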
