Case sensitivity creating duplicates in a DataFrame rather than distinct rows in Spark Scala

ht4b089n · asked 2021-05-27 · Spark

I'm using Spark 2.4 with Scala.

```
spark.sqlContext.sql("set spark.sql.caseSensitive=false")
spark.sql("select Distinct p.Area,c.Remarks from mytable c join areatable p on c.id=p.id where c.remarks = 'Sufficient Amounts'")
```

Even though I used distinct, I still get three rows for each record:

```
DISTRICT_1 |Sufficient Amounts
District_1 |Sufficient Amounts
district_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
District_10|Sufficient Amounts
district_10|Sufficient Amounts
```

This happens even though I explicitly set spark.sqlContext.sql("set spark.sql.caseSensitive=false"). Expected output:

```
DISTRICT_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
```

Do I need to do something else? Please share your thoughts.


r3i60tvu #1

spark.sql.caseSensitive controls case sensitivity in column names; it does not transform column values.
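To see the distinction concretely, here is a minimal sketch for a spark-shell session, using the sample values from the question (the `demo` DataFrame is constructed here for illustration, it is not part of the original post):

```
spark.sql("SET spark.sql.caseSensitive=false")

import spark.implicits._
val demo = Seq("DISTRICT_1", "District_1", "district_1").toDF("Area")

// Name resolution is case-insensitive: "AREA" resolves to the column "Area"
demo.select("AREA").show()

// The values themselves are untouched: these are three distinct strings,
// so distinct() correctly keeps all three rows
demo.distinct().count() // 3
```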
Use the window function row_number() for this case. Example:
```
df.show()

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//| District_1|Sufficient Amounts|
//| district_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//|District_10|Sufficient Amounts|
//|district_10|Sufficient Amounts|
//+-----------+------------------+

df.createOrReplaceTempView("mytable")

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

sql("select (rn)?+.+ from (select *, row_number() over(partition by lower(Area) order by 1) as rn from mytable)q where q.rn =1").show()

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+

```
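For reference, the same deduplication can be written with the DataFrame API. This is a minimal sketch, not part of the original answer; ordering the window by Area itself makes the kept row deterministic, since uppercase letters sort before lowercase in binary string order, so the DISTRICT_* variants survive:

```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// One partition per lower-cased Area; row_number() = 1 keeps a single row per group
val w = Window.partitionBy(lower(col("Area"))).orderBy(col("Area"))

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show() // row order may vary

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+
```

A simpler alternative, if you don't care which casing survives, is dropDuplicates on a lower-cased helper column: df.withColumn("k", lower(col("Area"))).dropDuplicates("k").drop("k").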
