Case-sensitive duplicates in a DataFrame are not removed by distinct in Spark Scala

ht4b089n posted on 2021-05-27 in Spark

I am using Spark 2.4 with Scala.

```
spark.sqlContext.sql("set spark.sql.caseSensitive=false")
spark.sql("select Distinct p.Area, c.Remarks from mytable c join areatable p on c.id = p.id where c.remarks = 'Sufficient Amounts'")
```

I used DISTINCT, but even then I get 3 records for each area:

```
DISTRICT_1 |Sufficient Amounts
District_1 |Sufficient Amounts
district_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
District_10|Sufficient Amounts
district_10|Sufficient Amounts
```

This happens even though I explicitly set spark.sqlContext.sql("set spark.sql.caseSensitive=false"). Expected output:

```
DISTRICT_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
```

Do I need to do something else? Please share your thoughts.

r3i60tvu #1

spark.sql.caseSensitive controls whether column names are resolved case-sensitively; it does not change how column values are compared.
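A minimal sketch (with hypothetical two-row data) of that difference: with the flag set to false, the name AREA still resolves to the Area column, but the stored values keep their casing, so distinct still sees two rows:

```
import spark.implicits._

val demo = Seq(
  ("DISTRICT_1", "Sufficient Amounts"),
  ("district_1", "Sufficient Amounts")
).toDF("Area", "Remarks")

spark.sql("set spark.sql.caseSensitive=false")

// Column-name resolution ignores case: "AREA" finds the "Area" column
demo.select("AREA").show()

// Values are compared as-is, so both rows survive distinct
demo.distinct().count() // 2, not 1
```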
Use the row_number() window function for this case. Example:
```
df.show()

//+-----------+------------------+
//| Area| Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//| District_1|Sufficient Amounts|
//| district_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//|District_10|Sufficient Amounts|
//|district_10|Sufficient Amounts|
//+-----------+------------------+

df.createOrReplaceTempView("mytable")

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

sql("select (rn)?+.+ from (select *, row_number() over(partition by lower(Area) order by 1) as rn from mytable)q where q.rn =1").show()

//+-----------+------------------+
//| Area| Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+
```
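If you prefer the DataFrame API, an equivalent sketch of the same technique: partition a row_number() window by lower(Area) and keep the first row per partition (which casing survives depends on the orderBy you pick; here ascending Area puts the all-uppercase value first):

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One partition per case-insensitive Area value; ASCII order puts
// "DISTRICT_1" before "District_1" and "district_1"
val w = Window.partitionBy(lower(col("Area"))).orderBy(col("Area"))

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show()
```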
