Case sensitivity creating duplicates in a DataFrame rather than distinct rows in Spark Scala

ht4b089n · asked 2021-05-27 · Spark

I'm using Spark 2.4 with Scala.

```
spark.sqlContext.sql("set spark.sql.caseSensitive=false")
spark.sql("select Distinct p.Area,c.Remarks from mytable c join areatable p on c.id=p.id where c.remarks = 'Sufficient Amounts'")
```

Even though I used distinct, I still get three rows for each record:

```
DISTRICT_1 |Sufficient Amounts
District_1 |Sufficient Amounts
district_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
District_10|Sufficient Amounts
district_10|Sufficient Amounts
```

This happens even though I explicitly set spark.sqlContext.sql("set spark.sql.caseSensitive=false"). Expected output:

```
DISTRICT_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
```

Do I need to do something else? Please share your thoughts.


r3i60tvu #1

spark.sql.caseSensitive controls case sensitivity in column names; it does not transform column values.
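To see the distinction concretely, here is a minimal sketch for a spark-shell session, using the sample values from the question (the `demo` DataFrame is constructed here for illustration, it is not part of the original post):

```
spark.sql("SET spark.sql.caseSensitive=false")

import spark.implicits._
val demo = Seq("DISTRICT_1", "District_1", "district_1").toDF("Area")

// Name resolution is case-insensitive: "AREA" resolves to the column "Area"
demo.select("AREA").show()

// The values themselves are untouched: these are three distinct strings,
// so distinct() correctly keeps all three rows
demo.distinct().count() // 3
```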
Use the window function row_number() for this case. Example:
```
df.show()

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//| District_1|Sufficient Amounts|
//| district_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//|District_10|Sufficient Amounts|
//|district_10|Sufficient Amounts|
//+-----------+------------------+

df.createOrReplaceTempView("mytable")

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

sql("select (rn)?+.+ from (select *, row_number() over(partition by lower(Area) order by 1) as rn from mytable)q where q.rn =1").show()

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+

```
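For reference, the same deduplication can be written with the DataFrame API. This is a minimal sketch, not part of the original answer; ordering the window by Area itself makes the kept row deterministic, since uppercase letters sort before lowercase in binary string order, so the DISTRICT_* variants survive:

```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// One partition per lower-cased Area; row_number() = 1 keeps a single row per group
val w = Window.partitionBy(lower(col("Area"))).orderBy(col("Area"))

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show() // row order may vary

//+-----------+------------------+
//|       Area|           Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+
```

A simpler alternative, if you don't care which casing survives, is dropDuplicates on a lower-cased helper column: df.withColumn("k", lower(col("Area"))).dropDuplicates("k").drop("k").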
