I am using Spark 2.4 with Scala.
```
spark.sqlContext.sql("set spark.sql.caseSensitive=false")
spark.sql("select distinct p.Area, c.Remarks from mytable c join areatable p on c.id = p.id where c.remarks = 'Sufficient Amounts'")
```
I used DISTINCT, but I still get 3 records for each value:
```
DISTRICT_1 |Sufficient Amounts
District_1 |Sufficient Amounts
district_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
District_10|Sufficient Amounts
district_10|Sufficient Amounts
```
This happens even though I explicitly ran spark.sqlContext.sql("set spark.sql.caseSensitive=false").
Expected output:
```
DISTRICT_1 |Sufficient Amounts
DISTRICT_10|Sufficient Amounts
```
Do I need to do anything else? Please share your thoughts.
1 Answer
spark.sql.caseSensitive only controls whether column names are resolved case-insensitively; it does not transform column values, so DISTINCT still sees DISTRICT_1 and District_1 as different strings. Use the window function row_number() for this case instead. Example:
```
df.show()
//+-----------+------------------+
//| Area| Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//| District_1|Sufficient Amounts|
//| district_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//|District_10|Sufficient Amounts|
//|district_10|Sufficient Amounts|
//+-----------+------------------+
df.createOrReplaceTempView("mytable")
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
sql("select
(rn)?+.+
from (select *, row_number() over(partition by lower(Area) order by 1) as rn from mytable)q where q.rn =1").show()//+-----------+------------------+
//| Area| Remarks|
//+-----------+------------------+
//| DISTRICT_1|Sufficient Amounts|
//|DISTRICT_10|Sufficient Amounts|
//+-----------+------------------+
```
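If you prefer to stay in the DataFrame API, the same deduplication can be sketched directly with a window (an equivalent alternative, not part of the answer above; note that which case variant survives per group is arbitrary, since the window has no meaningful ordering):

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Group case variants together by partitioning on the lower-cased Area;
// lit(1) is a constant ordering, so an arbitrary row per group gets rn = 1.
val w = Window.partitionBy(lower(col("Area"))).orderBy(lit(1))

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1) // keep one row per case-insensitive Area
  .drop("rn")
  .show()
```

If any of the case variants is acceptable as the surviving value, this avoids the quoted-regex-column-name setting entirely.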