Count nulls in DataFrame columns and collect column name and count into a Map

kzipqqlq · posted 2021-05-29 in Hadoop

I am a Scala beginner. I am trying to find the count of null values in the columns of a table and add each column name and its count to a map as a key-value pair. The code below does not work as expected. Please guide me on how to modify it so that it works.

```scala
def nullCheck(databaseName: String, tableName: String) = {
  var map = scala.collection.mutable.Map[String, Int]()
  validationColumn = Array(col1, col2)
  for (i <- 0 to validationColumn.length) {
    val nullVal = spark.sql(s"select count(*) from $databaseName.$tableName where validationColumn(i) is NULL")
    if (nullval == 0)
      map(validationColumn(i)) = nullVal
    map
}
```

The function should return ((col1, count), (col2, count)) as a Map.
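For reference, a minimal corrected sketch of the function above, assuming a SparkSession in scope named `spark` (the column names `col1`/`col2` are placeholders from the question); the individual fixes are noted in the comments:

```scala
def nullCheck(databaseName: String, tableName: String): Map[String, Long] = {
  val map = scala.collection.mutable.Map[String, Long]()
  val validationColumn = Array("col1", "col2")   // declare with val, as an Array of String literals
  for (i <- 0 until validationColumn.length) {   // `until`, not `to`: `to` runs one index past the end
    val nullVal = spark
      .sql(s"select count(*) from $databaseName.$tableName where ${validationColumn(i)} is NULL") // interpolate the column name into the query
      .head()
      .getLong(0)                                // spark.sql returns a DataFrame; extract the actual count (a Long, not Int)
    map(validationColumn(i)) = nullVal           // record every count (the original `if` silently dropped non-zero ones)
  }
  map.toMap                                      // return once, after the loop completes
}
```

Note that this still issues one query per column; the answer below computes all the counts in a single pass over the data.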

Answer 1 (oalqel3c):

This can be achieved by building a dynamic SQL string and then mapping over its result. Your approach reads the same data multiple times.
Here is the solution, using a sample DataFrame:

```scala
scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")

scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam  |null|200           |
|null |31  |30            |
|John |25  |25            |
|Harry|null|100           |
+-----+----+--------------+
```

Our validation columns are `name` and `age`, the columns in which we count nulls. We put them in a list:

```scala
scala> val validationColumns = List("name","age")
```

We create a SQL string that drives the whole computation:

```scala
scala> val sqlStr = "select " + validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum").mkString(",") +
         " from (select " + validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") +
         " from (select" + validationColumns.map(x => " nvl( " + x + ",'$$') as " + x).mkString(",") +
         " from example_table where " + validationColumns.map(x => x + " is null ").mkString("or ") + " ) layer1 ) layer2 "
```

which resolves to:

```sql
select sum(name_count) AS name_sum, sum(age_count) AS age_sum
from (select case when name = '$$' then 1 else 0 end AS name_count,
             case when age = '$$' then 1 else 0 end AS age_count
      from (select nvl( name,'$$') as name, nvl( age,'$$') as age
            from example_table
            where name is null or age is null ) layer1 ) layer2
```

Now we create a temporary view over the DataFrame:

```scala
inputDf.createOrReplaceTempView("example_table")
```

The only thing left to do is to execute the SQL and build the Map, which is done by:

```scala
validationColumns zip spark.sql(sqlStr).collect.map(_.toSeq).flatten.toList toMap
```

And the result:

```scala
Map(name -> 1, age -> 2) // obviously you can make it type-safe
```
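As an illustration of that closing remark about type safety, here is one possible sketch that computes all the null counts in a single pass using the DataFrame API instead of a SQL string. This is an alternative technique, not part of the original answer; it reuses the `inputDf` and `validationColumns` defined above:

```scala
import org.apache.spark.sql.functions.{col, count, when}

// count(when(cond, 1)) counts only the rows where cond holds,
// because `when` without `otherwise` yields null for all other rows.
val countsRow = inputDf
  .select(validationColumns.map(c => count(when(col(c).isNull, 1)).alias(c + "_nulls")): _*)
  .head()

// Zip the column names with the counts in the single result Row.
val nullCounts: Map[String, Long] =
  validationColumns.zipWithIndex.map { case (c, i) => c -> countsRow.getLong(i) }.toMap
// nullCounts == Map(name -> 1, age -> 2)
```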
