Update DataFrame column names from a list, avoiding the use of var

ar5n3qh5  posted on 2021-05-27  in Spark
I have a list that defines the columns:

case class ExcelColumn(colName: String, colType: String, colCode: String)

val cols = List(
  ExcelColumn("Products Selled", "text", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value"),
)

and a CSV file with a header row (Products Selled, Total Value) that is read as a DataFrame.

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(filePath)

  // The header row of the CSV file uses the colName values
  var finalDf = df
      .withColumn("row_id", monotonically_increasing_id())
      .select(cols
         .map(_.colName.trim)
         .map(col): _*)

  // Rename the DataFrame columns to their colCode values (for the Kudu table columns)
  cols.foreach(c => finalDf = finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))

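In the last line I rename the DataFrame columns, for example from Products Selled to products_selled. Because of that, finalDf is a var.
I would like to know whether there is a way to declare finalDf as a val instead of a var.
I tried the code below, but withColumnRenamed returns a new DataFrame, and I cannot get at that result outside of cols.foreach:
cols.foreach(c => finalDf.withColumnRenamed(c.colName.trim, c.colCode.trim))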

watbbzwu1#

A better approach is to use foldLeft together with withColumnRenamed:
case class ExcelColumn(colName: String, colType: String, colCode: String)

val cols = List(
  ExcelColumn("Products Selled", "text", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value"),
)

// The accumulator starts as df and is replaced by the renamed DataFrame at each step
val resultDF = cols.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c.colName.trim, c.colCode.trim)
}

Original schema:

root
|-- Products Selled: integer (nullable = false)
|-- Total Value: string (nullable = true)
|-- value: integer (nullable = false)

New schema:

root
|-- products_selled: integer (nullable = false)
|-- total_value: string (nullable = true)
|-- value: integer (nullable = false)
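
To see why foldLeft removes the need for a var, here is a minimal Spark-free sketch. It assumes a plain List[String] of header names in place of the DataFrame, and renameColumn is a hypothetical stand-in for withColumnRenamed (it is not a Spark function). Both versions produce the same result; foldLeft simply passes the intermediate value along as the accumulator instead of reassigning a variable.

```scala
case class ExcelColumn(colName: String, colType: String, colCode: String)

object FoldLeftSketch {
  val cols = List(
    ExcelColumn("Products Selled", "text", "products_selled"),
    ExcelColumn("Total Value", "int", "total_value")
  )

  // Assumption: a list of header names stands in for the DataFrame.
  val header = List("Products Selled", "Total Value", "value")

  // Hypothetical stand-in for DataFrame.withColumnRenamed:
  // returns a new list and leaves the input untouched.
  def renameColumn(names: List[String], from: String, to: String): List[String] =
    names.map(n => if (n == from) to else n)

  // Version with a var: every step reassigns the variable.
  def withVar: List[String] = {
    var current = header
    cols.foreach(c => current = renameColumn(current, c.colName.trim, c.colCode.trim))
    current
  }

  // Version with foldLeft: the accumulator carries the intermediate result.
  val withFold: List[String] =
    cols.foldLeft(header)((acc, c) => renameColumn(acc, c.colName.trim, c.colCode.trim))

  def main(args: Array[String]): Unit =
    println(withFold) // List(products_selled, total_value, value)
}
```

Columns that are not in the list (value here) pass through unchanged, which matches the schemas shown above.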

3pmvbmvn2#

You can also rename columns with select.
Renaming columns inside select is faster than foldLeft; check the post for a comparison.
Try the code below.

case class ExcelColumn(colName: String, colType: String, colCode: String)

val cols = List(
  ExcelColumn("Products Selled", "string", "products_selled"),
  ExcelColumn("Total Value", "int", "total_value"),
)
val colExpr = cols.map(c => trim(col(c.colName)).as(c.colCode.trim))

If the ExcelColumn case class stores valid column data types, you can also cast the columns to those types, as shown below.

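// Caveat: trim() operates on the column values and returns a string column,
// so the value produced by the cast ends up as a string again.
val colExpr = cols.map(c => trim(col(c.colName).cast(c.colType)).as(c.colCode.trim))
finalDf.select(colExpr: _*)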
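
Note that a select built only from the list keeps only the listed columns. If unlisted columns should survive as well, one possible variation is to compute an alias for every existing column in a single pass. The sketch below is Spark-free and only shows the name mapping; the object name, the renames map and the aliases helper are all hypothetical, and the commented-out line shows how the pairs might be fed into select, assuming a DataFrame named df.

```scala
object SelectRenameSketch {
  // Old header name -> new column code, mirroring the ExcelColumn list.
  val renames: Map[String, String] = Map(
    "Products Selled" -> "products_selled",
    "Total Value" -> "total_value"
  )

  // Pair every existing column with its alias; unlisted columns keep their name.
  def aliases(columns: Seq[String]): Seq[(String, String)] =
    columns.map(name => name -> renames.getOrElse(name.trim, name))

  // Assumed usage with Spark (not executed here):
  // val finalDf = df.select(aliases(df.columns).map { case (from, to) => col(from).as(to) }: _*)

  def main(args: Array[String]): Unit =
    aliases(Seq("Products Selled", "Total Value", "value")).foreach(println)
}
```

This builds one projection instead of one withColumnRenamed call per column, and finalDf can be declared as a val.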
