How to change the data types of DataFrame columns based on a case class in Scala/Spark

Posted by hivapdat on 2021-07-13 in Spark

I am trying to convert the data types of some columns based on a case class.

val simpleDf = Seq(("James",34,"2006-01-01","true","M",3000.60),
                     ("Michael",33,"1980-01-10","true","F",3300.80),
                     ("Robert",37,"1995-01-05","false","M",5000.50)
                 ).toDF("firstName","age","jobStartDate","isGraduated","gender","salary")

// Output
simpleDf.printSchema()
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = false)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = false)

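Here I want to change jobStartDate to a timestamp and isGraduated to a boolean. Is it possible to do this conversion using a case class? I know it can be done by casting each column individually, but in my case I need to map the incoming DataFrame to a predefined case class.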

case class empModel(firstName:String, 
                       age:Integer, 
                       jobStartDate:java.sql.Timestamp, 
                       isGraduated:Boolean, 
                       gender:String,
                       salary:Double
                      )

val newDf = simpleDf.as[empModel].toDF
newDf.show(false)

I get an error because of the string-to-timestamp conversion. Is there a workaround?

eivgtgni1#

You can generate a schema from the case class using ScalaReflection:

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection

val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]

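Now you can pass this schema when loading the file into a DataFrame.
Alternatively, if you want to cast some or all of the columns after the DataFrame has been read, you can iterate over the schema fields and cast each column to the corresponding data type, for example with foldLeft: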
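As a minimal sketch of passing the schema at load time, assuming the data arrives as a CSV file with a header row: the temporary file, the `csvPath` and `loaded` names, and the local `spark` session below are made up for illustration and do not come from the question. Depending on how dates are written in the real file, a `timestampFormat` reader option may also be needed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class empModel(firstName: String,
                    age: Integer,
                    jobStartDate: java.sql.Timestamp,
                    isGraduated: Boolean,
                    gender: String,
                    salary: Double)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("schema-on-read")
  .getOrCreate()

// Hypothetical input: a small CSV file written to a temp location so the sketch runs on its own.
val csvPath = Files.createTempFile("employees", ".csv")
val csvText =
  """firstName,age,jobStartDate,isGraduated,gender,salary
    |James,34,2006-01-01,true,M,3000.60
    |Michael,33,1980-01-10,true,F,3300.80
    |""".stripMargin
Files.write(csvPath, csvText.getBytes(StandardCharsets.UTF_8))

// Schema derived from the case class, as in the answer.
val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]

// The reader applies the given schema instead of reading every column as a string.
val loaded = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(csvPath.toString)

loaded.printSchema()
```

With a user-supplied schema the columns are matched by position in the file, so the field order of the case class is assumed to match the column order of the CSV.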

val df = schema.fields.foldLeft(simpleDf){ 
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))     
}

df.printSchema

//root
// |-- firstName: string (nullable = true)
// |-- age: integer (nullable = true)
// |-- jobStartDate: timestamp (nullable = true)
// |-- isGraduated: boolean (nullable = false)
// |-- gender: string (nullable = true)
// |-- salary: double (nullable = false)

df.show
//+---------+---+-------------------+-----------+------+------+
//|firstName|age|       jobStartDate|isGraduated|gender|salary|
//+---------+---+-------------------+-----------+------+------+
//|    James| 34|2006-01-01 00:00:00|       true|     M|3000.6|
//|  Michael| 33|1980-01-10 00:00:00|       true|     F|3300.8|
//|   Robert| 37|1995-01-05 00:00:00|      false|     M|5000.5|
//+---------+---+-------------------+-----------+------+------+
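Once the columns have been cast, the `as[empModel]` call from the question should no longer fail on the string-to-timestamp conversion. The sketch below is one possible way to put the pieces together and is not part of the original answer. It assumes a local `SparkSession` named `spark`, derives the schema through the public `Encoders.product` API rather than `ScalaReflection`, and casts all columns in a single `select` instead of a chain of `withColumn` calls. The names `castCols` and `typedDs` are illustrative.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col

case class empModel(firstName: String,
                    age: Integer,
                    jobStartDate: java.sql.Timestamp,
                    isGraduated: Boolean,
                    gender: String,
                    salary: Double)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-to-case-class")
  .getOrCreate()
import spark.implicits._

val simpleDf = Seq(
  ("James", 34, "2006-01-01", "true", "M", 3000.60),
  ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
  ("Robert", 37, "1995-01-05", "false", "M", 5000.50)
).toDF("firstName", "age", "jobStartDate", "isGraduated", "gender", "salary")

// Schema derived from the case class via the public Encoders API.
val schema = Encoders.product[empModel].schema

// One cast expression per case class field, keeping the original column name.
val castCols = schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name))

// Apply all casts in a single projection, then view the rows as empModel instances.
val typedDs = simpleDf.select(castCols: _*).as[empModel]

typedDs.printSchema()
typedDs.show(false)
```

A single `select` keeps the query plan flat, which tends to matter more as the number of columns grows. Values that cannot be cast become null under Spark's default (non-ANSI) settings, so a null in a field declared as a primitive such as `Boolean` or `Double` would fail when rows are deserialized into `empModel`.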
