pyspark: How to read different types of segments from an input file in Spark Scala?

bvjveswy · posted 2023-10-15 · in Spark

Sample data arrives in a single input file and contains three segment types: BS (Borrower), RS (Relationship), and CR (Credit / trade line). Each segment has its own layout. How can the file be read and each segment parsed separately using Spark?

Segment layout definitions:

BS - Borrower: Indicator, First name, Last name, Company name, Joining date
RS - Relationship: Indicator, Name, Role, Company name
CR - Credit facility: Indicator, Type of loan, Account number, Sanctioned date, Amount

Sample data:

BS,Rohan,Mundle,Infy,20230101
RS,Sohan Mundle,Director,Croma
CR,Home Loan, 10023045, 20200101, 10000.00
BS,Priyatee,Sinha,L&T,20220101
RS,Mohan Mehta,Owner, ABC Tech
CR,Home Loan, 20023045, 20200301, 50000.00

How can the above data be read in Spark Scala using a type-safe approach?

e4yzc0pl1#

You can extract the three different layouts from one text file by reading the data into a Dataset[String], filtering the lines by their indicator, and assigning the correct schema to each subset:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types._

// Read the mixed file as plain text lines first
val mixedData: Dataset[String] = spark.read.textFile("sampleData.csv")

// Keep only the lines of one segment type and parse them as CSV with that segment's schema
def readWithSchema(indicator: String, schema: StructType): DataFrame = {
  val segmentData = mixedData.filter(_.startsWith(indicator))
  spark.read.schema(schema).csv(segmentData)
}

val borrowerSchema = StructType(
  Seq(
    StructField(name = "Indicator", dataType = StringType),
    StructField(name = "First name", dataType = StringType),
    StructField(name = "Last name", dataType = StringType),
    StructField(name = " Company name", dataType = StringType),
    StructField(name = " joining date", dataType = StringType)
  )
)
val borrowers = readWithSchema("BS", borrowerSchema)
// Declare schemas for `Relationship` and `Credit facility` likewise, and read them with `readWithSchema`
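For completeness, the remaining two schemas can be declared the same way. This is a sketch, not part of the original answer; the field names follow the layout definitions above, and everything is kept as StringType (typed columns such as Amount can be cast afterwards, since the sample data contains leading spaces around values):

```scala
import org.apache.spark.sql.types._

// Sketch of the remaining segment schemas, mirroring borrowerSchema
val relationshipSchema = StructType(Seq(
  StructField("Indicator", StringType),
  StructField("Name", StringType),
  StructField("Role", StringType),
  StructField("Company name", StringType)
))

val creditSchema = StructType(Seq(
  StructField("Indicator", StringType),
  StructField("Type of loan", StringType),
  StructField("Account number", StringType),
  StructField("Sanctioned date", StringType),
  StructField("Amount", StringType)  // cast to DecimalType after trimming if needed
))

val relationships = readWithSchema("RS", relationshipSchema)
val credits = readWithSchema("CR", creditSchema)
```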

Output:

+---------+----------+---------+-------------+-------------+
|Indicator|First name|Last name| Company name| joining date|
+---------+----------+---------+-------------+-------------+
|BS       |Rohan     |Mundle   |Infy         |20230101     |
|BS       |Priyatee  |Sinha    |L&T          |20220101     |
+---------+----------+---------+-------------+-------------+
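Since the question asks for a type-safe approach, each segment can also be modelled as a case class instead of a generic DataFrame row. Below is a minimal, Spark-free sketch of the mapping; the names `Segment`, `Borrower`, `Relationship`, `Credit`, and `parseSegment` are illustrative, not from the original answer. In Spark you could apply the same parsing per segment, e.g. `mixedData.filter(_.startsWith("BS")).map(...)` to obtain a `Dataset[Borrower]`:

```scala
// One case class per segment layout, under a common sealed trait
sealed trait Segment
case class Borrower(firstName: String, lastName: String, company: String, joiningDate: String) extends Segment
case class Relationship(name: String, role: String, company: String) extends Segment
case class Credit(loanType: String, accountNumber: String, sanctionedDate: String, amount: BigDecimal) extends Segment

// Pick the case class from the indicator; trim fields because the
// sample data has spaces after some commas
def parseSegment(line: String): Option[Segment] = {
  val f = line.split(",").map(_.trim)
  f.headOption match {
    case Some("BS") => Some(Borrower(f(1), f(2), f(3), f(4)))
    case Some("RS") => Some(Relationship(f(1), f(2), f(3)))
    case Some("CR") => Some(Credit(f(1), f(2), f(3), BigDecimal(f(4))))
    case _          => None // unknown or malformed segment
  }
}
```

Returning `Option[Segment]` makes malformed lines explicit rather than failing the whole job on one bad record.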
