Complex file parsing in Spark 2.4

quhf5bfb · published 2021-05-27 in Spark
Follow (0) | Answers (2) | Views (397)

Spark 2.4 with Scala.
My source data looks like this:

Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615
Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605
Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265
Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178

The code I used to flatten the file:

val SalespersontextDF = spark.read.text("D:/prints/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")


Unfortunately, instead of populating the Salesperson field correctly, it fills it with the hard-coded string "value" rather than the salesperson number, and the salesperson number itself ends up in the other field.
Any help is much appreciated.


jgwigjjp1#

The approach below might solve your problem:
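
First, a note on why you see "value": spark.read.text returns a DataFrame with a single string column named value, so the generated expression stack(1, 'value', cast(`value` as string)) puts that literal column name into Salesperson. You can confirm this on the DataFrame from your question:

SalespersontextDF.printSchema()
/*
root
 |-- value: string (nullable = true)

*/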

import org.apache.spark.sql.functions._

val SalespersontextDF = spark.read.text("/home/sathya/Desktop/stackoverflo/data/sales.txt")

val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")

val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")

processedDF.show(false)
/*
+-----------+----------------------------------------------------------------------+
|Salesperson|Customer                                                              |
+-----------+----------------------------------------------------------------------+
|value      |Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615|
|value      |Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605 |
|value      |Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265|
|value      |Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178 |
+-----------+----------------------------------------------------------------------+

*/

processedDF
  .withColumn("Salesperson", split($"Customer", ":").getItem(0))
  .withColumn("Customer", split($"Customer", ":").getItem(1))
  .show(false)
/*
+--------------+-------------------------------------------------------+
|Salesperson   |Customer                                               |
+--------------+-------------------------------------------------------+
|Salesperson_21| Customer_575,Customer_2703,Customer_2682,Customer_2615|
|Salesperson_11| Customer_454,Customer_158,Customer_1859,Customer_2605 |
|Salesperson_10| Customer_1760,Customer_613,Customer_3008,Customer_1265|
|Salesperson_4 | Customer_1545,Customer_1312,Customer_861,Customer_2178|
+--------------+-------------------------------------------------------+

*/
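
If you also need one row per customer, here is a minimal follow-up sketch (assuming the comma-separated list should be exploded into individual rows):

import spark.implicits._   // for the $"col" syntax outside spark-shell

// explode turns the comma-separated customer list into one row each;
// trim drops the leading space left after splitting on ":"
val cleanedDF = processedDF
  .withColumn("Salesperson", split($"Customer", ":").getItem(0))
  .withColumn("Customer", explode(split(trim(split($"Customer", ":").getItem(1)), ",")))

cleanedDF.show(false)
/*
+--------------+-------------+
|Salesperson   |Customer     |
+--------------+-------------+
|Salesperson_21|Customer_575 |
|Salesperson_21|Customer_2703|
...
*/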

2lpgd9682#

Try this -

spark.read
  .schema("Salesperson STRING, Customer STRING")
  .option("sep", ":")
  .csv("D:/prints/sales.txt")
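
A possible refinement: the values after the colon keep a leading space (e.g. " Customer_575,..."), which the CSV reader can strip with the ignoreLeadingWhiteSpace option:

spark.read
  .schema("Salesperson STRING, Customer STRING")
  .option("sep", ":")
  .option("ignoreLeadingWhiteSpace", "true")   // trims the blank after ":"
  .csv("D:/prints/sales.txt")
  .show(false)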
