Append a value to every element of a list column in a PySpark DataFrame

ou6hu8tu · published 2021-05-29 · in Spark

I have a DataFrame like this:

Data           ID   

[1,2,3,4]        22

I want to create a new column where each entry consists of the values from the Data field, each with the ID appended after a `|`, joined with the `~` symbol, like this:

Data         ID               New_Column

[1,2,3,4]     22               [1|22~2|22~3|22~4|22]

Note: the array in the Data field is not of fixed size. It may have no entries, or n entries. Can anyone help me solve this?

iih3973s 1#

package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (22, Seq(1,2,3,4)),
    (23, Seq(1,2,3,4,5,6,7,8)),
    (24, Seq())
  ).toDF("ID", "Data")

  // UDF that joins each array element with the ID using "|",
  // separates the pairs with "~", and wraps the result in brackets.
  // Parameter types match the data (Int), and the redundant lit() is dropped.
  val arrUDF = udf((id: Int, array: Seq[Int]) => {
    val r = array.size match {
      case 0 => ""
      case _ => array.map(x => s"$x|$id").mkString("~")
    }

    s"[$r]"
  })

  val resDF = df.withColumn("New_Column", arrUDF('ID, 'Data))

  resDF.show(false)
  //+---+------------------------+-----------------------------------------+
  //|ID |Data                    |New_Column                               |
  //+---+------------------------+-----------------------------------------+
  //|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
  //|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
  //|24 |[]                      |[]                                       |
  //+---+------------------------+-----------------------------------------+

}
pcrecxhr 2#

A UDF can help:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Join each array element with the ID using "|", separated by "~"
def func(array, suffix):
    return '~'.join(str(x) + '|' + str(suffix) for x in array)

my_udf = F.udf(func, StringType())

df.withColumn("New_Column", my_udf("Data", "ID")).show()

Prints:

+------------+---+-------------------+
|        Data| ID|         New_Column|
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+
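Note that this output has no surrounding brackets, unlike the format requested in the question, and an empty array simply yields an empty string. A minimal pure-Python sketch of the bracketed formatting (the same function could then be wrapped with `F.udf(..., StringType())`; `format_with_id` is a name chosen here for illustration):

```python
def format_with_id(array, id_):
    # Join each element with the ID using "|", separate pairs with "~",
    # and wrap the whole string in brackets; empty arrays yield "[]".
    if not array:
        return "[]"
    return "[" + "~".join(f"{x}|{id_}" for x in array) + "]"

print(format_with_id([1, 2, 3, 4], 22))  # -> [1|22~2|22~3|22~4|22]
print(format_with_id([], 24))            # -> []
```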
sc4hvdpw 3#

For Spark 2.4+, the PySpark equivalent looks like this:

from pyspark.sql import functions as f

df = spark.createDataFrame([(22, [1,2,3,4]), (23, [1,2,3,4,5,6,7,8]), (24, [])], ['Id','Data'])

df.show()

+---+--------------------+
| Id|                Data|
+---+--------------------+
| 22|        [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24|                  []|
+---+--------------------+

df.withColumn('ff', f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))'''))).show(20,False)

+---+------------------------+---------------------------------------+
|Id |Data                    |ff                                     |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4]            |1|22~2|22~3|22~4|22                    |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[]                      |                                       |
+---+------------------------+---------------------------------------+
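The key piece above is the Spark 2.4+ higher-order function `transform`, which applies a lambda to each array element inside the SQL engine; `concat_ws('~', ...)` then joins the results into one string. Their combined effect mirrors this plain-Python sketch (for illustration only, not Spark code):

```python
def transform_concat_ws(data, id_):
    # transform(Data, x -> concat(x, '|', Id)): map each element to "x|id"
    transformed = [f"{x}|{id_}" for x in data]
    # concat_ws('~', ...): join the mapped values with "~"
    return "~".join(transformed)

print(transform_concat_ws([1, 2, 3, 4], 22))  # -> 1|22~2|22~3|22~4|22
print(transform_concat_ws([], 24))            # -> (empty string)
```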

If you want the final output to be an array:

df.withColumn('ff',f.array(f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))''')))).show(20,False)

+---+------------------------+-----------------------------------------+
|Id |Data                    |ff                                       |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[]                      |[]                                       |
+---+------------------------+-----------------------------------------+

Hope this helps.
