How do I convert a string to an array in PySpark?

Posted by 8aqjt8rx on 2021-05-27 in Spark

I have a PySpark DataFrame with a single column:

    df.show(1)
    table
    [[,,hello,yes],[take,no,I,m],[hi,good,,]....]
    df.printSchema()
    root
     |-- table: string (nullable = true)

My question is: how can I convert this column to an array of type T.ArrayType(T.ArrayType(T.StringType()))?

Answer #1 (ctzwtxfj)

Try this (Spark >= 2.4), using a combination of translate and regexp_replace. The snippet is Scala:
```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq("[[,,hello,yes],[take,no,I,m],[hi,good,,]]").toDF("table")
df.show(false)
df.printSchema()
/**
  * +-----------------------------------------+
  * |table                                    |
  * +-----------------------------------------+
  * |[[,,hello,yes],[take,no,I,m],[hi,good,,]]|
  * +-----------------------------------------+
  *
  * root
  * |-- table: string (nullable = true)
  */

// Replace each "],[" row boundary with "##", strip the remaining brackets,
// then split on "##" to get one string per inner array.
val p = df.withColumn("arr", split(
  translate(
    regexp_replace($"table", """\]\s*,\s*\[""", "##"), "][", ""
  ), "##"
))

// Split each row string on "," to get array<array<string>>.
val processed = p.withColumn("arr", expr("TRANSFORM(arr, x -> split(x, ','))"))

processed.show(false)
processed.printSchema()

/**
  * +-----------------------------------------+----------------------------------------------------+
  * |table                                    |arr                                                 |
  * +-----------------------------------------+----------------------------------------------------+
  * |[[,,hello,yes],[take,no,I,m],[hi,good,,]]|[[, , hello, yes], [take, no, I, m], [hi, good, , ]]|
  * +-----------------------------------------+----------------------------------------------------+
  *
  * root
  * |-- table: string (nullable = true)
  * |-- arr: array (nullable = true)
  * |    |-- element: array (containsNull = true)
  * |    |    |-- element: string (containsNull = true)
  */
```
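
The snippet above is Scala, while the question asks about PySpark. Below is a rough Python sketch of the same four steps, not a tested drop-in. `parse_nested` traces the steps on a plain string so the intermediate logic can be checked without a cluster. `to_nested_array` expresses them as column expressions with `regexp_replace`, `translate`, `split` and `expr` from `pyspark.sql.functions`. Both function names and the `col_name`/`out_col` parameters are made up for this illustration, and the pyspark import is deferred into the function body only so that the plain-Python part runs without Spark installed.

```python
import re


def parse_nested(s):
    """Trace the answer's four steps on a plain Python string."""
    # regexp_replace: turn every "],[" boundary between rows into "##"
    joined = re.sub(r"\]\s*,\s*\[", "##", s)
    # translate: delete all remaining "[" and "]" characters
    stripped = joined.translate(str.maketrans("", "", "]["))
    # split on "##" to get one string per row
    rows = stripped.split("##")
    # TRANSFORM(arr, x -> split(x, ',')): split each row into its cells
    return [row.split(",") for row in rows]


def to_nested_array(df, col_name="table", out_col="arr"):
    """Hypothetical helper: the same steps as PySpark column expressions."""
    # Deferred import (assumption: keeps parse_nested usable without pyspark)
    from pyspark.sql import functions as F

    flat = F.split(
        F.translate(
            F.regexp_replace(F.col(col_name), r"\]\s*,\s*\[", "##"), "][", ""
        ),
        "##",
    )
    # The SQL TRANSFORM function needs Spark >= 2.4, as stated in the answer
    return df.withColumn(out_col, flat).withColumn(
        out_col, F.expr(f"transform({out_col}, x -> split(x, ','))")
    )


print(parse_nested("[[,,hello,yes],[take,no,I,m],[hi,good,,]]"))
# [['', '', 'hello', 'yes'], ['take', 'no', 'I', 'm'], ['hi', 'good', '', '']]
```

With a DataFrame like the one in the question, `to_nested_array(df).printSchema()` should report `arr` as an array of arrays of strings. Note that "##" is used as a temporary separator, so this approach assumes the cells never contain "##", brackets, or commas.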
Answer #2 (o2g1uqev)

Use the from_json function, available from Spark 2.4+. Example:
```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

df.show(10, False)
# +---------------------------------------------+
# |table                                        |
# +---------------------------------------------+
# |[['','','hello','yes'],['take','no','i','m']]|
# +---------------------------------------------+

df.printSchema()
# root
#  |-- table: string (nullable = true)

# schema
sch = ArrayType(ArrayType(StringType()))

df.withColumn("dd", from_json(col("table"), sch)).select("dd").show(10, False)
# +------------------------------------+
# |dd                                  |
# +------------------------------------+
# |[[, , hello, yes], [take, no, i, m]]|
# +------------------------------------+

# schema after converting to array
df.withColumn("dd", from_json(col("table"), sch)).select("dd").printSchema()
# root
#  |-- dd: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: string (containsNull = true)
```
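
One caveat: the example above feeds from_json a string whose cells are quoted (`''`, `'hello'`), whereas the string in the question has bare cells (`[[,,hello,yes],...]`). As far as I know, from_json expects parseable JSON and returns null otherwise, so the bare form would likely come back as null. A possible workaround, sketched below under assumptions, is to quote every cell first. `quote_cells`, `to_nested_array_json` and the `CELL` pattern are hypothetical names for this illustration. The pattern assumes cells never contain brackets or commas, and the Spark variant additionally assumes they contain no double quotes or backslashes, since it does no escaping. The Spark function is untested; the pyspark import is deferred so the plain-Python check runs on its own.

```python
import json
import re

# Matches one cell: preceded by "[" or ",", followed by "," or "]".
# Assumption: cells never contain "[", "]" or "," themselves.
CELL = r"(?<=[\[,])([^\[\],]*)(?=[,\]])"


def quote_cells(s):
    """Wrap every bare cell in double quotes so the string is valid JSON."""
    return re.sub(CELL, lambda m: json.dumps(m.group(1)), s)


def to_nested_array_json(df, col_name="table", out_col="dd"):
    """Hypothetical helper: quote the cells in Spark, then apply from_json."""
    # Deferred imports (assumption: keeps quote_cells usable without pyspark)
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # "$1" is the Java-regex back-reference to the captured cell
    quoted = F.regexp_replace(F.col(col_name), CELL, '"$1"')
    schema = ArrayType(ArrayType(StringType()))
    return df.withColumn(out_col, F.from_json(quoted, schema))


raw = "[[,,hello,yes],[take,no,I,m],[hi,good,,]]"
print(quote_cells(raw))
# [["","","hello","yes"],["take","no","I","m"],["hi","good","",""]]
print(json.loads(quote_cells(raw)))
# [['', '', 'hello', 'yes'], ['take', 'no', 'I', 'm'], ['hi', 'good', '', '']]
```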
