I have a PySpark DataFrame with a single column:
df.show(1)
table
[[,,hello,yes],[take,no,I,m],[hi,good,,]....]
df.printSchema()
root
|-- table: string (nullable = true)
My question is: how can I convert this column to an array of type t.ArrayType(t.ArrayType(t.StringType()))?
2 answers
ctzwtxfj1#
Try this (Spark >= 2.4): use a combination of `translate` and `regexp_replace`.
```scala
// Needed for toDF outside spark-shell; assumes a SparkSession named `spark`.
import spark.implicits._

val df = Seq("[[,,hello,yes],[take,no,I,m],[hi,good,,]]").toDF("table")
df.show(false)
df.printSchema()
/**
 * +-----------------------------------------+
 * |table                                    |
 * +-----------------------------------------+
 * |[[,,hello,yes],[take,no,I,m],[hi,good,,]]|
 * +-----------------------------------------+
 *
 * root
 *  |-- table: string (nullable = true)
 */
```
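The snippet above stops before the actual transformation. As a rough illustration of the parsing that has to happen, here is a minimal plain-Python sketch. It is not Spark code and not necessarily what the answer intended. The function name `parse_nested` is hypothetical, and the sketch assumes the string always starts with `[[`, ends with `]]`, and that items never contain commas or brackets.

```python
from typing import List, Optional


def parse_nested(text: Optional[str]) -> Optional[List[List[str]]]:
    """Parse '[[a,b],[c,d]]' (unquoted items) into [['a', 'b'], ['c', 'd']]."""
    if text is None:
        return None
    s = text.strip()
    # Assumption: the value is always wrapped in '[[' ... ']]'.
    if not (s.startswith("[[") and s.endswith("]]")):
        raise ValueError(f"unexpected format: {text!r}")
    # Drop the outer brackets so only 'a,b],[c,d' remains.
    inner = s[2:-2]
    # Split rows on the '],[' boundary, then split each row on commas.
    # Empty items are kept as empty strings.
    return [row.split(",") for row in inner.split("],[")]


rows = parse_nested("[[,,hello,yes],[take,no,I,m],[hi,good,,]]")
print(rows)
```

In Spark, the same steps would probably map onto `regexp_replace` (to strip the outer brackets) and `split` (on `],[` and then on `,`). That mapping is an assumption here, since the answer's code is cut off.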
o2g1uqev2#
Use the `from_json` function, available in Spark 2.4+.
Example:
```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

df.show(10, False)
# +---------------------------------------------+
# |table                                        |
# +---------------------------------------------+
# |[['','','hello','yes'],['take','no','i','m']]|
# +---------------------------------------------+

df.printSchema()
# root
#  |-- table: string (nullable = true)

# Target schema: array of arrays of strings
sch = ArrayType(ArrayType(StringType()))

df.withColumn("dd", from_json(col("table"), sch)).select("dd").show(10, False)
# +------------------------------------+
# |dd                                  |
# +------------------------------------+
# |[[, , hello, yes], [take, no, i, m]]|
# +------------------------------------+

# Schema after converting to array
df.withColumn("dd", from_json(col("table"), sch)).select("dd").printSchema()
# root
#  |-- dd: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: string (containsNull = true)
```