Casting a string to ArrayType(DoubleType) in a PySpark DataFrame

Posted by fxnxkyjh on 2021-05-29 in Spark


I have a DataFrame in Spark with the following schema:

StructType(List(StructField(id,StringType,true),
StructField(daily_id,StringType,true),
StructField(activity,StringType,true)))

The activity column is a string. Sample content:
{1.33,0.567,1.897,0,0.78}
I need to cast the activity column to ArrayType(DoubleType).
To do that, I ran the following command:

df = df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

The DataFrame's schema changed accordingly:

StructType(List(StructField(id,StringType,true),
StructField(daily_id,StringType,true),
StructField(activity,ArrayType(DoubleType,true),true)))

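However, the data now looks like this: [null, 0.567, 1.897, 0, null]
The first and last elements of the string array have been turned into null. I don't understand why Spark does this to the DataFrame.
What is the problem here?
Thanks a lot.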

python DataFrame apache-spark casting Arrays

Source: https://stackoverflow.com/questions/62342328/casting-string-to-arraytypedoubletype-pyspark-dataframe

4 Answers


nwsw7zdq1#

This happens because the following code does not remove the `{` and `}` characters:
df.withColumn("activity",F.split(F.col("activity"),",\s*")).show(truncate=False)
+-------------------------------+
|activity |
+-------------------------------+
|[{1.33, 0.567, 1.897, 0, 0.78}]|
+-------------------------------+

When you try to cast the string values `{1.33` and `0.78}` to `DoubleType`, you get `null` as the output.

df.withColumn("activity",F.split(F.col("activity"),",\s*").cast(ArrayType(DoubleType()))).show(truncate=False)
+----------------------+
|activity |
+----------------------+
|[, 0.567, 1.897, 0.0,]|
+----------------------+
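The nulls can be reproduced outside Spark. The sketch below is a plain-Python analogy, not Spark code: `to_double_or_none` is a hypothetical helper that mimics the cast behaviour shown above, where a string that does not parse as a number becomes null (`None` here).

```python
import re

def to_double_or_none(text):
    """Hypothetical helper: parse a float, or return None where Spark shows null."""
    try:
        return float(text)
    except ValueError:
        return None

raw = "{1.33,0.567,1.897,0,0.78}"

# Same split pattern as in the question: a comma followed by optional whitespace.
parts = re.split(r",\s*", raw)

# The braces stay attached to the first and last pieces, so those two fail to parse.
values = [to_double_or_none(p) for p in parts]

print(parts)
print(values)
```

Only the two pieces that still carry a brace fail to parse, which matches the positions of the nulls in the question.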

Change this

df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

to

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
from pyspark.sql.types import DoubleType

df.select(F.split(F.regexp_replace(F.col("activity"),"[{ }]",""),",").cast("array<double>").alias("activity"))


0g0grzrc2#

Try this:

val df = Seq("{1.33,0.567,1.897,0,0.78}").toDF("activity")
    df.show(false)
    df.printSchema()
    /**
      * +-------------------------+
      * |activity                 |
      * +-------------------------+
      * |{1.33,0.567,1.897,0,0.78}|
      * +-------------------------+
      *
      * root
      * |-- activity: string (nullable = true)
      */
    val processedDF = df.withColumn("activity",
      split(regexp_replace($"activity", "[^0-9.,]", ""), ",").cast("array<double>"))
    processedDF.show(false)
    processedDF.printSchema()
    /**
      * +-------------------------------+
      * |activity                       |
      * +-------------------------------+
      * |[1.33, 0.567, 1.897, 0.0, 0.78]|
      * +-------------------------------+
      *
      * root
      * |-- activity: array (nullable = true)
      * |    |-- element: double (containsNull = true)
      */
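The two regex-based answers use different patterns: `[{ }]` removes only braces and spaces, while `[^0-9.,]` removes everything except digits, dots, and commas. The sketch below compares them using Python's `re` module as a stand-in for Spark's `regexp_replace`. `strip_and_split` is a hypothetical helper, and the `tricky` input is an assumed example that does not appear in the question. For character classes this simple, Spark's Java-based regex is expected to behave the same way, but that is an assumption worth checking against real data.

```python
import re

def strip_and_split(text, pattern):
    """Hypothetical helper: delete characters matching `pattern`, then split on commas."""
    return re.sub(pattern, "", text).split(",")

sample = "{1.33,0.567,1.897,0,0.78}"
keep_numeric = strip_and_split(sample, r"[^0-9.,]")  # pattern from this answer
drop_braces = strip_and_split(sample, r"[{ }]")      # pattern from the first answer

# Assumed input, not from the question: a negative value and an exponent.
tricky = "{-1.5,2e3}"
tricky_numeric = strip_and_split(tricky, r"[^0-9.,]")  # minus sign and 'e' are deleted
tricky_braces = strip_and_split(tricky, r"[{ }]")      # minus sign and 'e' survive

print(keep_numeric, drop_braces)
print(tricky_numeric, tricky_braces)
```

On the question's sample both patterns produce the same pieces. If the column can contain negative numbers or scientific notation, the narrower `[{ }]` pattern keeps those values intact, whereas `[^0-9.,]` silently changes them.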


gfttwv5a3#

A simple way using Spark SQL (no regex):

df2=(df1
     .withColumn('col1',expr("""
     transform(
     split(
     substring(activity,2,length(activity)-2),','),
     x->DOUBLE(x))
     """))
    )


2cmtqfgy4#

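This happens because the first and last characters of the string are the braces themselves, so the elements containing them are cast to null.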

testdf.withColumn('activity',f.split(f.col('activity').substr(f.lit(2),f.length(f.col('activity'))-2),',').cast(t.ArrayType(t.DoubleType()))).show(2, False)
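The substring arithmetic in this answer (start at position 2, take `length - 2` characters) relies on Spark's substring positions being 1-based. The plain-Python sketch below mirrors that arithmetic. `spark_like_substring` is a hypothetical helper written for illustration, and it assumes every value has exactly one leading and one trailing brace with no surrounding whitespace.

```python
def spark_like_substring(text, pos, length):
    """Hypothetical helper mirroring a 1-based substring(text, pos, length) for pos >= 1."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]

raw = "{1.33,0.567,1.897,0,0.78}"

# Skip the first character and drop the last one: the same as raw[1:-1].
inner = spark_like_substring(raw, 2, len(raw) - 2)
values = [float(x) for x in inner.split(",")]

print(inner)
print(values)
```

If some rows carry extra whitespace or lack the braces, this fixed-offset approach would cut off real digits, so a regex-based cleanup may be safer for such data.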
