I need to hash/categorize a column in a PySpark DataFrame.
df.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
The DataFrame looks like this:
df.show()
+----+----+-----------------------------------------------------------+
|col1|col2| keys |
+----+----+-----------------------------------------------------------+
| A| b|array ["name:ck", "birth:FR", "country:FR", "job:Request"] |
| B| d|array ["name:cl", "birth:DE", "country:FR", "job:Request"] |
| C| d|array ["birth:FR", "name:ck", "country:FR", "job:Request"] |
+----+----+-----------------------------------------------------------+
However, I get the following error when I try this:
df_hashed_1 = df\
.withColumn('HashedID', sha2(col('keys'), 256))\
.select('col1', 'col2', 'HashedID')
Error: cannot resolve 'sha2(spark_catalog.default.posintegrationlogkeysevent.keys, 256)' due to data type mismatch: argument 1 requires binary type, however, 'spark_catalog.df.keys' is of array<string> type.
How can I hash/categorize a column of this type? I have tried pyspark.sql.functions.sha2.
1 Answer
sha2 expects a string or binary column, so you can concatenate the elements of the array first: