Hashing an array of strings in PySpark

3okqufwl  posted on 2024-01-06  in Spark

I need to hash/categorize a column of a DataFrame in PySpark.

    df.printSchema()
    root
     |-- col1: string (nullable = true)
     |-- col2: string (nullable = true)
     |-- keys: array (nullable = true)
     |    |-- element: string (containsNull = true)

The DataFrame looks like this:

    df.show()
    +----+----+-----------------------------------------------------------+
    |col1|col2|keys                                                       |
    +----+----+-----------------------------------------------------------+
    |   A|   b|array ["name:ck", "birth:FR", "country:FR", "job:Request"] |
    |   B|   d|array ["name:cl", "birth:DE", "country:FR", "job:Request"] |
    |   C|   d|array ["birth:FR", "name:ck", "country:FR", "job:Request"] |
    +----+----+-----------------------------------------------------------+


However, I get the following error when I try this:

    from pyspark.sql.functions import col, sha2

    df_hashed_1 = df\
        .withColumn('HashedID', sha2(col('keys'), 256))\
        .select('col1', 'col2', 'HashedID')


Error: cannot resolve 'sha2(spark_catalog.default.posintegrationlogkeysevent.keys, 256)' due to data type mismatch: argument 1 requires binary type, however, 'spark_catalog.df.keys' is of array<string> type.;
How can I hash/categorize a column of this type?
I tried pyspark.sql.functions.sha2.

v6ylcynt #1

sha2 expects a string or binary column, so you can join the array's elements into a single string first:

    from pyspark.sql import functions as F

    # Assumes an active SparkSession is available as `spark`.
    _data = [
        (4, 'idA', ['name:ck', 'birth:FR', 'country:FR', 'job:Request'], ),
        (5, 'idA', ['name:cl', 'birth:DE', 'country:FR', 'job:Request'], ),
    ]
    df = spark.createDataFrame(_data, ['col_a', 'col_b', 'keys'])

    # Concatenate the array elements into one string, then hash that string.
    joined_array = F.array_join('keys', delimiter='')
    sha_col = F.sha2(joined_array, 256)

    cols = [
        F.col('col_a'),
        F.col('col_b'),
        sha_col.alias('hashed_id'),
    ]
    df.select(cols).show(10, False)
    # +-----+-----+----------------------------------------------------------------+
    # |col_a|col_b|hashed_id                                                       |
    # +-----+-----+----------------------------------------------------------------+
    # |4    |idA  |fd9016141123b1a2b1f07bbc798a727293c0467a206f2a32096e5c310ebd6a26|
    # |5    |idA  |7845f6f4fa706c7ed3748dd21924d192bd1b443797b2349f81144df1185f2bb6|
    # +-----+-----+----------------------------------------------------------------+
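
One thing to watch with an empty delimiter: different arrays can join to the same string (for example `['ab', 'c']` and `['a', 'bc']`), and rows A and C in the question hold the same keys in a different order, so they would hash differently. The plain-Python sketch below mirrors the join-then-hash idea with `hashlib`, on the assumption that Spark's `sha2` hashes the UTF-8 bytes of the joined string and returns lowercase hex. The helper `hash_keys`, its `sort` flag, and the `'|'` delimiter are hypothetical choices for illustration, not part of the original answer.

```python
import hashlib


def hash_keys(keys, delimiter="", sort=False):
    """SHA-256 hex digest of the joined keys (hypothetical helper)."""
    if sort:
        keys = sorted(keys)  # make the hash independent of element order
    joined = delimiter.join(keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


row_a = ["name:ck", "birth:FR", "country:FR", "job:Request"]
row_c = ["birth:FR", "name:ck", "country:FR", "job:Request"]

# Same elements, different order: the plain join gives different digests.
print(hash_keys(row_a) == hash_keys(row_c))                        # False
# Sorting first makes the two rows hash identically.
print(hash_keys(row_a, sort=True) == hash_keys(row_c, sort=True))  # True

# An empty delimiter cannot tell these two arrays apart ...
print(hash_keys(["ab", "c"]) == hash_keys(["a", "bc"]))            # True
# ... while a delimiter that never occurs in the data can.
print(hash_keys(["ab", "c"], "|") == hash_keys(["a", "bc"], "|"))  # False
```

In PySpark the same idea could probably be expressed as `F.sha2(F.array_join(F.array_sort('keys'), '|'), 256)`. Whether to sort depends on whether key order should affect the resulting ID.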
