How do I combine all the functions into one column in PySpark?

sauutmhj  posted on 2021-05-27  in Spark

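I'm currently trying to merge all of these transformations into a single column called "Gender". I've managed this in pandas, but now I want to do the same in PySpark, which works a bit differently from pandas. I can't call .apply in PySpark.
Here is the version I wrote with pandas: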

    df['Gender'] = df['Gender'].str.lower()
    male = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"]
    female = ["cis female", "f", "female", "woman", "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female", "trans woman", "female (trans)"]
    other = ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]
    new_df['Gender'] = new_df['Gender'].apply(lambda x:"Male" if x in male else x)
    new_df['Gender'] = new_df['Gender'].apply(lambda x:"Female" if x in female else x)
    new_df['Gender'] = new_df['Gender'].apply(lambda x:"Other" if x in other else x)

This is my attempt to reproduce it in PySpark, but I'm having trouble getting all the converted values back into the "Gender" column:

    from pyspark.sql.functions import lower, col, udf
    import pyspark.sql.functions as f
    na_df = na_df.withColumn('Gender', lower(col('Gender')))
    Male = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"]
    Female = ["cis female", "f", "female", "woman", "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female", "trans woman", "female (trans)"]
    Other = ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]
    na_df2 = na_df.withColumn('Gender',f.when(f.col('Gender').isin(Male),f.lit('Male')).\
        when(f.col('Gender').isin(Other),f.lit('Other')).\
        when(f.col('Gender').isin(Female),f.lit('Female')).\
        otherwise(f.col('Gender'))).show()
    na_df2.select('Gender').distinct().show()

Here is another version I tried, but it gives me an error: Cannot convert column into bool:

    from pyspark.sql.functions import lower, col, udf
    na_df = na_df.withColumn('Gender', lower(col('Gender')))
    genders = {
        'Male': ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"],
        'Female': ["cis female", "f", "female", "woman", "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female", "trans woman", "female (trans)"],
        'Other': ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]
    }
    na_df.withColumn('Gender', (lambda x: [g for g in genders if x in genders[g]][0])(col('Gender'))).show()

As a result, the "Gender" column is never updated. Any suggestions on how to fix this would be appreciated. Thanks in advance!
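
A note on the Cannot convert column into bool error in the last attempt: the lambda is called once with the Column object, not once per row, and Python's in operator needs a plain True or False from each comparison. Comparing a Column yields another Column expression, which cannot be converted to bool. Below is a minimal sketch of a dictionary-driven alternative, assuming the dict is first inverted into a flat lookup and then turned into a Spark map with create_map. The names map_gender and lookup and the shortened dict are hypothetical, introduced only for illustration.

```python
from itertools import chain

# Shortened version of the question's `genders` dict, for illustration only.
genders = {
    "Male": ["male", "m", "man"],
    "Female": ["female", "f", "woman"],
    "Other": ["non-binary", "enby"],
}

# Invert it into a flat {raw value: label} lookup (plain Python, no Spark needed).
lookup = {raw: label for label, raws in genders.items() for raw in raws}


def map_gender(df, column="Gender"):
    """Hypothetical helper: replace known raw values with their label and
    leave values that are not in `lookup` unchanged."""
    # Imported lazily so the dict logic above runs without Spark installed.
    from pyspark.sql import functions as F

    # create_map expects alternating key and value columns: k1, v1, k2, v2, ...
    mapping = F.create_map(*[F.lit(x) for x in chain.from_iterable(lookup.items())])
    raw = F.lower(F.col(column))
    # Assumption: with default (non-ANSI) settings a missing key looks up as
    # null, so coalesce falls back to the lowercased raw value.
    return df.withColumn(column, F.coalesce(mapping[raw], raw))


# Usage, assuming an existing DataFrame `na_df`:
#   na_df2 = map_gender(na_df, "Gender")
#   na_df2.select("Gender").distinct().show()
```

The map lookup is evaluated by Spark for every row, which is what the lambda was meant to do, and the category lists stay in a single dict.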

uubf1zoe #1

You can do this with when(...).otherwise(...) from pyspark.sql.functions:

    import pyspark.sql.functions as f

    # male, female and other are the lowercase lists defined in the question.
    # Sample input DataFrame df:
    # +---+----------+
    # | id|    gender|
    # +---+----------+
    # |  1|      male|
    # |  1|         m|
    # |  1|  male-ish|
    # |  1|     maile|
    # |  1|       mal|
    # |  1|male (cis)|
    # |  1|      make|
    # |  1|     male |
    # |  1|       man|
    # |  1|      msle|
    # |  1|      mail|
    # |  1|      malr|
    # |  1|   cis man|
    # |  1|  cis male|
    # |  1|cis female|
    # |  1|         f|
    # |  1|    female|
    # |  1|     woman|
    # |  1|    femake|
    # |  1|   female |
    # +---+----------+

    # Assign the result back to df; do not chain .show() onto the assignment.
    df = df.withColumn('gender',f.when(f.col('gender').isin(male),f.lit('Male')).\
        when(f.col('gender').isin(other),f.lit('Other')).\
        when(f.col('gender').isin(female),f.lit('Female')).\
        otherwise(f.col('gender')))

    # The table below has an id column and one row per input row, so it is the output of df.show()
    df.show()
    # +---+------+
    # | id|gender|
    # +---+------+
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|  Male|
    # |  1|Female|
    # |  1|Female|
    # |  1|Female|
    # |  1|Female|
    # |  1|Female|
    # |  1|Female|
    # +---+------+

    # To list only the distinct labels:
    df.select('gender').distinct().show()
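
One difference from the question's first attempt is worth spelling out: the answer assigns the result of withColumn directly, with no trailing .show(). DataFrame.show() prints the rows and returns None, so na_df2 = na_df.withColumn(...).show() leaves na_df2 as None and the later na_df2.select(...) call fails. The sketch below wraps the same when/isin chain in a helper. The name normalize_gender and the shortened lists are assumptions for illustration, not part of the original answer.

```python
# Shortened category lists for illustration; the full lists are in the question.
MALE = ["male", "m", "man", "cis male"]
FEMALE = ["female", "f", "woman", "cis female"]
OTHER = ["non-binary", "enby", "queer"]


def normalize_gender(df, column="gender"):
    """Hypothetical helper: return a NEW DataFrame in which `column` is mapped
    to Male / Female / Other; unmatched values are kept (lowercased)."""
    # Imported lazily so the lists above can be reused without Spark installed.
    from pyspark.sql import functions as F

    raw = F.lower(F.col(column))
    return df.withColumn(
        column,
        F.when(raw.isin(MALE), "Male")
        .when(raw.isin(FEMALE), "Female")
        .when(raw.isin(OTHER), "Other")
        .otherwise(raw),
    )


# Usage, assuming an existing DataFrame `na_df`:
#   na_df2 = normalize_gender(na_df, "Gender")   # keep the returned DataFrame
#   na_df2.show()                                # show() only prints; it returns None
#   na_df2.select("Gender").distinct().show()
```

Keeping the transformation and the display as separate statements means the variable always holds a DataFrame, so further select or distinct calls keep working.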