Python function to add binary columns to a PySpark df

jtoj6r0c  posted on 2024-01-06  in Spark

I have a DataFrame like productusage:
| featureSk | PersonNumber |
| -- | -- |
| A | 1001 |
| B | 1001 |
| C | 1003 |
| C | 1004 |
| A | 1002 |
| B | 1005 |
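I need to create a Python function that takes a list of person numbers as input and outputs a DataFrame whose columns are the featureSk values from productusage. Basically, there should be one column for each featureSk value and one row per person, with a 0 if the PersonNumber does not appear in productusage for that feature and a 1 if it does.
The output should be a pandas DataFrame like: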
| PersonNumber | A | B | C |
| -- | -- | -- | -- |
| 1001 | 1 | 1 | 0 |
| 1002 | 1 | 0 | 0 |
| 1003 | 0 | 0 | 1 |
This is what I tried:

    def add_featureSk_to_dataframe(Person_list):
        Person_list = pd.DataFrame(Person_list)
        df = productusage
        unique_values = df[featureSk].unique()
        for value in unique_vaues:
            for person in Persons_list:
                df = df.withColumn(value, lambda person: 1 if person in Persons_list else 0)
        return df

    person_test = [1001,1002,1003]
    add_featureSk_to_dataframe(person_test)

I get an error saying that featureSk is not defined, even though productusage is defined.

9q78igpj #1

Use pd.crosstab:

    import pandas as pd

    # df is assumed to be a pandas DataFrame holding the productusage data
    out = pd.crosstab(df["PersonNumber"], df["featureSk"])
    vals = [1001, 1002, 1003]
    # with .reindex the missing vals are filled with 0
    print(out.reindex(vals, fill_value=0))

Prints:

    featureSk     A  B  C
    PersonNumber
    1001          1  1  0
    1002          1  0  0
    1003          0  0  1
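
One caveat with the crosstab approach: `pd.crosstab` counts occurrences, so a person who appears twice with the same featureSk would get a 2 rather than a 1. Below is a minimal sketch of the same idea wrapped in a function and capped at 1. The function name `feature_flags` and the extra person 9999 are made up for illustration, and the sample data is copied from the question's table.

```python
import pandas as pd

# Sample data copied from the question's productusage table.
productusage = pd.DataFrame({
    "featureSk": ["A", "B", "C", "C", "A", "B"],
    "PersonNumber": [1001, 1001, 1003, 1004, 1002, 1005],
})


def feature_flags(usage, person_list):
    """Return one row per requested person and one 0/1 column per featureSk.

    `feature_flags` is a hypothetical helper name, not part of pandas.
    """
    counts = pd.crosstab(usage["PersonNumber"], usage["featureSk"])
    # crosstab counts occurrences, so cap at 1 to get a binary flag.
    flags = counts.clip(upper=1)
    # Persons that never appear in `usage` get an all-zero row.
    return flags.reindex(person_list, fill_value=0)


# 9999 is a made-up person number that is not in productusage.
result = feature_flags(productusage, [1001, 1002, 1003, 9999])
print(result)
```

If productusage is a Spark DataFrame, it would have to be converted first (for example with `.toPandas()`), which is only reasonable when the data fits in driver memory.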

zazmityj #2

    from pyspark.sql.functions import col, when

    def person_has_product(person_list):
        df = dfPersonQuery
        # Filter df for the required persons
        filtered_df = df.filter(col("personnumber").isin(person_list))
        # Perform crosstab on the person and product columns
        cross_tab_result = filtered_df.crosstab("personnumber", "featureSk").withColumnRenamed("personnumber_featureSk", "personnumber")
        # Turn the counts into 0/1 flags for every featureSk column
        for column in cross_tab_result.drop("personnumber").columns:
            cross_tab_result = cross_tab_result.withColumn(column, when(col(column) > 0, 1).otherwise(0))
        # Return a pandas DataFrame, as the question asks
        return cross_tab_result.toPandas()

    person_lst = [1001, 1002, 1003]
    print(person_has_product(person_lst))
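
Note that filtering with `isin` only keeps persons that actually occur in the data, so a requested person with no rows at all will be missing from the result instead of showing up as a row of zeros. A possible way to fill those in on the pandas side is sketched below. The frame `spark_result` is a hand-written stand-in for what `toPandas()` might return, and `add_missing_persons` and the person number 9999 are made-up names for illustration. The key column is cast to int because, as far as I know, Spark's crosstab returns it as strings.

```python
import pandas as pd

# Hypothetical stand-in for the pandas frame returned by toPandas();
# the key column is written as strings on purpose (see the note above).
spark_result = pd.DataFrame({
    "personnumber": ["1001", "1003"],
    "A": [1, 0],
    "B": [1, 0],
    "C": [0, 1],
})


def add_missing_persons(flags, person_list, key="personnumber"):
    """Index by person and add an all-zero row for every person not present.

    `add_missing_persons` is a hypothetical helper, not a Spark or pandas API.
    """
    # Cast the key to int so it matches the integers in person_list.
    indexed = flags.assign(**{key: flags[key].astype(int)}).set_index(key)
    return indexed.reindex(person_list, fill_value=0)


# 9999 is a made-up person number with no rows in the data.
completed = add_missing_persons(spark_result, [1001, 1003, 9999])
print(completed)
```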
