How to create a PySpark DataFrame of combinations from a list column

Posted by 7gyucuyw on 2021-07-13 in Spark

I currently have a PySpark DataFrame that looks like this:

+--------------------+
|               items|
+--------------------+
|        [1, 2, 3, 4]|
|           [1, 5, 7]|
|             [9, 10]|
|                 ...|

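My goal is to transform this DataFrame (or create a new one) so that the new data contains the length-two combinations of the items in each row.
I know itertools.combinations can create the combinations of a list, but I need a way to do this efficiently on a large amount of data, and I can't figure out how to integrate it with PySpark.
Example result: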

+-------------+-------------+
|        item1|        item2|
+-------------+-------------+
|            1|            2|
|            2|            1|
|            1|            3|
|            3|            1|
|            1|            4|
|            4|            1|
|            2|            3|
|            3|            2|
|            2|            4|
|            4|            2|
|            3|            4|
|            4|            3|
|            1|            5|
|            5|            1|
|            1|            7|
|            7|            1|
|            5|            7|
|            7|            5|
|            9|           10|
|           10|            9|
|                        ...|
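As a side note, the sample result above lists every pair in both orders, which matches ordered pairs (itertools.permutations) rather than unordered ones (itertools.combinations). A minimal plain-Python sketch of the difference for the first row, with no Spark involved and with variable names chosen only for illustration:

```python
import itertools

# First row of the example DataFrame.
items = [1, 2, 3, 4]

# Unordered pairs: each pair of items appears exactly once.
unordered = list(itertools.combinations(items, 2))

# Ordered pairs: each pair appears in both orders, as in the sample result.
ordered = list(itertools.permutations(items, 2))

print(unordered)     # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(ordered))  # 12
```

Which of the two is wanted decides whether each list of n items contributes n*(n-1)/2 or n*(n-1) output rows.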

python apache-spark pyspark apache-spark-sql pyspark-dataframes

Source: https://stackoverflow.com/questions/66109410/how-to-create-a-pyspark-dataframe-of-combinations-from-list-column

1 Answer

hpxqektj1#

You can use itertools.combinations inside a UDF:

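import itertools
from pyspark.sql import functions as F

# UDF that returns every 2-element combination of the input list
# as an array of (item1, item2) structs.
combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)),
    "array<struct<item1:int,item2:int>>"
)

# Explode the array into one row per pair, then expand the struct
# into separate item1 and item2 columns.
df1 = df.withColumn("items", F.explode(combinations_udf(F.col("items")))) \
    .selectExpr("items.*")

df1.show()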

# +-----+-----+
# |item1|item2|
# +-----+-----+
# |1    |2    |
# |1    |3    |
# |1    |4    |
# |2    |3    |
# |2    |4    |
# |3    |4    |
# |1    |5    |
# |1    |7    |
# |5    |7    |
# |9    |10   |
# +-----+-----+
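The output above contains each pair once, while the sample result in the question also lists the reversed pair. If both orders are needed, one possible variation is sketched below. It assumes the same `items` column of integers. The names `both_orders` and `explode_both_orders` are hypothetical helpers introduced here for illustration, not part of PySpark.

```python
import itertools


def both_orders(items):
    """Return each 2-combination followed by its reverse, e.g. (1, 2), (2, 1)."""
    pairs = []
    # Treat a null array as empty so the UDF does not fail on missing data.
    for a, b in itertools.combinations(items or [], 2):
        pairs.append((a, b))
        pairs.append((b, a))
    return pairs


def explode_both_orders(df, column="items"):
    """Hypothetical helper: one output row per ordered pair in `column`."""
    # Imported lazily so both_orders can be checked without Spark installed.
    from pyspark.sql import functions as F

    pairs_udf = F.udf(both_orders, "array<struct<item1:int,item2:int>>")
    return (
        df.select(F.explode(pairs_udf(F.col(column))).alias("pair"))
        .select("pair.*")
    )
```

Calling `explode_both_orders(df).show()` on the question's DataFrame should then list (1, 2) followed by (2, 1), and so on. Since this still runs a Python UDF, on very large data it may be worth comparing it against an approach built only from Spark's built-in functions.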

2021-07-13
