While loop to iterate joins in PySpark

Posted by zsohkypk on 2024-01-06 in Spark

I have a DataFrame:

data = [('service 1','s1', 's2'),
       ('service 2','s2', 's4'),
       ('service 3','s3', 's5'),
       ('service 5','s5', 's6'),
       ('service 4','s4', 's3')]
sdf = spark.createDataFrame(data, schema = ['description', 'service', 'parent'])
sdf.show()
+-----------+-------+--------+
|description|service|parent  |
+-----------+-------+--------+
|  service 1|     s1|      s2|
|  service 2|     s2|      s4|
|  service 3|     s3|      s5|
|  service 5|     s5|      s6|
|  service 4|     s4|      s3|
|  service 6|     s6|    NULL|
+-----------+-------+--------+

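I want to look up each parent value in the service column and add a new column holding the matching description. The service and parent columns should then be dropped.
This would be the ideal final result: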

+-----------+-----------+-----------+-----------+-----------+-----------+
|description|     parent|    parent1|    parent2|    parent3|    parent4|
+-----------+-----------+-----------+-----------+-----------+-----------+
|  service 1|  service 2|  service 4|  service 3|  service 5|  service 6|
|  service 2|  service 4|  service 3|  service 5|  service 6|       NULL|
|  service 3|  service 5|  service 6|       NULL|       NULL|       NULL|
|  service 5|  service 6|       NULL|       NULL|       NULL|       NULL|
|  service 4|  service 3|  service 5|  service 6|       NULL|       NULL|
+-----------+-----------+-----------+-----------+-----------+-----------+

Based on a previous question, I was able to get this code working for the service and parent columns, but I cannot work the description column into it.

import pyspark.sql.functions as F

i = 0
# Keep joining while the most recently added parent column still holds non-null values
while sdf.filter(F.col(f"parent{i if i > 0 else ''}").isNotNull()).count() > 0:
    sdf = (
        sdf.alias("a1")
        .join(
            sdf.alias("a2").select("service", "parent"),
            F.col(f"a1.parent{i if i > 0 else ''}") == F.col("a2.service"),
            how="left",
        )
        .withColumn(f"parent{i+1}", F.col("a2.parent"))
        .drop(F.col("a2.service"))
        .drop(F.col("a2.parent"))
    )
    i += 1
    display(sdf)

I get the following output:

+-----------+-------+------+-------+
|description|service|parent|parent1|
+-----------+-------+------+-------+
|  service 1|     s1|    s2|     s4|
|  service 2|     s2|    s4|     s3|
|  service 3|     s3|    s5|     s6|
|  service 5|     s5|    s6|   null|
|  service 4|     s4|    s3|     s5|
+-----------+-------+------+-------+
+-----------+-------+------+-------+-------+
|description|service|parent|parent1|parent2|
+-----------+-------+------+-------+-------+
|  service 1|     s1|    s2|     s4|     s3|
|  service 2|     s2|    s4|     s3|     s5|
|  service 3|     s3|    s5|     s6|   null|
|  service 5|     s5|    s6|   null|   null|
|  service 4|     s4|    s3|     s5|     s6|
+-----------+-------+------+-------+-------+
+-----------+-------+------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|
+-----------+-------+------+-------+-------+-------+
|  service 1|     s1|    s2|     s4|     s3|     s5|
|  service 2|     s2|    s4|     s3|     s5|     s6|
|  service 3|     s3|    s5|     s6|   null|   null|
|  service 5|     s5|    s6|   null|   null|   null|
|  service 4|     s4|    s3|     s5|     s6|   null|
+-----------+-------+------+-------+-------+-------+
+-----------+-------+------+-------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|parent4|
+-----------+-------+------+-------+-------+-------+-------+
|  service 1|     s1|    s2|     s4|     s3|     s5|     s6|
|  service 2|     s2|    s4|     s3|     s5|     s6|   null|
|  service 3|     s3|    s5|     s6|   null|   null|   null|
|  service 5|     s5|    s6|   null|   null|   null|   null|
|  service 4|     s4|    s3|     s5|     s6|   null|   null|
+-----------+-------+------+-------+-------+-------+-------+
+-----------+-------+------+-------+-------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|parent4|parent5|
+-----------+-------+------+-------+-------+-------+-------+-------+
|  service 1|     s1|    s2|     s4|     s3|     s5|     s6|   null|
|  service 2|     s2|    s4|     s3|     s5|     s6|   null|   null|
|  service 3|     s3|    s5|     s6|   null|   null|   null|   null|
|  service 5|     s5|    s6|   null|   null|   null|   null|   null|
|  service 4|     s4|    s3|     s5|     s6|   null|   null|   null|
+-----------+-------+------+-------+-------+-------+-------+-------+


pyspark

Source: https://stackoverflow.com/questions/77619613/while-loop-to-iterate-joins-in-pyspark

1 Answer

jslywgbw1#

I adapted this from one of my previous answers:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar can be found at the following location (Files: jar [242 KB]):
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
Requirements: the graphframes Python package, which the code below imports.

Read about motif finding here and here:

https://docs.databricks.com/en/_extras/notebooks/source/graphframes-user-guide-py.html

https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding

from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from graphframes import GraphFrame

# Add the graphframes jar so that we can access the GraphX API of Apache Spark in PySpark.
# Jars can be found at this location: https://spark-packages.org/package/graphframes/graphframes
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars", "file://spark-jars/graphframes-0.8.2-spark3.2-s_2.12.jar") \
    .getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/whatever")
sqlContext = SQLContext(sc)
data = [
       ('service 1', 's1', 's2'),
       ('service 2', 's2', 's4'),
       ('service 3', 's3', 's5'),
       ('service 5', 's5', 's6'),
       ('service 4', 's4', 's3'),
       ('service 6', 's6', 'Leaf')
        ]
initial_df = spark.createDataFrame(data, schema=['description', 'service', 'parent'])
initial_df.show()
### Give each row a unique id so that the results can be aggregated properly later
initial_df = initial_df.withColumn("relationshipId", F.monotonically_increasing_id())
source_vertex_list = initial_df.select(F.col("service").alias("vertices")).distinct()
destination_vertex_list = initial_df.select(F.col("parent").alias("vertices")).distinct()
intermediate_df = source_vertex_list.union(destination_vertex_list).distinct()
final_vertices_df = intermediate_df.withColumn("id", F.col("vertices")).drop("vertices")
print("All vertices list")
final_vertices_df.show(n=1000, truncate=False)
given_edge_list_df = initial_df.select(F.col("service").alias("src"), F.col("parent").alias("dst"))
given_edge_list_df.show(n=1000, truncate=False)
print("All vertices")
final_vertices_df.show(n=1000, truncate=False)
print("All relationships")
given_edge_list_df.show(n=1000, truncate=False)
## Creating a graph representation of the vertices and edges relationship
g = GraphFrame(final_vertices_df, given_edge_list_df)
## Read about motifs here and here.
# https://docs.databricks.com/en/_extras/notebooks/source/graphframes-user-guide-py.html
# https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding
paths = g.bfs("id = 's1'", "id = 'Leaf'")
paths.show()
result = g.connectedComponents()
print("Connected components")
result.select("id", "component").orderBy("component").show()

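In paths you can see the path from the source vertex to the destination vertex. You can collect the intermediate vertices into a list and expand them into columns.

Output: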

+-----------+-------+------+
|description|service|parent|
+-----------+-------+------+
|  service 1|     s1|    s2|
|  service 2|     s2|    s4|
|  service 3|     s3|    s5|
|  service 5|     s5|    s6|
|  service 4|     s4|    s3|
|  service 6|     s6|  Leaf|
+-----------+-------+------+
All vertices list
+----+
|id  |
+----+
|s6  |
|s5  |
|s4  |
|s2  |
|s3  |
|s1  |
|Leaf|
+----+
+---+----+
|src|dst |
+---+----+
|s1 |s2  |
|s2 |s4  |
|s3 |s5  |
|s5 |s6  |
|s4 |s3  |
|s6 |Leaf|
+---+----+
All vertices
+----+
|id  |
+----+
|s6  |
|s5  |
|s4  |
|s2  |
|s3  |
|s1  |
|Leaf|
+----+
All relationships
+---+----+
|src|dst |
+---+----+
|s1 |s2  |
|s2 |s4  |
|s3 |s5  |
|s5 |s6  |
|s4 |s3  |
|s6 |Leaf|
+---+----+
+----+--------+----+--------+----+--------+----+--------+----+--------+----+----------+------+
|from|      e0|  v1|      e1|  v2|      e2|  v3|      e3|  v4|      e4|  v5|        e5|    to|
+----+--------+----+--------+----+--------+----+--------+----+--------+----+----------+------+
|{s1}|{s1, s2}|{s2}|{s2, s4}|{s4}|{s4, s3}|{s3}|{s3, s5}|{s5}|{s5, s6}|{s6}|{s6, Leaf}|{Leaf}|
+----+--------+----+--------+----+--------+----+--------+----+--------+----+----------+------+
Connected components
+----+---------+
|  id|component|
+----+---------+
|  s6|        0|
|  s5|        0|
|  s4|        0|
|  s2|        0|
|  s3|        0|
|  s1|        0|
|Leaf|        0|
+----+---------+
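
The last step mentioned above (collecting the vertices of the path and spreading them into columns) can be sketched in plain Python. This is only an illustration of the logic, not part of the original answer: the function name ancestor_rows and the variables path and descriptions are hypothetical, the path is copied by hand from the bfs output row, and the descriptions come from the input data. It assumes the hierarchy is a single chain ending in the 'Leaf' sentinel, as in the sample data.

```python
# Vertex ids in path order, copied from the bfs output row (from, v1..v5, to).
path = ["s1", "s2", "s4", "s3", "s5", "s6", "Leaf"]

# service id -> description, taken from the input data of the answer.
descriptions = {
    "s1": "service 1",
    "s2": "service 2",
    "s3": "service 3",
    "s4": "service 4",
    "s5": "service 5",
    "s6": "service 6",
}


def ancestor_rows(path, descriptions, sentinel="Leaf"):
    """Return one row per service: its description, then its ancestors' descriptions.

    Rows are padded with None so that they all have the same width.
    """
    chain = [v for v in path if v != sentinel]
    width = len(chain)  # widest row: the first service plus all of its ancestors
    rows = []
    for pos, service in enumerate(chain):
        # Everything after this service on the path is one of its ancestors.
        ancestors = [descriptions[v] for v in chain[pos + 1:]]
        padding = [None] * (width - 1 - len(ancestors))
        rows.append([descriptions[service]] + ancestors + padding)
    return rows


rows = ancestor_rows(path, descriptions)
# Column names in the layout the question asks for: parent, parent1, parent2, ...
columns = ["description", "parent"] + [f"parent{i}" for i in range(1, len(rows[0]) - 1)]

for row in rows:
    print(row)
```

If a DataFrame is needed, rows and columns could then be passed to spark.createDataFrame. Because the sketch treats every vertex after a service on the path as its ancestor, a branching hierarchy would need one path per starting service instead of a single shared path.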
