I have a DataFrame:
data = [('service 1', 's1', 's2'),
        ('service 2', 's2', 's4'),
        ('service 3', 's3', 's5'),
        ('service 5', 's5', 's6'),
        ('service 4', 's4', 's3'),
        ('service 6', 's6', None)]  # this row appears in the output below but was missing from the snippet
sdf = spark.createDataFrame(data, schema=['description', 'service', 'parent'])
sdf.show()
+-----------+-------+------+
|description|service|parent|
+-----------+-------+------+
|  service 1|     s1|    s2|
|  service 2|     s2|    s4|
|  service 3|     s3|    s5|
|  service 5|     s5|    s6|
|  service 4|     s4|    s3|
|  service 6|     s6|  NULL|
+-----------+-------+------+
I want to look up each parent value in the service column and add the matching description; the raw service and parent code columns should then be dropped.
This would be the ideal final result:
+-----------+-----------+-----------+-----------+-----------+-----------+
|description| parent| parent1| parent2| parent3| parent4|
+-----------+-----------+-----------+-----------+-----------+-----------+
| service 1| service 2| service 4| service 3| service 5| service 6|
| service 2| service 4| service 3| service 5| service 6| NULL|
| service 3| service 5| service 6| NULL| NULL| NULL|
| service 5| service 6| NULL| NULL| NULL| NULL|
| service 4| service 3| service 5| service 6| NULL| NULL|
+-----------+-----------+-----------+-----------+-----------+-----------+
From a previous question I already have code that builds the ancestor chain from the service and parent columns, but I have not been able to fold in the description column:
from pyspark.sql import functions as F

# Self-join once per hierarchy level: each pass resolves the next ancestor
# and stores it in a new column (parent1, parent2, ...). The loop stops when
# the newest level is NULL for every row.
i = 0
while sdf.filter(F.col(f"parent{i if i > 0 else ''}").isNotNull()).count() > 0:
    sdf = sdf.alias("a1").join(sdf.alias("a2").select("service", "parent"),
                               F.col(f"a1.parent{i if i > 0 else ''}") == F.col("a2.service"),
                               how="left") \
             .withColumn(f"parent{i+1}", F.col("a2.parent")) \
             .drop(F.col("a2.service")) \
             .drop(F.col("a2.parent"))
    i += 1
    display(sdf)
I get the following output (display() runs once per loop iteration):
+-----------+-------+------+-------+
|description|service|parent|parent1|
+-----------+-------+------+-------+
| service 1| s1| s2| s4|
| service 2| s2| s4| s3|
| service 3| s3| s5| s6|
| service 5| s5| s6| null|
| service 4| s4| s3| s5|
+-----------+-------+------+-------+
+-----------+-------+------+-------+-------+
|description|service|parent|parent1|parent2|
+-----------+-------+------+-------+-------+
| service 1| s1| s2| s4| s3|
| service 2| s2| s4| s3| s5|
| service 3| s3| s5| s6| null|
| service 5| s5| s6| null| null|
| service 4| s4| s3| s5| s6|
+-----------+-------+------+-------+-------+
+-----------+-------+------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|
+-----------+-------+------+-------+-------+-------+
| service 1| s1| s2| s4| s3| s5|
| service 2| s2| s4| s3| s5| s6|
| service 3| s3| s5| s6| null| null|
| service 5| s5| s6| null| null| null|
| service 4| s4| s3| s5| s6| null|
+-----------+-------+------+-------+-------+-------+
+-----------+-------+------+-------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|parent4|
+-----------+-------+------+-------+-------+-------+-------+
| service 1| s1| s2| s4| s3| s5| s6|
| service 2| s2| s4| s3| s5| s6| null|
| service 3| s3| s5| s6| null| null| null|
| service 5| s5| s6| null| null| null| null|
| service 4| s4| s3| s5| s6| null| null|
+-----------+-------+------+-------+-------+-------+-------+
+-----------+-------+------+-------+-------+-------+-------+-------+
|description|service|parent|parent1|parent2|parent3|parent4|parent5|
+-----------+-------+------+-------+-------+-------+-------+-------+
| service 1| s1| s2| s4| s3| s5| s6| null|
| service 2| s2| s4| s3| s5| s6| null| null|
| service 3| s3| s5| s6| null| null| null| null|
| service 5| s5| s6| null| null| null| null| null|
| service 4| s4| s3| s5| s6| null| null| null|
+-----------+-------+------+-------+-------+-------+-------+-------+
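For reference, one direct way to finish from here (my own sketch, not part of the original thread) is to join each code column against a service-to-description lookup and overwrite the codes with descriptions. The names lookup, __code and __desc are hypothetical helpers, and the sketch assumes i still holds the counter left over from the loop above:

# Hypothetical helper: distinct service -> description pairs.
lookup = (sdf.select(F.col("service").alias("__code"),
                     F.col("description").alias("__desc"))
             .distinct())

# Replace the code in parent, parent1, ..., parentN with its description;
# rows whose code is NULL simply keep NULL after the left join.
for col_name in ["parent"] + [f"parent{n}" for n in range(1, i + 1)]:
    sdf = (sdf.join(lookup, F.col(col_name) == F.col("__code"), how="left")
              .withColumn(col_name, F.col("__desc"))
              .drop("__code", "__desc"))

# The question asks for the raw service code column to be dropped as well.
sdf = sdf.drop("service")
display(sdf)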
1 Answer (jslywgbw):
I adapted my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar (~242 KB) can be found at:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
Requirement: the GraphFrames package must be available in your Spark session.
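For example (a sketch on my part, not the answerer's instructions), the package can be attached when the session is created. The coordinates come from the mvnrepository link above, and the extra repository entry is there because GraphFrames is published in the Spark Packages repository:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("service-hierarchy")  # hypothetical app name
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
         .config("spark.jars.repositories",
                 "https://repos.spark-packages.org")
         .getOrCreate())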
Read up on the relevant topics here and here:
https://docs.databricks.com/en/_extras/notebooks/source/graphframes-user-guide-py.html
https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding
In the resulting paths you can see the route from the source vertex to the destination vertex; you can collect the intermediate vertices into a list and explode them out into columns, as in the sketch below.
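The answer's actual code block was lost when this page was scraped, so the following is only my reconstruction of the kind of GraphFrames code the text describes: build a graph whose edges point from each service to its parent, then walk one chain with bfs(). It assumes sdf is the original six-row frame from the top of the question; the id/src/dst column names are the ones GraphFrames requires.

from graphframes import GraphFrame
from pyspark.sql import functions as F

# Vertices: one row per service (GraphFrames requires an "id" column).
vertices = sdf.select(F.col("service").alias("id"), "description")

# Edges: child -> parent links (GraphFrames requires "src" and "dst").
edges = (sdf.where(F.col("parent").isNotNull())
            .select(F.col("service").alias("src"),
                    F.col("parent").alias("dst")))

g = GraphFrame(vertices, edges)

# Shortest path from one service up to the root; 's6' is the root here
# because it is the only service with a NULL parent. bfs() returns the
# columns from, e0, v1, e1, ..., to, where each vertex column is a struct
# carrying (id, description), so the descriptions along the chain can be
# pulled out (e.g. F.col("v1.description")) or collected into an array.
chain = g.bfs(fromExpr="id = 's1'", toExpr="id = 's6'")
chain.show(truncate=False)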