How to find common pairs in a PySpark RDD regardless of their order?

kcrjzv8t asked on 2024-01-06 in Spark

I want to find the pairs of people who have ever dealt with each other. Here is the data:

    Input is
    K -> M, H
    M -> K, E
    H -> F
    B -> T, H
    E -> K, H
    F -> K, H, E
    A -> Z

The expected output is:

    Output:
    K, M  // (this means K has supplied goods to M and M has also supplied some goods to K)
    H, F


Here is the code I wrote.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, SQLContext
    from pyspark.ml.regression import LinearRegression
    import re
    from itertools import combinations

    spark = SparkContext("local", "DoubleRDD")

    def findpairs(ls):
        lst = []
        for i in range(0, len(ls) - 1):
            for j in range(i + 1, len(ls)):
                if ls[i] == tuple(reversed(ls[j])):
                    lst.append(ls[i])
        return lst

    text = spark.textFile("path to the .txt")
    text = text.map(lambda s: s.replace("->", ","))
    text = text.map(lambda s: s.replace(",", ""))
    text = text.map(lambda s: s.replace(" ", ""))
    pairs = text.flatMap(lambda x: [(x[0], y) for y in x[1:]])
    commonpairs = pairs.filter(lambda x: findpairs(x))
    pairs.collect()

The output is: []


5w9g7ksd 1#

Don't use RDDs; this problem can be solved with native Spark DataFrame functions.

    from pyspark.sql import functions as F

    df = spark.read.csv('data.txt', header=False, sep='-> ').toDF('x', 'y')
    # +---+-------+
    # |  x|      y|
    # +---+-------+
    # |  K|   M, H|
    # |  M|   K, E|
    # |  H|      F|
    # |  B|   T, H|
    # |  E|   K, H|
    # |  F|K, H, E|
    # |  A|      Z|
    # +---+-------+
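Note that multi-character separators for spark.read.csv require Spark 3.0+. On older versions, one alternative is to parse the lines from an RDD instead; a minimal sketch, assuming data.txt holds exactly the lines shown above:

    # Sketch for Spark < 3.0 (no multi-character CSV separators):
    # split each line on '-> ' by hand and build the DataFrame from an RDD.
    rdd = spark.sparkContext.textFile('data.txt')
    df = (rdd.map(lambda line: line.split('-> ', 1))
             .map(lambda parts: (parts[0].strip(), parts[1].strip()))
             .toDF(['x', 'y']))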

Split and explode the recipient column (y):

    df1 = df.withColumn('y', F.explode(F.split('y', r',\s+')))
    # +---+---+
    # |  x|  y|
    # +---+---+
    # |  K|  M|
    # |  K|  H|
    # |  M|  K|
    # |  M|  E|
    # |  H|  F|
    # |  B|  T|
    # |  B|  H|
    # |  E|  K|
    # |  E|  H|
    # |  F|  K|
    # |  F|  H|
    # |  F|  E|
    # |  A|  Z|
    # +---+---+


Self-join the DataFrame so that the recipient on the left side is the sender on the right side, then filter it so that the sender on the left equals the recipient on the right:

    df1 = df1.alias('left').join(df1.alias('right'), on=F.expr("left.y == right.x"))
    df1 = df1.filter("left.x == right.y")
    # +---+---+---+---+
    # |  x|  y|  x|  y|
    # +---+---+---+---+
    # |  K|  M|  M|  K|
    # |  M|  K|  K|  M|
    # |  H|  F|  F|  H|
    # |  F|  H|  H|  F|
    # +---+---+---+---+
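As a side note, the join and the filter can also be folded into a single join condition; a minimal sketch, starting from the exploded DataFrame before the reassignment above (the name mutual is just for illustration):

    # Sketch: express both conditions in one join instead of join + filter.
    mutual = df1.alias('left').join(
        df1.alias('right'),
        on=F.expr('left.y = right.x AND left.x = right.y'),
    )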


Remove duplicate combinations of sender and recipient:

    df1 = df1.select('left.*').withColumn('pairs', F.array_sort(F.array('x', 'y')))
    df1 = df1.dropDuplicates(['pairs']).drop('pairs')
    # +---+---+
    # |  x|  y|
    # +---+---+
    # |  H|  F|
    # |  K|  M|
    # +---+---+
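Putting the steps together, a self-contained sketch of this answer (assuming the input file is named data.txt and Spark 3.0+ for the multi-character separator):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('mutual-pairs').getOrCreate()

    # Read 'sender-> recipients' lines into two columns.
    df = spark.read.csv('data.txt', header=False, sep='-> ').toDF('x', 'y')

    # One row per (sender, recipient) edge.
    edges = df.withColumn('y', F.explode(F.split('y', r',\s+')))

    # Keep edges whose reverse edge also exists.
    mutual = (edges.alias('left')
                   .join(edges.alias('right'),
                         on=F.expr('left.y = right.x AND left.x = right.y'))
                   .select('left.*'))

    # Deduplicate (K, M) vs (M, K) by sorting each pair into a canonical order.
    result = (mutual.withColumn('pair', F.array_sort(F.array('x', 'y')))
                    .dropDuplicates(['pair'])
                    .drop('pair'))
    result.show()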

woobm2wo 2#

    text = spark.textFile("PATH TO .txt file")
    text = text.map(lambda s: s.replace("->", ","))
    text = text.map(lambda s: s.replace(",", ""))
    text = text.map(lambda s: s.replace(" ", ""))
    pairs = text.flatMap(lambda x: [(tuple(sorted((x[0], y))), 1) for y in x[1:]]) \
                .groupByKey().mapValues(len)
    cm = pairs.filter(lambda x: x[1] == 2).collect()
    for i in range(0, len(cm)):
        print(cm[i][0])

I wrote the code above and it produces the expected output.

    ('K', 'M')
    ('F', 'H')
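The key idea is that tuple(sorted(...)) normalizes each pair into a canonical order, so (K, M) and (M, K) land on the same key, and a count of 2 means both directions exist. A minimal variant sketch using reduceByKey, which scales better than groupByKey().mapValues(len) because it does not shuffle full value lists (assuming, as above, that each directed pair appears at most once in the input):

    # Sketch: count pair occurrences with reduceByKey instead of groupByKey.
    pairs = (text.flatMap(lambda x: [(tuple(sorted((x[0], y))), 1) for y in x[1:]])
                 .reduceByKey(lambda a, b: a + b))
    mutual = pairs.filter(lambda kv: kv[1] == 2).keys().collect()
    print(mutual)  # e.g. [('K', 'M'), ('F', 'H')]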
