Python: compare DataFrame values with a list of values and get the list values that are missing from the DataFrame

2uluyalo  posted on 2021-07-13  in  Spark

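I need to compare the values in a DataFrame with the values in a list and get the list values that do not exist in the DataFrame.
Could someone please help me?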

my_list = ['682.9', '682.12', '682.11', '682.13', '682.14', '682.15']

Dataframe:

+-----------+
|sheetnumber|
+-----------+
|     682.11|
|     682.12|
|     682.13|
|     682.14|
|     682.15|
|      783.4|
+-----------+

Expected output:

['682.9']

tzdcorbm  1#

You can convert the list to a PySpark DataFrame and do a left anti join against the other DataFrame:

my_list = ['682.9', '682.12', '682.11', '682.13', '682.14', '682.15']
mylist_df = spark.createDataFrame([(n,) for n in my_list], ["sheetnumber"])

result = mylist_df.join(df, ["sheetnumber"], "left_anti")
output = [row.sheetnumber for row in result.collect()]

print(output)

# ['682.9']

7jmck4yq  2#

You can use exceptAll to compare the list (converted to a DataFrame) with the DataFrame:

mylist = ['682.9', '682.12', '682.11', '682.13', '682.14', '682.15']

diff = spark.createDataFrame([[i] for i in mylist]).exceptAll(df)

diff.show()
+-----+
|   _1|
+-----+
|682.9|
+-----+

To get the result as a list, you can do:

result = [r[0] for r in diff.collect()]

# ['682.9']
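
If the list can contain duplicates, exceptAll and a left anti join may give different results: exceptAll is a multiset difference (each DataFrame row cancels at most one matching list entry), while a left anti join drops every list entry that has any match. The following is a plain-Python sketch of those two semantics, not Spark code. The helper names `except_all` and `left_anti` are made up for illustration, and unlike Spark this model preserves input order.

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference: each value in `right` cancels one occurrence in `left`."""
    remaining = Counter(right)
    out = []
    for value in left:
        if remaining[value] > 0:
            remaining[value] -= 1  # this occurrence is cancelled by a row in `right`
        else:
            out.append(value)      # no (remaining) match, so it survives
    return out

def left_anti(left, right):
    """Keep only the values of `left` that have no match at all in `right`."""
    right_set = set(right)
    return [value for value in left if value not in right_set]

# Hypothetical input with '682.12' listed twice
my_list = ['682.9', '682.12', '682.12', '682.11']
sheetnumbers = ['682.11', '682.12', '682.13', '682.14', '682.15', '783.4']

print(except_all(my_list, sheetnumbers))  # ['682.9', '682.12']
print(left_anti(my_list, sheetnumbers))   # ['682.9']
```

With the duplicate-free list from the question, both models return ['682.9'].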

cbwuti44  3#

You can use a filter/where statement with isin to check whether each value is in the list:

my_list = ['682.9', '682.12', '682.11', '682.13', '682.14', '682.15']

df2=df.where(~df.sheetnumber.isin(my_list))
df2.show()

+-----------+
|sheetnumber|
+-----------+
|      783.4|
+-----------+

df3=df.filter(~df.sheetnumber.isin(my_list))
df3.show()

+-----------+
|sheetnumber|
+-----------+
|      783.4|
+-----------+

Here ~ means NOT, so both calls keep the rows whose sheetnumber is not in my_list.
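
Note that this filter runs in the opposite direction from the question: it returns DataFrame values that are missing from the list ('783.4'), not list values that are missing from the DataFrame. To get the latter with the same membership idea, one option is to collect the distinct column values and test each list entry against them. This is a minimal sketch that assumes the column is small enough to collect to the driver. `missing_from` is a hypothetical helper, and the Spark call in the comment shows one way the `existing` values might be obtained.

```python
def missing_from(candidates, existing):
    """Return the candidates that do not appear in `existing`, keeping list order."""
    existing_set = set(existing)
    return [value for value in candidates if value not in existing_set]

# With Spark, the existing values could be collected like this
# (assumes a DataFrame `df` with a string column `sheetnumber`):
#   existing = [row.sheetnumber for row in df.select("sheetnumber").distinct().collect()]
# Here the values from the question are hard-coded so the sketch runs on its own.
existing = ['682.11', '682.12', '682.13', '682.14', '682.15', '783.4']
my_list = ['682.9', '682.12', '682.11', '682.13', '682.14', '682.15']

print(missing_from(my_list, existing))  # ['682.9']
```

For a large column, the join-based answers avoid pulling all values to the driver.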
