python: check whether the elements of a given list are present in an array column of a DataFrame

q3aa0525 posted on 2021-07-09 in Spark


I have the function below, which works on a pandas DataFrame:

def event_list(df,steps):
    df['steps_present'] =  df['labels'].apply(lambda x:all(step in x for step in steps))
    return df

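The DataFrame has a column named labels whose values are lists. The function takes the DataFrame and steps (a list) and returns the DataFrame with a new column, steps_present, which is True when every element of steps appears in that row's labels list.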

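For example, if a value in df['labels'] is [EBBY, ABBY, JULIE, ROBERTS], then `event_list(df, ['EBBY', 'ABBY'])` returns True for that record, because both EBBY and ABBY are in the DataFrame's list column.
I would like to create a similar function in PySpark.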

python apache-spark pyspark apache-spark-sql pandas

Source: https://stackoverflow.com/questions/66912148/to-check-if-elements-in-a-given-list-present-in-array-column-in-dataframe

2 Answers


ugmeyewa1#

You can turn your function into a UDF, perhaps something like the following.

from pyspark.sql.functions import array, lit, udf
from pyspark.sql.types import BooleanType

# Assumes an active SparkSession named `spark`.
values = [(["EBBY", "ABBY", "JULIE", "ROBERTS"],),
          (["EBBY", "ABBY"],)]
columns = ['labels']
df = spark.createDataFrame(values, columns)

# Declare a boolean return type; a bare @udf defaults to StringType.
@udf(returnType=BooleanType())
def event_list(column_to_test, input_values):
    return all(value in column_to_test for value in input_values)

steps = ["EBBY", "JULIE"]
df.withColumn("steps_present", event_list(df['labels'], array([lit(x) for x in steps]))).show(truncate=False)

2021-07-09

tvmytwxo2#

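You can use array_except to check whether every element of the supplied list is present in the labels column. If so, the result of array_except has size 0, and comparing that size with 0 gives the boolean you want.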

import pyspark.sql.functions as F

def event_list(df, steps):
    return df.withColumn(
        'steps_present', 
        F.size(F.array_except(F.array(*[F.lit(l) for l in steps]), 'labels')) == 0
    )

df2 = event_list(df, ["EBBY", "ABBY"])

df2.show(truncate=False)
+----------------------------+-------------+
|labels                      |steps_present|
+----------------------------+-------------+
|[EBBY, ABBY, JULIE, ROBERTS]|true         |
|[EBBY, JULIE]               |false        |
+----------------------------+-------------+
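
The array_except logic above can be sanity-checked without a Spark session. The following is a minimal plain-Python sketch, not Spark code: the helper names `array_except` and `steps_present` are hypothetical stand-ins written for illustration, and null handling is ignored. It assumes the two rows shown in the output table.

```python
def array_except(left, right):
    """Stand-in for Spark's array_except: distinct items of `left` not found in `right`."""
    result = []
    for item in left:
        if item not in right and item not in result:
            result.append(item)
    return result


def steps_present(labels, steps):
    """True when no element of `steps` is missing from `labels`."""
    return len(array_except(steps, labels)) == 0


# Same rows as in the output table above.
rows = [["EBBY", "ABBY", "JULIE", "ROBERTS"], ["EBBY", "JULIE"]]
results = [steps_present(labels, ["EBBY", "ABBY"]) for labels in rows]
print(results)  # [True, False]
```

Note the argument order: the steps go first and the labels second, so anything left over is a step that the row lacks.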

2021-07-09
