python - Filter a DataFrame that has multiple id (PK) columns by passing the values of each id in a list

Posted by cs7cruho on 2021-07-26 in Java


I am trying to filter a DataFrame that has multiple id columns by passing the values for each id in a list.
For example, df:

location_user
transactiontime (string)
user_id (bigint)
location_id (bigint)
Address1 (string)
Address2 (string)
user_name (string)
loc_name (string)

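In the DataFrame above, user_id and location_id are both id columns.
Goal: filter the DataFrame on user_id = [42939, 42940] and location_id = [1468, 1469].
I created separate lists as shown below and applied them in df.filter.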

partition_key =['user_id', 'location_id']
filter_cond = ['[42939,42940]', '[1468,1469]']

---> Works for a single partition key

filter_df=actual_df.filter(~col(partition_key).isin(filter_cond))

I tried the following for the composite partition key, but it does not work and raises the error below.

filter_df=actual_df.filter(~col(partition_key).isInCollection(filter_cond))

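ERROR: Error while overwriting the catalog. Please check if correct parameters are passed. Exception: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
Thanks for any suggestions.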

sql python DataFrame pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/62945498/filter-the-dataframe-of-having-multiple-idpks-columns-by-passing-the-values-of

1 Answer


zpjtge221#

You can achieve this by zipping the column names with their value lists and building the condition as follows:

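from pyspark.sql.functions import expr

partition_key = ['id', 'id2']
filter_cond = [[1, 2], [100, 200]]
# Pair each column name with its value list and build one "col in (...)" clause per column
cond = ' AND '.join([f'{colname} in {tuple(cond)}' for colname, cond in zip(partition_key, filter_cond)])
print(cond)

df.filter(expr(cond)).show()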

# id in (1, 2) AND id2 in (100, 200)
# +---+---+
# | id|id2|
# +---+---+
# |  1|100|
# |  1|200|
# |  2|100|
# |  2|200|
# +---+---+
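As an alternative to building a SQL string, the same filter can be sketched with the Column API. The error in the question arises because `col()` accepts a single column name rather than a list, so one condition per column is needed. The helper names `parse_filter_values` and `filter_by_keys` are hypothetical, and the sketch assumes the filter values arrive as strings such as `'[42939,42940]'`, as in the question.

```python
import ast
from functools import reduce


def parse_filter_values(filter_cond):
    """Turn strings such as '[42939,42940]' into real Python lists."""
    return [list(ast.literal_eval(item)) for item in filter_cond]


def filter_by_keys(df, partition_key, filter_values):
    """Keep rows where every key column matches one of its allowed values."""
    # Imported lazily so the parsing helper above works without PySpark.
    from pyspark.sql.functions import col

    # col() takes one column name at a time, so build one isin() per column.
    conditions = [
        col(name).isin(values)
        for name, values in zip(partition_key, filter_values)
    ]
    # Combine the per-column conditions with a logical AND.
    return df.filter(reduce(lambda left, right: left & right, conditions))


partition_key = ['user_id', 'location_id']
filter_cond = ['[42939,42940]', '[1468,1469]']
filter_values = parse_filter_values(filter_cond)
print(filter_values)
# With a real DataFrame (actual_df in the question):
# filter_df = filter_by_keys(actual_df, partition_key, filter_values)
```

The question's code negates the condition with `~`. To exclude the listed ids rather than keep them, negate either each `isin()` condition or the combined one, depending on the logic intended.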

Update for single-element lists (tuple(cond) would render as (1,), which is not valid SQL):

cond = ' AND '.join([f'{colname} in ({",".join(map(str, cond))})' for colname, cond in zip(partition_key, filter_cond)])
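The single-element case matters because Python renders a one-element tuple with a trailing comma, which a SQL `IN` list does not accept. Below is a minimal sketch of a helper that joins the values explicitly. The name `build_in_condition` is hypothetical, and the sketch assumes numeric ids, since string values would need quoting and escaping.

```python
def build_in_condition(partition_key, filter_cond):
    """Build 'col in (v1,v2) AND ...' from column names and value lists."""
    clauses = []
    for colname, values in zip(partition_key, filter_cond):
        # Join values by hand so a single value gives '(1)', not '(1,)'.
        value_list = ",".join(map(str, values))
        clauses.append(f"{colname} in ({value_list})")
    return " AND ".join(clauses)


print(str(tuple([1])))  # (1,)  <- the trailing comma is what breaks the SQL
print(build_in_condition(['id', 'id2'], [[1], [100, 200]]))
# id in (1) AND id2 in (100,200)
```

The resulting string can be passed to `expr()` in the same way as `cond` in the answer.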

Answered 2021-07-26
