Need to filter a Spark DataFrame using filter values while avoiding iteration over multiple columns

cwdobuhd  posted on 2021-07-13  in  Spark

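I have the following dataset in a Spark DataFrame and need to filter it by these conditions:
Equal to: ID: (6, 7, 8, 9, 13, 15, 16, 18). Not equal to: STATE: (Illinois, Oklahoma), CITY: (Orange, Boca_Raton). Rather than hard-coding these values, I need to iterate over the columns, obtain the filter values as key-value pairs, and filter the DataFrame to get the result df.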
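ID | name      | City            | state
1  | Roseann   | Richmond        | Virginia
3  | Jameson   | Fort_Lauderdale | Florida
4  | Marline   | Washington      | District_Columbia
5  | Ivory     | Macon           | Georgia
6  | Toby      | San_Diego       | California
7  | Isacco    | (unclear)       | California
9  | Lannie    | Peoria          | Oklahoma
10 | Bradley   | Tulsa           | Oklahoma
11 | Teodora   | Pittsburgh      | Pennsylvania
12 | Benedikta | Tampa           | Florida
13 | Zelma     | Newport_News    | California
14 | Carilyn   | Flint           | Michigan
15 | Joey      | Boca_Raton      | California
16 | Pattie    | Boston          | Massachusetts
17 | (unclear) | Bismarck        | North_Dakota
18 | (unclear) | Decatur         | Oklahoma
19 | (unclear) | Phoenix         | Arizona
20 | (unclear) | New_Orleans     | (unclear)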

6yoyoihd


You can use the isin function with a list of values, like this:

import org.apache.spark.sql.functions.col

// Values to keep (id) and values to exclude (STATE, CITY)
val listIDs = Seq(6, 7, 8, 9, 13, 15, 16, 18)
val listStates = Seq("Illinois", "Oklahoma")
val listCities = Seq("Orange", "Boca_Raton")

// One predicate per column, combined with a logical AND
val conditionExpr = Seq(
  col("id").isin(listIDs: _*),
  !col("STATE").isin(listStates: _*),
  !col("CITY").isin(listCities: _*)
).reduce(_ and _)

val df1 = df.filter(conditionExpr)

df1.show()

//+---+------+------------+-------------+
//| id|  NAME|        CITY|        STATE|
//+---+------+------------+-------------+
//|  6|  Toby|   San_Diego|   California|
//| 13| Zelma|Newport_News|   California|
//| 16|Pattie|      Boston|Massachusetts|
//+---+------+------------+-------------+
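
Since the question asks for the filter values as key-value pairs rather than hard-coded lists, the same isin idea can be driven from two maps. What follows is only a minimal sketch under assumptions: the object name MapDrivenFilter, the helper buildCondition, the include/exclude maps, and the four sample rows are all hypothetical and not part of the original answer; only col, lit, isin and filter are actual Spark API calls.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object MapDrivenFilter {
  // Hypothetical helper: turns "column -> allowed values" and
  // "column -> disallowed values" maps into a single Column predicate.
  def buildCondition(include: Map[String, Seq[Any]],
                     exclude: Map[String, Seq[Any]]): Column = {
    val keep = include.map { case (name, values) => col(name).isin(values: _*) }
    val drop = exclude.map { case (name, values) => !col(name).isin(values: _*) }
    // Start from a literal true so that empty maps still give a valid predicate.
    (keep ++ drop).foldLeft(lit(true))(_ && _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("map-driven-filter")
      .getOrCreate()
    import spark.implicits._

    // Small sample taken from rows of the question's data (assumed values).
    val df = Seq(
      (6, "Toby", "San_Diego", "California"),
      (10, "Bradley", "Tulsa", "Oklahoma"),
      (13, "Zelma", "Newport_News", "California"),
      (15, "Joey", "Boca_Raton", "California")
    ).toDF("id", "NAME", "CITY", "STATE")

    // Filter values expressed as key-value pairs instead of separate vals.
    val include: Map[String, Seq[Any]] =
      Map("id" -> Seq(6, 7, 8, 9, 13, 15, 16, 18))
    val exclude: Map[String, Seq[Any]] = Map(
      "STATE" -> Seq("Illinois", "Oklahoma"),
      "CITY" -> Seq("Orange", "Boca_Raton")
    )

    // Keeps ids 6 and 13 from the sample rows above.
    df.filter(buildCondition(include, exclude)).show()

    spark.stop()
  }
}
```

The maps could just as well be loaded from a config file or another DataFrame; the predicate is still built once and handed to a single filter call, so there is no row-by-row iteration on the driver.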
