I have two DataFrames.

DataFrame A:
name  groceries
Mike  apple, orange, banana, noodle, red wine
Kate  white wine, green beans, extra pineapple hawaiian pizza
Leah  red wine, juice, rice, grapes, green beans
Ben   water, spaghetti

DataFrame B:
id    item
0001  red wine
0002  green beans
I iterate over B row by row and use a regex to search DataFrame A for whether each item appears in the groceries:
import re
from pyspark.sql import functions as F

df = None
for keyword in B.select('item').rdd.flatMap(lambda x: x).collect():
    if keyword is None:
        continue
    # Build a pattern requiring every word of the keyword to appear
    # somewhere in the groceries string, in any order.
    pattern = '(?i)^'
    start = '(?=.*\\b'
    end = '\\b)'
    for word in re.split('\\s+', keyword):
        pattern = pattern + start + word + end
    pattern = pattern + '.*$'
    if df is None:
        df = A.filter(A['groceries'].rlike(pattern)).withColumn('item', F.lit(keyword))
    else:
        df = df.unionAll(A.filter(A['groceries'].rlike(pattern)).withColumn('item', F.lit(keyword)))
The output I want is the rows of A whose groceries contain an item from B, with that item keyword also inserted as a new column:
name  groceries                                                item
Mike  apple, orange, banana, noodle, red wine                  red wine
Leah  red wine, juice, rice, grapes, green beans               red wine
Kate  white wine, green beans, extra pineapple hawaiian pizza  green beans
Leah  red wine, juice, rice, grapes, green beans               green beans
The actual output is not what I want, and I don't understand what is incorrect about this approach.
I would also like to know whether there is a way to join A and B directly with rlike, so that a row is joined only when the item in B appears in the groceries in A. Thanks!
A more complex dataset:
test1 = spark.createDataFrame(
    [("Mike", "apple, oranges, red wine"),
     ("Kate", "Whitewine, green beans waterrr, pineapple, red wine"),
     ("Leah", "red wine, juice, rice, grapes, green beans"),
     ("Ben", "Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)")],
    schema=["name", "groceries"])
test2 = spark.createDataFrame(
    [("001", "red wine"),
     ("002", "green beans waterrr"),
     ("003", "the little prince 70th anniversary gift set (book/cd/downloadable audio)")],
    schema=["id", "item"])
test_join = test1.join(test2, F.expr("groceries rlike item"), how='inner')
test_join.show(truncate=False)
+----+---------------------------------------------------+---+-------------------+
|name|groceries |id |item |
+----+---------------------------------------------------+---+-------------------+
|Mike|apple, oranges, red wine |001|red wine |
|Kate|Whitewine, green beans waterrr, pineapple, red wine|001|red wine |
|Kate|Whitewine, green beans waterrr, pineapple, red wine|002|green beans waterrr|
|Leah|red wine, juice, rice, grapes, green beans |001|red wine |
+----+---------------------------------------------------+---+-------------------+
Even though there is an exact keyword match for "the little prince 70th anniversary gift set (book/cd/downloadable audio)", that row still does not show up in the result.
test1 = spark.createDataFrame(
    [("Mike", "apple, oranges, red wine"),
     ("Kate", "Whitewine, green beans waterrr, pineapple, red wine"),
     ("Leah", "red wine, juice, rice, grapes, green beans"),
     ("Ben", "Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)")],
    schema=["name", "groceries"])
test2 = spark.createDataFrame(
    [("001", "red apple"),
     ("002", "green beans waterrr"),
     ("003", "the little prince 70th anniversary gift set (book/cd/downloadable audio)")],
    schema=["id", "item"])
3 Answers
q35jwt9p1#
If I do an rlike with a regex searching for "red apple", like the below:
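(The code block here was lost when the page was scraped; reconstructing the pattern the same way the question's loop builds it, it presumably looked like this sketch, shown with plain `re` for clarity:)

```python
import re

# Reconstructed from the question's loop: for the keyword "red apple",
# each word must appear somewhere in the string, in any order.
keyword = "red apple"
pattern = "(?i)^" + "".join("(?=.*\\b" + w + "\\b)" for w in keyword.split()) + ".*$"

# "red" (inside "red wine") and "apple" both occur, so this matches
# even though "red apple" never appears literally.
print(bool(re.search(pattern, "apple, oranges, red wine")))  # True
```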
rslzwgfq2#
it gives me what I want, since I only want to confirm that all of the words in the item are present in the groceries, even if they are out of order. However, doing the below does not give me the matches above.
Solution:
It manages to do what I wanted, but it still runs slowly, probably because of the join and the use of a udf.
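(The solution code itself was also lost in scraping. Judging from the mention of a join and a udf, it likely wrapped a word-containment check in a udf and used it to filter a join; this is only a sketch of the core predicate, and the name `all_words_present` is my own:)

```python
def all_words_present(groceries, item):
    """True when every whitespace-separated word of `item` occurs as a
    substring of `groceries`, case-insensitively and in any order."""
    if groceries is None or item is None:
        return False
    haystack = groceries.lower()
    return all(word in haystack for word in item.lower().split())

# In Spark this would be registered as a udf and used to filter a cross
# join, e.g. (assuming pyspark is available):
#   check = F.udf(all_words_present, BooleanType())
#   slow_join = test1.crossJoin(test2).filter(check('groceries', 'item'))

print(all_words_present("apple, oranges, red wine", "red apple"))  # True
```

A cross join evaluated through a Python udf forces row-by-row serialization, which would explain the slowness the poster reports.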
xdnvmnnf3#
An rlike join can be achieved with F.expr(). In your case, you need to combine it with an inner join. Try this:
Result:
For the more complex dataset, the contains() function should work:
Result: