I am trying to join two DataFrames in PySpark. My problem is that I want my "inner join" to match rows regardless of nulls. I can see that Scala has the <=> operator, but <=> does not work in PySpark.
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent@email.com'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com')]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com')]).toDF()
Currently working version: userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()
Current result:
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| email|first_name| id|last_name| email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...| Margaret| 2| Peace|marge.peace@email...| Margaret| 2| Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
Expected result:
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| email|first_name| id|last_name| email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
| marge.hh@email.com| null| 3| hh| marge.hh@email.com| null| 3| hh|
|marge.peace@email...| Margaret| 2| Peace|marge.peace@email...| Margaret| 2| Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
2 Answers

Answer 1:
Use another value instead of null:
Answer 2:
For PySpark < 2.3.0, you can still build the <=> operator with an expression column, like this:
For PySpark >= 2.3.0, you can use Column.eqNullSafe or IS NOT DISTINCT FROM, as answered here.