python-3.x Using foreach() in pyspark

Posted by huwehgph on 2023-06-07 in Python


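I have a pyspark DataFrame that contains a column named primary_use.
Here is the first row:

To create a boolean vector indicating whether primary_use in a given row is Education or Office, I use the following code. However, it returns None, which raises an exception: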

def is_included_in(row):
    
    return(row['primary_use'] in ['Education', 'Office'])

building.foreach(is_included_in).show()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-124-03dd626371bf> in <module>
----> 1 building.foreach(is_included_in).show()

AttributeError: 'NoneType' object has no attribute 'show'

Why does this happen, and how can I fix it?

python-3.x

Source: https://stackoverflow.com/questions/59309504/using-foreach-in-pyspark

1 answer


uurity8g1#

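pyspark foreach does not produce a new, transformed DataFrame. It iterates over every record and performs an operation that returns nothing, for example writing to disk or calling an external API. foreach itself returns None, which is why chaining .show() onto it raises the AttributeError.
Also, the function actually calls df.rdd.foreach. The RDD is the lower-level API that underlies the DataFrame. The correct RDD API for transforming each record is RDD.map.
The DataFrame API also lets you run scalar (per-row) user-defined functions. The most recent variant is the pandas UDF.
An isin function like the one you wrote is already part of the standard Spark SQL API, so no custom function is needed here: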
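The difference between the two can be shown without Spark at all. The following is only a plain-Python analogy, not Spark code: foreach_like and map_like are hypothetical helpers written for this illustration, and the sample rows are made up. It sketches why a foreach-style call gives back None while a map-style call gives back the computed values.

```python
from typing import Callable, Iterable, List


def foreach_like(records: Iterable[dict], func: Callable[[dict], object]) -> None:
    """Hypothetical helper: call func on each record for its side effects only."""
    for record in records:
        func(record)  # whatever func returns is thrown away
    # No return statement, so the caller receives None.


def map_like(records: Iterable[dict], func: Callable[[dict], object]) -> List[object]:
    """Hypothetical helper: collect what func returns for each record."""
    return [func(record) for record in records]


def is_included_in(row: dict) -> bool:
    # Same check as in the question.
    return row["primary_use"] in ["Education", "Office"]


# Made-up sample rows standing in for the DataFrame's records.
rows = [
    {"primary_use": "Education"},
    {"primary_use": "Retail"},
    {"primary_use": "Office"},
]

foreach_result = foreach_like(rows, is_included_in)  # return values are discarded
map_result = map_like(rows, is_included_in)          # return values are kept

print(foreach_result)
print(map_result)
```

Calling .show() on foreach_result would fail for the same reason as in the question: there is no object to call it on.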

df = df.withColumn('is_included', df.primary_use.isin(['Education', 'Office']))

Answered on 2023-06-07
