How do I iterate over a column with a for loop and get the values in PySpark?

ovfsdjhp  posted on 2021-07-12  in Spark

I have a DataFrame and want to process the values of a specific column further. How can I get those values in my PySpark code?

for i in range(0, df.count()):
    df_year = df['year'][i]
    print(df_year)

I get output like this:

Column<b'year'>
Column<b'year'>

This is my expected output:

2015
2016

fdbelqdn1#

for row in df.rdd.collect():
    print(row['year'])

fivyi3re2#

If you only want the year column:

for row in df.select("year").rdd.collect():
    print(row['year'])

rkkpypqq3#

You can try this:

>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> sc = SparkContext.getOrCreate()
>>> sql = SQLContext(sc)
>>> df = sql.createDataFrame([(2015, 4), (2016, 5),(2017,6),(2018,7)], ["Year", "Month"])
>>> df.show()
+----+-----+
|Year|Month|
+----+-----+
|2015|    4|
|2016|    5|
|2017|    6|
|2018|    7|
+----+-----+
>>> [x.Year for x in df.select("Year").collect()]
[2015, 2016, 2017, 2018]

oxosxuxt4#

rows = df.collect()  # collect once instead of on every iteration
for i in range(len(rows)):
    df_year = rows[i][1]
    print(df_year)

Here 1 is the zero-based column index.
