PySpark: convert a DataFrame column of month numbers into another DataFrame column of month names

Posted by jdzmm42g on 2021-07-13 in Spark

I am trying to convert a DataFrame column of month numbers into a corresponding column of month names. I tried the following:

df_month_name = df.withColumn('month_name',calendar.month_abbr['MONTH_NUMBER'])

I got this error:

AttributeError: 'function' object has no attribute 'month_abbr'

If there is a better way to do this, please let me know. Thanks!

3hvapo4f  #1

You can use to_date to convert the month number to a date, then use date_format to get the month name:

from pyspark.sql import functions as F

df = spark.createDataFrame([("1",), ("2",), ("3",), ("4",), ("5",)], ["month_number"])

df1 = df.withColumn("month_name", F.date_format(F.to_date("month_number", "MM"), "MMMM")) \
    .withColumn("month_abbr", F.date_format(F.to_date("month_number", "MM"), "MMM"))

df1.show()

# +------------+----------+----------+
# |month_number|month_name|month_abbr|
# +------------+----------+----------+
# |           1|   January|       Jan|
# |           2|  February|       Feb|
# |           3|     March|       Mar|
# |           4|     April|       Apr|
# |           5|       May|       May|
# +------------+----------+----------+

Note that on Spark 3 you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") for this month-number-to-date conversion to work.
You can also use a map column that holds the mapping month_number -> month_abbr:

import calendar
import itertools
from pyspark.sql import functions as F

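# Build a map column {1: 'Jan', ..., 12: 'Dec'}; range(1, 13) includes December
months = F.create_map(*[
    F.lit(m) for m in itertools.chain(*[(x, calendar.month_abbr[x]) for x in range(1, 13)])
])

# month_number is a string column here, so cast it to match the integer map keys
df1 = df.withColumn("month_abbr", months[F.col("month_number").cast("int")])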
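
To see what the map holds, the same flattening can be run in plain Python without Spark. This is only a sketch: pairs, flat, and lookup are illustrative names that do not appear in the answer, and the range here ends at 13 so that December is included.

```python
import calendar
import itertools

# (1, 'Jan'), (2, 'Feb'), ..., (12, 'Dec'); range(1, 13) stops after 12
pairs = [(m, calendar.month_abbr[m]) for m in range(1, 13)]

# create_map expects alternating key, value arguments, so flatten the pairs
flat = list(itertools.chain(*pairs))

# Equivalent plain-Python lookup table, handy for checking the mapping
lookup = dict(pairs)

print(flat[:4])    # [1, 'Jan', 2, 'Feb']
print(lookup[12])  # Dec
```

Each element of flat is what gets wrapped in F.lit before being passed to F.create_map.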

Another approach, using a UDF:

import calendar
from pyspark.sql import functions as F

month_name = F.udf(lambda x: calendar.month_name[int(x)])
month_abbr = F.udf(lambda x: calendar.month_abbr[int(x)])

df1 = df.withColumn("month_name", month_name(F.col("month_number"))) \
    .withColumn("month_abbr", month_abbr(F.col("month_number")))
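
The lambdas above raise an error if month_number is null, non-numeric, or greater than 12. A minimal sketch of a more defensive helper follows, written as plain Python so it can be checked without Spark. safe_month_name is a hypothetical name, and returning None for bad input is an assumption about the desired behavior.

```python
import calendar

def safe_month_name(value, abbreviated=False):
    """Return the month name for 1..12, or None for anything else."""
    try:
        month = int(value)
    except (TypeError, ValueError):
        # None, empty strings, or non-numeric text
        return None
    if not 1 <= month <= 12:
        return None
    names = calendar.month_abbr if abbreviated else calendar.month_name
    return names[month]

print(safe_month_name("3"))                     # March
print(safe_month_name("12", abbreviated=True))  # Dec
print(safe_month_name(None))                    # None
print(safe_month_name("13"))                    # None
```

It could then be wrapped the same way as the lambdas, for example F.udf(safe_month_name), which produces a string column by default.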

n1bvdmb6  #2

If anyone wants to do this in Scala, it can be done as follows:

//Sample Data
val df = Seq(("1"),("2"),("3"),("4"),("5"),("6"),("7"),("8"),("9"),("10"),("11"),("12")).toDF("month_number")

import org.apache.spark.sql.functions._
val df1 = df.withColumn("Month_Abbr",date_format(to_date($"month_number","MM"),"MMM"))
display(df1)

The result has a Month_Abbr column holding the abbreviated month name for each month_number.
