Converting date strings with varying formats and Dutch month abbreviations using to_date in PySpark

e0bqpujr  posted 12 months ago in Spark

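I need to convert date strings to DateType, but I'm running into some problems with to_date.
The day part works fine (1 or 2 digits). The month is a Dutch abbreviation and does not work (it only works when the abbreviation happens to match the English one). The year has 2 or 4 digits (the century is missing!).
What is the best way to convert all of these to DateType?
I could not find an option to set the locale to NL for the format.
I created a UDF, but I don't know whether that is the best way to solve this.
Using 19 for the century is debatable.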

Code:

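from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def convert_date(s):
    # Map Dutch month abbreviations to two-digit month numbers
    month_dict = {"jan":"01", "feb":"02", "mrt":"03", "apr":"04", "mei":"05", "jun":"06", "jul":"07", "aug":"08", "sep":"09", "okt":"10", "nov":"11", "dec":"12" }

    day, month, year = s.split("-")
    if len(day) == 1:
        day = '0' + day      # zero-pad the day
    if len(year) < 4:
        year = '19' + year   # assume the 1900s when the century is missing

    date = day + "-" + month_dict[month] + "-" + year

    return date

# Normalize to dd-MM-yyyy with the UDF first, then parse into DateType
df = df.withColumn('DateOfBirth_new', F.to_date(convert_date(F.col("DateOfBirth")), "dd-MM-yyyy"))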


DataFrame:

df = spark.createDataFrame([
 ["2-feb-1966"],
 ["05-mei-1974"],
 ["3-mrt-83"],
 ["05-mrt-1983"],
 ["12-jun-75"]
]).toDF("DateOfBirth")
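On the debatable century prefix mentioned in the question: one common convention is a pivot, where small two-digit years are read as 20xx and the rest as 19xx. The following is only a minimal sketch of that idea in plain Python. The function name `expand_year` and the `cutoff` value of 25 are assumptions for illustration, not something from the original post, and the right cutoff depends on the data.

```python
def expand_year(year: str, cutoff: int = 25) -> str:
    """Return a 4-digit year string.

    `cutoff` is a hypothetical pivot: two-digit years <= cutoff are
    read as 20xx, all other two-digit years as 19xx.
    """
    if len(year) == 4:
        return year  # century already present, leave untouched
    century = "20" if int(year) <= cutoff else "19"
    return century + year.zfill(2)


# Years taken from the sample data, plus one hypothetical low value ("07")
for raw in ["1966", "83", "75", "07"]:
    print(raw, "->", expand_year(raw))
```

Setting `cutoff` to a negative number reproduces the fixed '19' prefix used in the question's UDF.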

Answer #1 (xtupzzrd)

from pyspark.sql.functions import regexp_replace

month_dict = {"jan":"01", "feb":"02", "mrt":"03", "apr":"04", "mei":"05", "jun":"06", "jul":"07", "aug":"08", "sep":"09", "okt":"10", "nov":"11", "dec":"12" }
# Replace each Dutch month abbreviation in the column with its month number
for key, item in month_dict.items():
    df = df.withColumn('column', regexp_replace('column', key, item))


Answer #2 (v64noz0r)

Here is a similar solution without a UDF, using a when expression for the month conversion.

from pyspark.sql import functions as F

# Month numbers are quoted so they stay strings and keep their leading zero
month_conversion = F.expr("""CASE 
    WHEN (month = 'jan') THEN '01' 
    WHEN (month = 'feb') THEN '02' 
    WHEN (month = 'mrt') THEN '03' 
    WHEN (month = 'apr') THEN '04' 
    WHEN (month = 'mei') THEN '05' 
    WHEN (month = 'jun') THEN '06' 
    WHEN (month = 'jul') THEN '07' 
    WHEN (month = 'aug') THEN '08' 
    WHEN (month = 'sep') THEN '09' 
    WHEN (month = 'okt') THEN '10' 
    WHEN (month = 'nov') THEN '11' 
    WHEN (month = 'dec') THEN '12' 
    ELSE NULL END
    """).alias("m")

day_conversion = F.when(F.length("day") == 1, F.concat(F.lit("0"), F.col("day"))).otherwise(F.col("day"))
year_conversion = F.when(F.length("year") < 4, F.concat(F.lit("19"), F.col("year"))).otherwise(F.col("year"))

(df.withColumn("split",
    F.split("DateOfBirth", "-")
)
 .withColumn("day",
    F.col("split").getItem(0)
)
 .withColumn("month",
    F.col("split").getItem(1)
)
.withColumn("year",
    F.col("split").getItem(2)
)
 .select(
    F.concat_ws("-",
        day_conversion,
        month_conversion,
        year_conversion
    ).alias("DateOfBirth_new")
)
.show())


Answer #3 (jhiyze9q)

In the example below, I used a map expression:

from itertools import chain
from pyspark.sql.functions import col, concat, create_map, length, lit, split, to_date, when

simple_dict = {"jan":"01", "feb":"02", "mrt":"03", "apr":"04", "mei":"05", "jun":"06", "jul":"07", "aug":"08", "sep":"09", "okt":"10", "nov":"11", "dec":"12" }
# Build a Spark map literal from the dict: key1, value1, key2, value2, ...
mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
df = spark.createDataFrame([
 ["2-feb-1966"],
 ["05-mei-1974"],
 ["3-mrt-83"],
 ["05-mrt-1983"],
 ["12-jun-75"]
]).toDF("DateOfBirth")
df = df.withColumn("date", split(col("DateOfBirth"), "-").getItem(0))
df = df.withColumn("month", split(col("DateOfBirth"), "-").getItem(1))
df = df.withColumn("year", split(col("DateOfBirth"), "-").getItem(2))
# Zero-pad one-digit days
df = df.withColumn("date", when(length(col("date")) == 1, concat(lit("0"), col("date"))).otherwise(col("date")))
# Prefix two-digit years with "19" (the year column, not the day)
df = df.withColumn("year", when(length(col("year")) == 2, concat(lit("19"), col("year"))).otherwise(col("year")))
df = df.withColumn('month', mapping_expr[df['month']])
# MM is month-of-year; lowercase mm would mean minute-of-hour
df = df.withColumn("dates", to_date(concat(col("date"), col("month"), col("year")), "ddMMyyyy"))
df = df.drop("year", "month", "date")
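To sanity-check the output of any of the approaches in this thread, it may help to have a plain-Python reference for the expected dates. The sketch below is not taken from any answer. The names `MONTHS` and `parse_dutch_date` are made up for illustration, and it keeps the question's assumption that a two-digit year belongs to the 1900s.

```python
import datetime

# Dutch month abbreviations as used in this thread, mapped to month numbers
MONTHS = {"jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12}


def parse_dutch_date(s: str) -> datetime.date:
    """Parse strings like '3-mrt-83' or '05-mei-1974' into a date."""
    day, month, year = s.strip().lower().split("-")
    if len(year) == 2:
        year = "19" + year  # assumption carried over from the question
    return datetime.date(int(year), MONTHS[month], int(day))


samples = ["2-feb-1966", "05-mei-1974", "3-mrt-83", "05-mrt-1983", "12-jun-75"]
for s in samples:
    print(s, "->", parse_dutch_date(s).isoformat())
```

If a UDF is acceptable, a function like this could probably be wrapped with F.udf(parse_dutch_date, T.DateType()), since a Python UDF may return datetime.date values for a DateType column. Null and malformed inputs are not handled in this sketch.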

