Python: Converting DOB to Age in PySpark

Posted by ddrv8njm on 2023-03-21 in Python

I am trying to derive age from a date-of-birth column in my DataFrame (the column is a string in MM/DD/YYYY format):

from pyspark.sql.functions import to_date, datediff, floor, current_date
from pyspark.sql import functions as F
from pyspark.sql.functions import col

RawData_Combined = RawData_Combined.select(col("DOB"),to_date(col("DOB"),"MM-dd-yyyy").alias("DOBFINAL"))

RawData_Combined = RawData_Combined.withColumn('AgeDOBFinal', (F.months_between(current_date(), F.col('DOBFINAL')) / 12).cast('int'))

But when I run RawData_Combined.show(), I get the following output:

+----------+--------+-----------+
|       DOB|DOBFINAL|AgeDOBFinal|
+----------+--------+-----------+
| 4/17/1989|    null|       null|
| 3/16/1964|    null|       null|
|  1/1/1970|    null|       null|
| 3/30/1967|    null|       null|
|  2/1/1989|    null|       null|
|  1/1/1995|    null|       null|
|      null|    null|       null|
|  1/1/1976|    null|       null|
|      null|    null|       null|
|  1/1/1958|    null|       null|
|  1/1/1960|    null|       null|
|  1/1/1973|    null|       null|
| 5/18/1988|    null|       null|
|      null|    null|       null|
|  3/3/1980|    null|       null|
|  7/3/1988|    null|       null|
|  1/1/1997|    null|       null|
|  1/1/1961|    null|       null|
|10/16/1955|    null|       null|
|  5/5/1982|    null|       null|
+----------+--------+-----------+
only showing top 20 rows
Answer #1, by nhn9ugyo

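This works. The pattern has to use the same "/" separator as the data; the pattern in the question used "-", so every row failed to parse and came back null. A single M or d accepts both one- and two-digit values, which covers rows such as 1/1/1970 (dd may reject single-digit days under the Spark 3 parser):

from pyspark.sql import functions as F

# "M/d/yyyy" matches the "/" separator and unpadded months and days
df.withColumn("DOBFINAL", F.to_date(F.col("DOB"), "M/d/yyyy"))\
.withColumn('AgeDOBFinal', (F.months_between(F.current_date(), F.col('DOBFINAL')) / 12).cast('int'))\
.show()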

Input:

+----------+
|       DOB|
+----------+
| 4/17/1989|
|10/16/1955|
+----------+

Output (the ages depend on the date the code is run):

+----------+----------+-----------+
|       DOB|  DOBFINAL|AgeDOBFinal|
+----------+----------+-----------+
| 4/17/1989|1989-04-17|         33|
|10/16/1955|1955-10-16|         67|
+----------+----------+-----------+
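The ages in this output can be cross-checked without a Spark session. The following is a minimal plain-Python sketch, not part of the original answer: the helper name age_in_years and the reference date 2023-03-21 (the question's posting date) are assumptions made for illustration. It counts completed years, which may differ from months_between(...) / 12 for birthdays that fall on the last day of a month.

```python
from datetime import date, datetime
from typing import Optional


def age_in_years(dob: Optional[str], today: date) -> Optional[int]:
    """Return completed years between an M/D/YYYY date string and `today`.

    Hypothetical helper for illustration; it is not a PySpark function.
    """
    if dob is None:
        return None  # mirror Spark: a null DOB gives a null age
    try:
        # %m and %d accept unpadded values such as "4/17/1989" when parsing
        born = datetime.strptime(dob.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None  # a string that does not match the pattern becomes None
    # Subtract one year if this year's birthday has not happened yet
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


# Assumption: use the question's posting date as "today"
reference_day = date(2023, 3, 21)
for dob in ["4/17/1989", "10/16/1955", None]:
    print(dob, age_in_years(dob, reference_day))
```

With that reference date the two sample rows give 33 and 67, matching the AgeDOBFinal column above. A string with the wrong separator, such as "4-17-1989", returns None, which is analogous to the nulls seen in the question.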

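Reference
'M' or 'L': month number in a year, starting from 1. There is no difference between 'M' and 'L'. Months 1 through 9 are printed without padding.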
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
