I am trying to derive an age from the date-of-birth column in my DataFrame (the column is a string in MM/DD/YYYY format).
from pyspark.sql.functions import to_date, datediff, floor, current_date
from pyspark.sql import functions as F
from pyspark.sql.functions import col
RawData_Combined = RawData_Combined.select(col("DOB"),to_date(col("DOB"),"MM-dd-yyyy").alias("DOBFINAL"))
RawData_Combined = RawData_Combined.withColumn('AgeDOBFinal', (F.months_between(current_date(), F.col('DOBFINAL')) / 12).cast('int'))
But when I run RawData_Combined.show(), I get the following output:
+----------+--------+-----------+
|       DOB|DOBFINAL|AgeDOBFinal|
+----------+--------+-----------+
| 4/17/1989|    null|       null|
| 3/16/1964|    null|       null|
|  1/1/1970|    null|       null|
| 3/30/1967|    null|       null|
|  2/1/1989|    null|       null|
|  1/1/1995|    null|       null|
|      null|    null|       null|
|  1/1/1976|    null|       null|
|      null|    null|       null|
|  1/1/1958|    null|       null|
|  1/1/1960|    null|       null|
|  1/1/1973|    null|       null|
| 5/18/1988|    null|       null|
|      null|    null|       null|
|  3/3/1980|    null|       null|
|  7/3/1988|    null|       null|
|  1/1/1997|    null|       null|
|  1/1/1961|    null|       null|
|10/16/1955|    null|       null|
|  5/5/1982|    null|       null|
+----------+--------+-----------+
only showing top 20 rows
1 Answer
This works:
Input:
Output:
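Reference
'M' or 'L': Month number in a year, starting from 1. There is no difference between 'M' and 'L'. Months from 1 to 9 are printed without padding.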
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html