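I need to parse timestamps in the following format from a Spark column of string type: 2023-11-17T08:28:40.71910 +01:00. When I try to convert it with
df.withColumn("timestamp", f.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSSS Z"))
it returns null. How can I handle this so that I get a Unix timestamp whose decimal part carries the microseconds?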
Answer #1 (nkcskrwz)
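The fractional seconds in the timestamp (".71910") have five digits. With the legacy time parser that the code below enables, the S pattern letters are read as milliseconds, so the fraction should have at most three digits; extra digits can cause parse errors or wrong values. The modified code below trims the fraction to three digits before parsing.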
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").getOrCreate()
# Important: this setting switches Spark to the legacy time parser
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

data1 = [["2023-11-17T08:28:40.71910 +01:00"]]
df1Columns = ["ts"]
df1 = spark.createDataFrame(data=data1, schema=df1Columns)

# Trim the fractional seconds to three digits (milliseconds)
new_df = df1.withColumn("adjusted_ts", F.regexp_replace("ts", "(\\.[0-9]{3})[0-9]*", "$1"))
# Parse the trimmed string; XXX matches an offset written as +01:00
new_df = new_df.withColumn("converted_ts", F.to_timestamp(F.col("adjusted_ts"), "yyyy-MM-dd'T'HH:mm:ss.SSS XXX"))
new_df.show(truncate=False)
Output:
+--------------------------------+------------------------------+-----------------------+
|ts                              |adjusted_ts                   |converted_ts           |
+--------------------------------+------------------------------+-----------------------+
|2023-11-17T08:28:40.71910 +01:00|2023-11-17T08:28:40.719 +01:00|2023-11-17 12:58:40.719|
+--------------------------------+------------------------------+-----------------------+
Note that converted_ts is displayed in the Spark session time zone (apparently UTC+05:30 on the machine that produced this output), which is why 08:28:40 +01:00 shows up as 12:58:40.
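The Spark code above keeps only millisecond precision, while the question asks for a Unix timestamp with microseconds. As a cross-check of what the full-precision value should be for the sample string, here is a minimal plain-Python sketch (not Spark code). The helper name to_unix_micros is hypothetical, and the sketch assumes the input always has a fractional part and a colon-separated UTC offset, as in the question.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_micros(ts: str) -> int:
    """Return microseconds since the Unix epoch for a string such as
    '2023-11-17T08:28:40.71910 +01:00' (hypothetical helper)."""
    # Drop stray spaces, e.g. the one before the UTC offset.
    cleaned = ts.replace(" ", "")
    # Keep at most six fractional digits, because %f accepts one to six.
    cleaned = re.sub(r"(\.\d{6})\d+", r"\1", cleaned)
    parsed = datetime.strptime(cleaned, "%Y-%m-%dT%H:%M:%S.%f%z")
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(microseconds=1)


micros = to_unix_micros("2023-11-17T08:28:40.71910 +01:00")
# Format as seconds with a six-digit decimal part, without going through float.
print(f"{micros // 1_000_000}.{micros % 1_000_000:06d}")  # prints 1700206120.719100
```

Returning an integer count of microseconds rather than a float keeps the value exact; the decimal form is produced only when formatting.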