I have a PySpark DataFrame with a timestamp column, and I want to subtract 1 millisecond from the timestamp. Is there a built-in function in Spark for handling this kind of scenario? Example timestamp value: 2020-07-13 17:29:36
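For reference, the arithmetic being asked for can be sketched in plain Python with `datetime`; the sample value is the one from the question:

```python
from datetime import datetime, timedelta

# The sample value from the question, parsed to a datetime.
ts = datetime.strptime('2020-07-13 17:29:36', '%Y-%m-%d %H:%M:%S')

# Subtracting one millisecond rolls the value back to ...35.999.
result = ts - timedelta(milliseconds=1)
print(result)  # 2020-07-13 17:29:35.999000
```

The answers below show the same subtraction expressed on a Spark column.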
vyswwuz21#
You can also use an interval together with `expr`.
```python
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (1, '2020-07-13 17:29:36')
    ],
    ['id', 'time']
)

df.withColumn(
    'time',
    F.col('time').cast('timestamp')
).withColumn(
    'timediff',
    (
        F.col('time') - F.expr('INTERVAL 1 milliseconds')
    ).cast('timestamp')
).show(truncate=False)
```
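If the offset needs to vary, note that the interval passed to `expr` is just a SQL string, so it can be built with ordinary string formatting. `shift_expr` below is a hypothetical helper, not part of the PySpark API:

```python
# Hypothetical helper: build a Spark SQL snippet that shifts a timestamp
# column back by a variable number of milliseconds, for use with F.expr(...).
def shift_expr(col_name, n_ms):
    return f"{col_name} - INTERVAL {n_ms} milliseconds"

print(shift_expr('time', 1))  # time - INTERVAL 1 milliseconds
```

For example, `df.withColumn('time2', F.expr(shift_expr('time', 5)))` would shift by 5 ms instead of 1.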
r7s23pms2#
You can do it by casting to double type.
```python
import pyspark.sql.functions as f

df = spark.createDataFrame([(1, '2020-07-13 17:29:36')], ['id', 'time'])

df.withColumn('time', f.to_timestamp('time', 'yyyy-MM-dd HH:mm:ss')) \
  .withColumn('timediff', (f.col('time').cast('double') - f.lit(0.001)).cast('timestamp')) \
  .show(10, False)
```

```
+---+-------------------+-----------------------+
|id |time               |timediff               |
+---+-------------------+-----------------------+
|1  |2020-07-13 17:29:36|2020-07-13 17:29:35.999|
+---+-------------------+-----------------------+
```
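This works because casting a Spark timestamp to `double` yields Unix epoch seconds with a fractional part, so subtracting `0.001` moves the instant back one millisecond. The same arithmetic, sketched in plain Python:

```python
from datetime import datetime, timezone

ts = datetime(2020, 7, 13, 17, 29, 36, tzinfo=timezone.utc)

# Epoch seconds as a float, analogous to cast('double') in Spark.
epoch = ts.timestamp()

# Subtract one millisecond in the seconds domain, then convert back.
shifted = datetime.fromtimestamp(epoch - 0.001, tz=timezone.utc)
print(shifted)  # 2020-07-13 17:29:35.999000+00:00
```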
cgh8pdjw3#
You can subtract `INTERVAL 1 milliseconds` using `pyspark.sql.functions.expr`:

```python
from pyspark.sql.functions import expr

df = spark.createDataFrame([('2020-07-13 17:29:36',)], ['time'])
df = df.withColumn('time2', expr("time - INTERVAL 1 milliseconds"))
df.show(truncate=False)
```

```
+-------------------+-----------------------+
|time               |time2                  |
+-------------------+-----------------------+
|2020-07-13 17:29:36|2020-07-13 17:29:35.999|
+-------------------+-----------------------+
```

Even if `time` is a string in this format, Spark will do the implicit conversion for you. Note that the result column is then also a string:

```
df.printSchema()
root
 |-- time: string (nullable = true)
 |-- time2: string (nullable = true)
```