Splitting data into 30-minute intervals: Pyspark

r8xiu3jd posted on 2024-01-06 in Spark

I need to split my data on a 15-minute calendar time-interval basis.
For example, the data looks like this:

  ID | rh_start_time | rh_end_time | total_duration
  5421833835 | 31-12-2023 13:26:53 | 31-12-2023 13:27:03 | 10
  5421833961 | 31-12-2023 13:23:50 | 31-12-2023 13:39:10 | 360

I want to split it into 15-minute intervals, like this:

  ID | rh_start_time | rh_end_time | total_duration | Interval Start
  5421833835 | 31-12-2023 13:26:53 | 31-12-2023 13:27:03 | 10 | 31-12-2023 13:00:00
  5421833961 | 31-12-2023 13:23:50 | 31-12-2023 13:39:10 | 360 | 31-12-2023 13:00:00
  5421833961 | 31-12-2023 13:23:50 | 31-12-2023 13:39:10 | 360 | 31-12-2023 13:30:00


I tried using explode + sequence, but it creates the rows in 15-minute chunks counted from the start time (e.g. 2023-12-31 13:26:53, 2023-12-31 13:41:53), not aligned to the actual calendar:

  intervals.withColumn(
      "rh_interval_start_ts",
      explode(expr("sequence(rh_start_time, rh_end_time, interval 30 minutes)")),
  )
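
A minimal sketch of what calendar alignment requires, assuming a DataFrame `df` with the question's columns (the names `step`, `hour_floor`, and `aligned` are illustrative, not from the question or the answer below): anchor the sequence at a truncated start time and discard marks whose interval ends before the record starts.

  from pyspark.sql import functions as F

  # Anchor the sequence at the top of the hour containing rh_start_time, then
  # keep only marks overlapping [rh_start_time, rh_end_time]. The 30-minute
  # step mirrors the attempt above.
  step = "interval 30 minutes"
  aligned = (
      df
      .withColumn("hour_floor", F.date_trunc("hour", F.col("rh_start_time")))
      .withColumn(
          "rh_interval_start_ts",
          F.explode(F.expr(f"sequence(hour_floor, rh_end_time, {step})")),
      )
      .filter(F.col("rh_interval_start_ts") >= F.col("rh_start_time") - F.expr(step))
  )
  aligned.show(truncate=False)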


6xfqseft #1

One solution is to prepare the intervals and do a join:

  from pyspark.sql import SparkSession
  from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
  from pyspark.sql.functions import col, expr
  from datetime import datetime, timedelta

  spark = SparkSession.builder.appName("example").getOrCreate()

  # Create the example DataFrame:
  schema = StructType([
      StructField("ID", StringType(), True),
      StructField("rh_start_time", TimestampType(), True),
      StructField("rh_end_time", TimestampType(), True),
      StructField("total_duration", IntegerType(), True)
  ])

  def to_ts(date):
      return datetime.strptime(date, "%d-%m-%Y %H:%M:%S")

  data = [
      ("5421833835", to_ts("31-12-2023 13:26:53"), to_ts("31-12-2023 13:27:03"), 10),
      ("5421833961", to_ts("31-12-2023 13:23:50"), to_ts("31-12-2023 13:39:10"), 360)
  ]
  data = spark.createDataFrame(data, schema=schema)
  data.show()

  # Create all interval starts (if necessary, you can derive the min and max from the data):
  start_date = datetime(2023, 12, 31)
  end_date = datetime(2024, 1, 1)
  interval = timedelta(minutes=15)
  timestamps = [start_date + i * interval
                for i in range(int((end_date - start_date).total_seconds() // (15 * 60)) + 1)]
  raw_ts = [(timestamp,) for timestamp in timestamps]
  column = "interval_start"
  intervals = spark.createDataFrame(raw_ts, [column])
  intervals = intervals.withColumn(column, col(column).cast(TimestampType()))
  intervals.show()

  # Join: keep every interval start that overlaps the [rh_start_time, rh_end_time] range.
  result = data.join(intervals, on=(
      (intervals["interval_start"] >= data["rh_start_time"] - expr("INTERVAL 15 MINUTES"))
      & (intervals["interval_start"] <= data["rh_end_time"])
  ))
  result.show()
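
The comment above notes that the bounds could instead be derived from the data. A minimal sketch of that idea, assuming the same `data` DataFrame (the names `bounds`, `lo`, and `hi` are illustrative), using Spark's own sequence/explode to build the interval starts rather than a Python loop:

  from pyspark.sql import functions as F

  # Derive day-aligned bounds from the data, then generate every 15-minute
  # calendar mark between them in a single pass.
  bounds = data.select(
      F.date_trunc("day", F.min("rh_start_time")).alias("lo"),
      F.date_trunc("day", F.max("rh_end_time") + F.expr("INTERVAL 1 DAY")).alias("hi"),
  )
  intervals = bounds.select(
      F.explode(F.expr("sequence(lo, hi, interval 15 minutes)")).alias("interval_start")
  )
  intervals.show()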

Alternatively, you can explode from the floor of the start time:

  from pyspark.sql import functions as F

  data.withColumn(
      'start_floor',
      # Floor the start time to the previous 15-minute boundary via epoch-seconds arithmetic.
      (F.floor(F.col('rh_start_time').cast('integer') / (60 * 15)) * (60 * 15)).cast('timestamp')
  ).withColumn(
      "interval_start",
      # One row per 15-minute calendar mark between the floored start and the end time.
      F.explode(F.expr("sequence(start_floor, rh_end_time, interval 15 minutes)")),
  ).show()
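
Since the floor trick only depends on the interval length (the title mentions 30 minutes, the body 15), it can be wrapped in a small helper. A sketch under that assumption; `add_interval_start` and the `_floor` column are hypothetical names, not from the answer:

  from pyspark.sql import functions as F

  def add_interval_start(df, start_col, end_col, minutes=15):
      # Explode each row into one row per calendar-aligned interval of `minutes`
      # that the [start_col, end_col] range touches.
      step = 60 * minutes
      return (
          df
          .withColumn(
              "_floor",
              (F.floor(F.col(start_col).cast("long") / step) * step).cast("timestamp"),
          )
          .withColumn(
              "interval_start",
              F.explode(F.expr(f"sequence(_floor, {end_col}, interval {minutes} minutes)")),
          )
          .drop("_floor")
      )

  # Usage with the sample DataFrame built earlier:
  add_interval_start(data, "rh_start_time", "rh_end_time", minutes=15).show(truncate=False)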

