Splitting a timestamp interval based on hours in Spark

wgx48brx asked on 2021-05-27 in Spark

I want to split timestamps based on hours in Spark. Input rows (id, timestamp, count):

1,2019-04-01 04:00:21,12
1,2019-04-01 06:01:22,34
1,2019-04-01 09:21:23,10
1,2019-04-01 11:23:09,15
1,2019-04-01 12:02:10,15
1,2019-04-01 15:00:21,10
1,2019-04-01 18:00:22,10
1,2019-04-01 19:30:22,30
1,2019-04-01 20:22:30,30
1,2019-04-01 22:20:30,30
1,2019-04-01 23:59:00,10

On an hourly basis, split each day's timestamps into 4 parts of 6 hours each, then sum the counts. Here I split them like 12 AM to 6 AM, 6 AM to 12 PM, and so on. Expected output:

1,2019-04-01,12
1,2019-04-01,59
1,2019-04-01,25
1,2019-04-01,110

kr98yfug1#

Scala: the answer from the post I linked in my comment works perfectly.

import org.apache.spark.sql.functions._
import spark.implicits._

df.groupBy($"id", window($"time", "6 hours").as("time"))
  .agg(sum("count").as("count"))
  .orderBy("time.start")
  .select($"id", to_date($"time.start").as("time"), $"count")
  .show(false)

+---+----------+-----+
|id |time      |count|
+---+----------+-----+
|1  |2019-04-01|12   |
|1  |2019-04-01|59   |
|1  |2019-04-01|25   |
|1  |2019-04-01|110  |
+---+----------+-----+
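As a plain-Scala sketch (no Spark required) of what `window($"time", "6 hours")` is doing above: each timestamp lands in the 6-hour tumbling window that contains it, and counts are summed per (id, window start). The object and method names below are illustrative, not part of any Spark API.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object WindowSketch {
  // Input rows from the question: (id, timestamp string, count)
  val rows = Seq(
    (1, "2019-04-01 04:00:21", 12), (1, "2019-04-01 06:01:22", 34),
    (1, "2019-04-01 09:21:23", 10), (1, "2019-04-01 11:23:09", 15),
    (1, "2019-04-01 12:02:10", 15), (1, "2019-04-01 15:00:21", 10),
    (1, "2019-04-01 18:00:22", 10), (1, "2019-04-01 19:30:22", 30),
    (1, "2019-04-01 20:22:30", 30), (1, "2019-04-01 22:20:30", 30),
    (1, "2019-04-01 23:59:00", 10)
  )

  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val windowSeconds = 6 * 60 * 60 // 21600, same width as window($"time", "6 hours")

  // Window start = the timestamp's epoch second floored to a 6-hour boundary (UTC).
  def windowStart(ts: String): Long = {
    val epoch = LocalDateTime.parse(ts, fmt).toEpochSecond(ZoneOffset.UTC)
    epoch - (epoch % windowSeconds)
  }

  // Group by (id, window start) and sum counts, ordered by window start.
  def bucketSums: Seq[(Long, Int)] =
    rows.groupBy { case (id, ts, _) => (id, windowStart(ts)) }
        .map { case ((_, start), rs) => (start, rs.map(_._3).sum) }
        .toSeq.sortBy(_._1)
}
```

Running `WindowSketch.bucketSums` reproduces the four per-bucket sums from the expected output (12, 59, 25, 110), which is exactly the grouping Spark's window function performs.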

3zwtqj6y2#

Try this -

Load the test data

import org.apache.spark.sql.functions._
import spark.implicits._

spark.conf.set("spark.sql.session.timeZone", "UTC")
val data =
  """
    |c1,c2,c3
    |1,2019-04-01 04:00:21,12
    |1,2019-04-01 06:01:22,34
    |1,2019-04-01 09:21:23,10
    |1,2019-04-01 11:23:09,15
    |1,2019-04-01 12:02:10,15
    |1,2019-04-01 15:00:21,10
    |1,2019-04-01 18:00:22,10
    |1,2019-04-01 19:30:22,30
    |1,2019-04-01 20:22:30,30
    |1,2019-04-01 22:20:30,30
    |1,2019-04-01 23:59:00,10
  """.stripMargin
val stringDS2 = data.split(System.lineSeparator())
  .map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df2 = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS2)
df2.show(false)
df2.printSchema()
    /**
      * +---+-------------------+---+
      * |c1 |c2                 |c3 |
      * +---+-------------------+---+
      * |1  |2019-03-31 22:30:21|12 |
      * |1  |2019-04-01 00:31:22|34 |
      * |1  |2019-04-01 03:51:23|10 |
      * |1  |2019-04-01 05:53:09|15 |
      * |1  |2019-04-01 06:32:10|15 |
      * |1  |2019-04-01 09:30:21|10 |
      * |1  |2019-04-01 12:30:22|10 |
      * |1  |2019-04-01 14:00:22|30 |
      * |1  |2019-04-01 14:52:30|30 |
      * |1  |2019-04-01 16:50:30|30 |
      * |1  |2019-04-01 18:29:00|10 |
      * +---+-------------------+---+
      *
      * root
      * |-- c1: integer (nullable = true)
      * |-- c2: timestamp (nullable = true)
      * |-- c3: integer (nullable = true)
      */

Truncate the timestamp to 6-hour buckets, then groupBy().sum

val seconds = 21600 // 6 hrs

df2.withColumn("c2_long", expr(s"floor(cast(c2 as long) / $seconds) * $seconds"))
  .groupBy("c1", "c2_long")
  .agg(sum($"c3").as("c3"))
  .withColumn("c2", to_date(to_timestamp($"c2_long")))
  .withColumn("c2_time", to_timestamp($"c2_long"))
  .orderBy("c2")
  .show(false)

    /**
      * +---+----------+---+----------+-------------------+
      * |c1 |c2_long   |c3 |c2        |c2_time            |
      * +---+----------+---+----------+-------------------+
      * |1  |1554055200|12 |2019-03-31|2019-03-31 18:00:00|
      * |1  |1554120000|100|2019-04-01|2019-04-01 12:00:00|
      * |1  |1554076800|59 |2019-04-01|2019-04-01 00:00:00|
      * |1  |1554141600|10 |2019-04-01|2019-04-01 18:00:00|
      * |1  |1554098400|25 |2019-04-01|2019-04-01 06:00:00|
      * +---+----------+---+----------+-------------------+
      */
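The floor arithmetic in this answer can be checked without Spark: `floor(epoch / 21600) * 21600` snaps any epoch second down to the previous 6-hour boundary, which is the `c2_long` value above. A minimal plain-Scala sketch, where the object and method names are illustrative and `to_timestamp` rendering is emulated with java.time in UTC:

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object TruncateSketch {
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val seconds = 21600L // 6 hrs, as in the answer

  // Same arithmetic as expr("floor(cast(c2 as long) / 21600) * 21600"):
  // integer-divide the epoch second by the bucket width, then scale back up.
  def truncate(ts: String): Long = {
    val epoch = LocalDateTime.parse(ts, fmt).toEpochSecond(ZoneOffset.UTC)
    math.floorDiv(epoch, seconds) * seconds
  }

  // Render a truncated epoch back as a UTC timestamp, like to_timestamp($"c2_long").
  def render(epoch: Long): String =
    LocalDateTime.ofEpochSecond(epoch, 0, ZoneOffset.UTC).format(fmt)
}
```

For example, `09:21:23` truncates to the `06:00:00` bucket start and `23:59:00` to `18:00:00`, so all rows in the same 6-hour bucket share one `c2_long` key for the groupBy.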
