PostgreSQL: Get the latest row for each time interval

Posted by 7cjasjjr on 2023-02-12 in PostgreSQL

I have the table below. It is stored as a TimescaleDB hypertable. The data rate is 1 row per second.

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ("time", meter_id)
)

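I want to get the latest row for each interval within a given period of time, for example the latest record of each month over the previous year. The query below works, but it is slow:

[query and plan not preserved in the source]

Fetching the latest row for a single month is instant:

[query and plan not preserved in the source]

Is there a way to run the query above for each month, or for a custom interval? Or is there a different way to speed up the first query?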

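Edit

@O. Jones's answer works great. The query below takes 10 seconds, which is much better but still slower than the manual approach. The index does not seem to make much difference to performance.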

EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
    AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
    AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
Finalize GroupAggregate  (cost=415875.14..415901.64 rows=200 width=16) (actual time=9824.342..9873.903 rows=12 loops=1)
   Group Key: (time_bucket('1 mon'::interval, _hyper_12_65_chunk."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval))
   ->  Gather Merge  (cost=415875.14..415898.14 rows=200 width=16) (actual time=9824.317..9873.872 rows=17 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=414875.13..414875.63 rows=200 width=16) (actual time=9745.705..9745.873 rows=8 loops=2)
               Sort Key: (time_bucket('1 mon'::interval, _hyper_12_65_chunk."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval))
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               ->  Partial HashAggregate  (cost=414864.99..414867.49 rows=200 width=16) (actual time=9745.636..9745.806 rows=8 loops=2)
                     Group Key: time_bucket('1 mon'::interval, _hyper_12_65_chunk."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval)
                     Batches: 1  Memory Usage: 40kB
                     Worker 0:  Batches: 1  Memory Usage: 40kB
                     ->  Result  (cost=0.42..381528.98 rows=6667202 width=16) (actual time=214.980..8603.121 rows=5580737 loops=2)
                           ->  Parallel Append  (cost=0.42..298188.95 rows=6667202 width=8) (actual time=214.801..2386.453 rows=5580737 loops=2)
                                 ->  Parallel Index Only Scan using "65_76_electricity_data_pkey" on _hyper_12_65_chunk  (cost=0.42..15430.83 rows=389456 width=8) (actual time=206.480..354.445 rows=604505 loops=1)
                                       Index Cond: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" < '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
                                       Heap Fetches: 5
(...)
                                 ->  Parallel Index Scan using _hyper_12_79_chunk_meter_time_bucket on _hyper_12_79_chunk  (cost=0.15..15.03 rows=198 width=8) (actual time=0.030..0.185 rows=336 loops=1)
                                       Index Cond: (meter_id = 1)
                                       Filter: (("time" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND ("time" < '2022-12-31 23:00:00+00'::timestamp with time zone))
(...)
 Planning Time: 50.463 ms
 JIT:
   Functions: 451
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
 Execution Time: 9910.058 ms

Answer #1 (rekjcdws)

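I would recommend using the last aggregate together with a continuous aggregate to solve this.
Like the previous poster, I would also recommend an index on (meter_id, "time") rather than the other way around. You can get that in your table definition by changing the order of the keys in the primary key: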

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ( meter_id, "time")
);

But that is a bit off topic. The basic query you want looks like this:

SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'), 
    meter_id, 
    last(electricity_data, "time") 
FROM electricity_data 
GROUP BY 1, 2;

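This is a little confusing until you realize that in PostgreSQL the table itself is also a type. You can therefore pass the whole row to the last aggregate and get back a composite type holding the latest row for the month, the day, or whatever bucket you want.
You then need to treat it as a row again, so you expand it with parentheses and a .*, which is how composite types are expanded in PG.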

SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    meter_id, 
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1,2;

Now, to speed things up, you can turn this into a continuous aggregate, which will make it much faster.

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;

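You will notice that I took meter_id out of the initial select list. It already comes from the composite type, I do not need a redundant column, and a view cannot have two columns with the same name. I did keep meter_id in the GROUP BY, though.
This will speed things up considerably. However, if I were you, I would consider doing this per day and building a hierarchical continuous aggregate on top of it.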

CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month',time_bucket, 'Europe/Amsterdam') as month_bucket,
    (last(last_meter_day, time_bucket)).*
FROM last_meter_day 
GROUP BY 1, meter_id;

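The reason is that we cannot really refresh the monthly continuous aggregate very often. It is much easier to refresh the daily aggregate frequently and then roll it up into the monthly one. You could also keep "only" the daily aggregate and roll it up to months on the fly in your query, since that is only about 30 daily rows per meter per month, but of course that will not perform as well.
You then have to create continuous aggregate policies for these views, depending on what you want to happen on refresh.
Depending on what you are trying to do, I would also suggest looking at counter_agg, as it may be useful to you. I recently wrote a post on our forum about how to use it with electricity meters, which may help depending on how you process this data.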
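
As a rough illustration of the refresh policies mentioned above, a policy on the daily view could be created with TimescaleDB's add_continuous_aggregate_policy function. This is a minimal sketch, not part of the original answer: the three interval values are assumptions chosen only for the example and should be tuned to how late data arrives and how fresh the view needs to be.

```sql
-- Sketch only: refresh last_meter_day once per hour.
-- Each run re-materializes the window from 3 days ago up to 1 hour ago.
-- All three intervals are assumed example values, not recommendations.
SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

A similar policy with a longer schedule would be needed for the monthly view so that it picks up changes from the daily one.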

Answer #2 (ni65a41a)

You could try a subquery that gets the latest timestamp in each bucket, and then join it to the detail table.

SELECT meter_id, MAX("time") "time"
          FROM electricity_data
          WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
            AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
          GROUP BY meter_id, 
                   time_bucket('1 month', "time", 'Europe/Amsterdam')

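This gives you a virtual table containing the latest time for each meter in each bucket (each month, in this case). The following index can speed it up. It has the same columns as the primary key, but in the opposite order. With the columns in this order, the query can be satisfied by a relatively fast index scan.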

CREATE INDEX meter_time ON electricity_data (meter_id, "time")

Then join that to the detail table, like this.

SELECT d.meter_id,
       time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
       d."time",
       d.import_low,
       d.import_normal,
       d.export_low,
       d.export_normal
  FROM electricity_data d
  JOIN (
        SELECT meter_id, MAX("time") "time"
          FROM electricity_data
          WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
            AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
          GROUP BY meter_id, 
                   time_bucket('1 month', "time", 'Europe/Amsterdam')
       ) last ON d."time" = last."time" 
             AND d.meter_id = last.meter_id
 ORDER BY d.meter_id, bucket DESC

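(I am not entirely sure of the syntax in TimescaleDB for columns whose names match keywords such as time, so this is untested.)
If you only need one meter, put a WHERE clause right before the final ORDER BY clause.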
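
For a single meter, the join above might look like the sketch below. The value meter_id = 1 is an assumed example. Repeating the filter inside the subquery is an addition that goes beyond what the answer describes; it is intended to keep the aggregate from scanning other meters' rows. Like the query above, this is untested.

```sql
-- Sketch only: latest row per month for one meter (meter_id = 1 is assumed).
SELECT time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
       d.*
  FROM electricity_data d
  JOIN (
        -- Latest timestamp per monthly bucket, restricted to the one meter.
        SELECT meter_id, MAX("time") AS "time"
          FROM electricity_data
         WHERE meter_id = 1
           AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
           AND "time" <  '2023-01-01T00:00:00 Europe/Amsterdam'
         GROUP BY meter_id,
                  time_bucket('1 month', "time", 'Europe/Amsterdam')
       ) latest ON d."time" = latest."time"
              AND d.meter_id = latest.meter_id
 WHERE d.meter_id = 1
 ORDER BY bucket DESC;
```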

Answer #3 (zd287kbt)

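I think @davidk's suggestion to use a recursive CTE will get me what I want.
Given a start date, an end date, and an arbitrary interval, generating all the intervals is quite simple, and calendar months still work correctly.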

  • Start date: 2022-01-01T00:00 Europe/Amsterdam
  • End date: 2023-01-01T00:00 Europe/Amsterdam
  • Interval: 1 month

WITH RECURSIVE t(n, l, r) AS (
    VALUES (0, NULL::TIMESTAMPTZ, NULL::TIMESTAMPTZ)
    UNION
    SELECT
        n+1,
        (TIMESTAMPTZ '2022-01-01 Europe/Amsterdam') + (INTERVAL '1 month')*n,
        (TIMESTAMPTZ '2022-01-01 Europe/Amsterdam') + (INTERVAL '1 month')*(n+1)
    FROM t
    WHERE TIMESTAMPTZ '2022-01-01 Europe/Amsterdam' + (INTERVAL '1 month'*n) < TIMESTAMPTZ '2023-01-01 Europe/Amsterdam'
)
SELECT n, l, r FROM t;
n  |           l            |           r            
----+------------------------+------------------------
  0 |                        | 
  1 | 2021-12-31 23:00:00+00 | 2022-01-31 23:00:00+00
  2 | 2022-01-31 23:00:00+00 | 2022-02-28 23:00:00+00
  3 | 2022-02-28 23:00:00+00 | 2022-03-31 23:00:00+00
  4 | 2022-03-31 23:00:00+00 | 2022-04-30 23:00:00+00
  5 | 2022-04-30 23:00:00+00 | 2022-05-31 23:00:00+00
  6 | 2022-05-31 23:00:00+00 | 2022-06-30 23:00:00+00
  7 | 2022-06-30 23:00:00+00 | 2022-07-31 23:00:00+00
  8 | 2022-07-31 23:00:00+00 | 2022-08-31 23:00:00+00
  9 | 2022-08-31 23:00:00+00 | 2022-09-30 23:00:00+00
 10 | 2022-09-30 23:00:00+00 | 2022-10-31 23:00:00+00
 11 | 2022-10-31 23:00:00+00 | 2022-11-30 23:00:00+00
 12 | 2022-11-30 23:00:00+00 | 2022-12-31 23:00:00+00
(13 rows)
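
The answer stops at generating the intervals. One way they could be used, sketched here as an assumption rather than as the poster's final query, is a LATERAL join that fetches the newest row inside each interval with ORDER BY ... DESC LIMIT 1. For brevity the sketch builds the bounds with generate_series over plain timestamps and converts them with AT TIME ZONE instead of the recursive CTE. The meter_id = 1 filter and the bounds alias are illustrative.

```sql
-- Sketch only: one "latest row" lookup per interval instead of one big scan.
WITH bounds AS (
    -- Local Amsterdam month starts, converted to timestamptz boundaries.
    SELECT ts AT TIME ZONE 'Europe/Amsterdam'                        AS l,
           (ts + INTERVAL '1 month') AT TIME ZONE 'Europe/Amsterdam' AS r
    FROM generate_series(TIMESTAMP '2022-01-01',
                         TIMESTAMP '2022-12-01',
                         INTERVAL '1 month') AS ts
)
SELECT b.l AS bucket, d.*
FROM bounds b
CROSS JOIN LATERAL (
    -- Newest row inside [l, r) for the assumed meter.
    SELECT *
    FROM electricity_data
    WHERE meter_id = 1
      AND "time" >= b.l
      AND "time" <  b.r
    ORDER BY "time" DESC
    LIMIT 1
) d
ORDER BY b.l;
```

Intervals that contain no rows simply drop out of the result. Whether this matches the speed of the manual single-month query depends on the planner using an index on "time" and meter_id for each lookup, which is worth checking with EXPLAIN ANALYZE.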

This is a quick write-up; I will edit or delete this answer.
