postgresql 为多列选择最近的非空值

ccrfmcuu  于 2024-01-07  发布在  PostgreSQL
关注(0)|答案(2)|浏览(153)

我一直在尝试优化这个查询,希望你们中的一些数据库向导能有一些见解。
使用TimescaleDB作为我的数据库,我有一个包含传感器数据的 * 宽表 *,它看起来像下面这样:
| 时间|传感器ID|风速|风向|
| --|--|--|--|
| 2023-12-18 12:15:00|第一个|NULL| 176 |
| 2023-12-18 12:13:00|第一个| 4 | 177 |
| 2023-12-18 12:11:00|第一个| 3 |NULL|
| 2023-12-18 12:09:00|第一个| 8 | 179 |
我想写一个查询,它为一组列提供最近的非空值,在sensor_id上过滤。对于上述数据(在sensor_id上过滤1),此查询应返回
| 风速|风向|
| --|--|
| 4 | 176 |
话虽如此,我的查询看起来像下面这样(当以10的批量查询sensor_id s时):

SELECT
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction,

    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_two,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_two,
    .
    .
    .
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_ten,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_ten;

字符串
我正在查询的表有1,000个唯一的sensor_id s,所有这些都以2分钟的间隔报告数据。因此,我们正在谈论数百万行
我已经在(sensor_id, time DESC)上创建了一个索引,以进一步优化查询。使用该索引,此查询的计划和执行时间分别约为400 ms50 ms
如何编写不同的查询(或添加索引)以实现最佳的计划和执行时间?

dba5bblo

dba5bblo1#

不幸的是,Posterre还没有为窗口函数实现IGNORE NULLS(截至第16页),这将允许对每个值列调用first_value()。请参阅:

  • 如何使用first_value()窗口函数获取最后一个非空值?
  • 在稀疏表中选择包含第一个非空值的行

解决方案

fiddle
有各种更短和可能(更)快的选择。
您至少应该在(ts)上有一个(部分)索引。可能在(sensor_id, ts)上。或更多。请参阅下文。所有内容取决于未公开的详细信息。
我发现timestamp列的名称“time”有误导性。请改用“ts”。

first_value() + DISTINCT ON

更短的插入式替代品。

SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed    ) OVER (w ORDER BY wind_speed     IS NULL, ts DESC) AS wind_speed
     , first_value(wind_direction) OVER (w ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
--   , ... more?
FROM   sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
WINDOW w AS (PARTITION BY sensor_id);

字符串
关于DISTINCT ON

  • 是否选择每个GROUP BY组中的第一行?

子查询中的count()窗口函数+主查询中的筛选聚合

SELECT sensor_id
     , min(wind_speed)     FILTER (WHERE ws_ct = 1) AS wind_speed
     , min(wind_direction) FILTER (WHERE wd_ct = 1) AS wind_direction
--   , ... more?
FROM  (
   SELECT *
        , count(wind_speed)     OVER w AS ws_ct
        , count(wind_direction) OVER w AS wd_ct
   --   ,  ... more?
   FROM   sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   WINDOW w AS (PARTITION BY sensor_id ORDER BY ts DESC)
   ) sub
GROUP  BY sensor_id;


请参阅:

基于“传感器”表更简单

如果您也有一个表“sensor”,其中每个相关的sensor_id对应一行(您可能应该这样做),则会变得更简单:

SELECT sensor_id
    , (SELECT wind_speed     FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_speed     IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_speed
    , (SELECT wind_direction FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_direction IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_direction
--  , ... more?
FROM   sensor s
    , (SELECT LOCALTIMESTAMP - interval '7 days') t(ts_min)
;


最后一个查询(与您的详细原始查询一样)可以使用自定义索引。理想情况下,使用部分索引-每个传感器有许多行,有几个值列,有许多空值和有许多过时的行。

CREATE INDEX sensor_data_wind_speed_idx     ON sensor_data (sensor_id, ts DESC, wind_speed)
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!

CREATE INDEX sensor_data_wind_direction_idx ON sensor_data (sensor_id, ts DESC, wind_direction)
WHERE  wind_direction IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!


在创建时使用一个过去一周的常量。索引的大小会随着时间的推移而增长,但仍然适用。不时地使用稍后的截止时间重新创建索引,以保持大小不受影响。(不过,不确定时间戳绑定是否为您的超表买单。普通索引可能就足够好了。我的想法是普通Posterre。)
然后运行相同的查询,但使用常量时间戳:

SELECT ...
FROM   sensor s
    , (SELECT timestamp '2023-12-12 03:47:16') t(ts_min)  -- MUST be a constant to use partial index!
;

已排序的子查询+first()集合函数

如果索引支持不是一个选项或效率不高,最方便的查询是使用聚合函数first()-如果使用附加模块first_last_agg的C版本,可能也是最快的。请参阅:

  • 从每组的第一行和最后一行获取值

每个数据库需要一次:

CREATE EXTENSION first_last_agg;
SELECT sensor_id
     , first(wind_speed    ) FILTER (WHERE wind_speed IS NOT NULL)     AS wind_speed
     , first(wind_direction) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
--   , ... more?
FROM   (
   SELECT * FROM sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   ORDER  BY sensor_id, ts DESC
   ) s
GROUP  BY 1;

的字符串

hm2xizp9

hm2xizp92#

扩展您从@ErwinBrandstetter提供的惊人列表中选择的解决方案
因为您使用的是TimescaleDB,所以实际上并不需要first_last_agg扩展,因为您已经有了一个(略有不同的)first agg。
这个查询实际上可以简化为:

SELECT sensor_id,
      last(wind_speed,ts) FILTER (WHERE wind_speed IS NOT NULL) AS wind_speed
    , last(wind_direction,ts) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
FROM  sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
GROUP BY 1;

字符串
根据你的反馈,你最初的计划时间是400毫秒,我想知道你的Timescale hypertable有多少块?我想你可能可以在这里优化!
另一个优化的途径是压缩这些数据。当我做了一个测试,我把我的数据所需的存储空间降低了8倍,我的查询速度(对于上面的查询)降低了3倍。
我通过sensor_id压缩分段,并通过时间DESC,wind_speed,wind_direction排序。

相关问题