PostgreSQL -通过距离阈值压缩相邻行的性能

ddarikpa 于 2023-11-18 发布在 PostgreSQL

关注(0)|答案(1)|浏览(105)

上下文

下面是我的问题的简化版本
我们得到了一个名为positions的表，它存储了一个项目在几个容器中的移动。
每个记录包含

容器的名称（我们称之为container）
两个日期时间属性，分别为date_from和date_to，包含项目进入和离开容器的时间戳

在两个连续的记录之间可能存在“时间间隙”。即，该项目在容器A中直到上午10点，然后它在下午4点出现在容器B中，其间没有任何东西。
以下是一个示例数据集
| ID| container个|date_from个|date_to个|
| --|--|--|--|
| 1 |一|2023-10-01T00：00：00| 2023-10-01T10：00：00|
| 2 |一|2023-10-03T09：00：00| 2023-10-03T11：00：00|
| 3 |B| 2023-10-04T02：00：00| 2023-10-04T03：00：00|
| 4 |C| 2023-10-04T06：00：00| 2023-10-04T08：00：00|
| 5 |C| 2023-10-05T00：00：00| 2023-10-06T10：00：00|
| 6 |一|2023-10-06T11：00：00| 2023-10-06T20：00：00|
| 7 |C| 2023-10-06T21：00：00| 2023-10-07T10：00：00|

需求

我需要压缩所有相邻的位置，
1.在同一个容器中（在子序列中，项永远不会离开该容器）
1.并且彼此“足够接近”：即，对于第二位置的date_from，在距前一位置的date_to的特定时间阈值内。
对于我压缩的每个子序列，我需要取date_from的第一个值和date_to的最后一个值，并将它们放在同一结果行中。
例如，如果容器A中有5个连续的记录，并且根据规则它们足够接近以被压扁，那么我压扁这些位置的最后一行将具有

container = A
date_from取自我压扁的5个位置中的第一个
date_to取自5个位置中的最后一个

我写的PostgreSQL查询

WITH with_next_position AS (
      SELECT
        id,
        container,
        date_from,
        date_to,
        (
          SELECT subquery.id
          FROM positions subquery
          WHERE subquery.date_from > base.date_from
          ORDER BY subquery.date_from ASC
          LIMIT 1
        ) AS next_position_id

      FROM positions
    ),

    with_time_lapse AS (
      SELECT
        with_next_position.date_from AS date_from,
        with_next_position.date_to AS date_from,
        with_next_position.container AS container,
        CASE
        WHEN join_table.date_from IS NOT NULL
          THEN EXTRACT(EPOCH FROM (join_table.date_from - with_next_position.date_to))
        ELSE
          NULL
        END AS time_lapse,
        join_table.marina_id AS next_container

      FROM
        with_next_position
        FULL OUTER JOIN with_next_position join_table ON join_table.id = with_next_position.next_position_id

      WHERE
        with_next_position.container IS NOT NULL
    ),

    with_marked_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        CASE
        WHEN next_container = container AND time_lapse <= 10000000 # This is where I put the threshold
          THEN TRUE
        ELSE
          FALSE
        END AS to_squash

      FROM with_time_lapse
    )

    with_marked_first_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        CASE
        WHEN to_squash
          THEN (
            SELECT CASE WHEN to_squash THEN FALSE ELSE TRUE END
            FROM with_marked_to_squash subquery
            WHERE subquery.date_from < with_marked_to_squash.date_from
            ORDER BY subquery.date_from DESC
            LIMIT 1
          )
        ELSE
          FALSE
        END AS first_to_squash

      FROM with_marked_to_squash
    ),

    with_first_to_squash AS (
      SELECT
        date_from,
        date_to,
        container,
        (
          SELECT subquery.date_from
          FROM with_marked_first_to_squash subquery
          WHERE subquery.date_from < with_marked_first_to_squash.date_from AND first_to_squash IS TRUE
          ORDER BY subquery.date_from DESC
          LIMIT 1
        ) AS first_date_in_position

      FROM with_marked_first_to_squash

      WHERE to_squash IS FALSE
    )

    SELECT
      COALESCE(first_date_in_position, date_from) AS date_from,
      date_to,
      container
      EXTRACT(EPOCH FROM (date_to - COALESCE(first_date_in_position, date_from))) AS time_spent

    FROM with_first_to_squash

    ORDER BY date_from

字符串

性能问题

上面的查询是正确的，它做了我期望它做的事情。然而，在提取子查询with_first_to_squash时出现了性能问题。如果我将查询切割到with_first_to_squash之前，性能将呈指数级提高。
我认为性能问题的原因是，通过连续运行with_marked_first_to_squash和with_first_to_squash，我使数据库引擎经历了两个嵌套循环：

首先，我们将那些已经被标记为“to_squash”的位置标记为“first_to_squash”，并且这些位置是同类中的第一个（即，前一个位置没有被标记为“to_squash”）：这是通过内联子查询（在with_marked_first_to_squash的定义中）来完成的。
其次，我们只选择不会被压扁的位置（即每个相邻子序列中的最后一个），并且对于它们中的每一个，我们运行一个子查询，该子查询“返回”直到过去被标记为“first_to_squash”的第一个位置：一旦找到该位置，我们就使用它来检索date_from

在我删除第二子查询的那一刻，事情变得非常迅速。
我相信有一个解决方案可以从子序列的第一个位置提取date_from，可能涉及到分区，但我不熟悉分区和它们的语法。有没有人可以给予提示？

postgresql

来源：https://stackoverflow.com/questions/77391036/postgresql-performance-of-squashing-adjacent-rows-by-a-distance-threshold

1条答案

按热度按时间

cyej8jka1#

我怀疑select列表中的子查询是导致性能下降的原因。
请尝试以下窗口函数解决您的缺口和孤岛问题，因为它只需要排序一次：

with squashes as (
  select *,
         case
           when     container = lag(container) over w
                and date_from - lag(date_to) over w <= interval '5 days' then false 
           else true
         end as keep_me
    from positions
  window w as (order by date_from)
), islands as (
  select *, sum(keep_me::int) over (order by date_from) as group_num
    from squashes
)
select container, min(date_from) as date_from, max(date_to) as date_to
  from islands
 group by group_num, container
 order by group_num;

字符串
工作fiddle

赞(0）回复(0）举报 2023-11-18

我来回答

PostgreSQL -通过距离阈值压缩相邻行的性能

上下文

需求

我写的PostgreSQL查询

性能问题

1条答案

相关问题

热门标签

最新问答