mysql 线图的下采样时间序列等距点

eiee3dmh  于 2022-12-28  发布在  Mysql
关注(0)|答案(2)|浏览(107)

我有一个大约有75K点的大型数据库,这个数据库用以下格式表示与国家相关的时间序列:

|country | value | date |

我使用fastapi API在我的前端提供这些数据,因为我使用的图表库不会使用超过300个点,所以我想对时间序列进行下采样,这样我就有300个等间距的点,包括每个国家的第一个点和最后一个点。
到目前为止,我已经提出了这个SQL解决方案:

WITH
    E AS (
        SELECT *,
        PERCENT_RANK() OVER w as rw
        FROM my_table
        WINDOW w AS (PARTITION BY country ORDER BY date)
    )
SELECT *, FLOOR(300*rw) as sel FROM E GROUP BY E.country, sel;

我想知道是否有更好的方法来做到这一点/更优化的方式?
我已经在使用索引了,我不确定SQL数据库上的分区是否有任何影响

示例

如果我将每个国家/地区的样本缩减至5个点,我将从以下位置开始:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-04 |
| ar      |   1.3  | 2010-02-05 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.2  | 2010-02-07 |
| ar      |   1.4  | 2010-02-08 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.7  | 2010-02-10 |
| ar      |   1.4  | 2010-02-11 |
| ar      |   1.6  | 2010-02-12 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.5  | 2010-02-14 |
| ar      |   1.3  | 2010-02-15 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.3  | 2010-02-04 |
| fr      |   1.4  | 2010-02-05 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.9  | 2010-02-07 |
| fr      |   1.3  | 2010-02-08 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.2  | 2010-02-10 |
| fr      |   1.3  | 2010-02-11 |
| fr      |   1.5  | 2010-02-12 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-14 |
| fr      |   1.5  | 2010-02-15 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

对此:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

编辑2

有了@里克·詹姆斯的回答,我想到了这个:

SELECT country, FLOOR(300*(TO_DAYS(date) - x.da) /(x.dd - x.da)) as g, date, value
    FROM table
    JOIN ( SELECT TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) as dd, country as cc
              FROM table GROUP BY country
         ) AS x
         ON table.country = x.cc
    GROUP BY country, g;

我知道这样更快,但由于我不是Maven,我不知道这是否是我能得到的最好结果/这是否是一种荒谬的方法

krugob8w

krugob8w1#

WITH
  sorted AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY country
          ORDER BY date     -- choose a different column name!
    ) - 1
      AS row_id,
    COUNT(*) OVER (
      PARTITION BY country
    ) - 1
      AS step
  FROM
    my_table
)
SELECT
  sorted.*
FROM
  sorted
WHERE
  mod( row_id             * 4, step)
  <
  mod((row_id + step - 1) * 4, step)
ORDER BY
  country, row_id

硬编码的* 4应为* (sample_size - 1)
演示:https://dbfiddle.uk/llmch7nm

编辑:

当起始数据集大小小于目标样本大小时,上述情况表现不佳。
这个演示有两个不同的fudges来解决这个问题:https://dbfiddle.uk/J0Nb_CNb

xxhby3vn

xxhby3vn2#

我将开发一个公式,将日期Map为一个数字,这样MIN(日期)Map为0.5,MAX(日期)Map为299.5。

SELECT country, ROUND(...that formula...), AVG(value)
    FROM t
    GROUP BY 1,2;

注意,AVG将有效地平滑曲线,而不是挑选特定的条目(可能特别高或特别低)。
把碎片拼在一起:

SELECT country,
       ROUND((TO_DAYS(date) - x.da) / x.span),
       AVG(value)
    FROM my_table
    JOIN ( SELECT TO_DAYS(MIN(date)) AS da,
                  ROUND((TO_DAYS(MAX(date)) -
                         TO_DAYS(MIN(date))) / 6) AS span
              FROM my_table
         ) AS x
    GROUP BY 1,2;

相关问题