我有一个大约有75K点的大型数据库,这个数据库用以下格式表示与国家相关的时间序列:
|country | value | date |
我使用fastapi API在我的前端提供这些数据,因为我使用的图表库不会使用超过300个点,所以我想对时间序列进行下采样,这样我就有300个等间距的点,包括每个国家的第一个点和最后一个点。
到目前为止,我已经提出了这个SQL解决方案:
WITH
E AS (
SELECT *,
PERCENT_RANK() OVER w as rw
FROM my_table
WINDOW w AS (PARTITION BY country ORDER BY date)
)
SELECT *, FLOOR(300*rw) as sel FROM E GROUP BY E.country, sel;
我想知道是否有更好的方法来做到这一点/更优化的方式?
我已经在使用索引了,我不确定SQL数据库上的分区是否有任何影响
示例
如果我将每个国家/地区的样本缩减至5个点,我将从以下位置开始:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-04 |
| ar | 1.3 | 2010-02-05 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.2 | 2010-02-07 |
| ar | 1.4 | 2010-02-08 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.7 | 2010-02-10 |
| ar | 1.4 | 2010-02-11 |
| ar | 1.6 | 2010-02-12 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.5 | 2010-02-14 |
| ar | 1.3 | 2010-02-15 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.3 | 2010-02-04 |
| fr | 1.4 | 2010-02-05 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.9 | 2010-02-07 |
| fr | 1.3 | 2010-02-08 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.2 | 2010-02-10 |
| fr | 1.3 | 2010-02-11 |
| fr | 1.5 | 2010-02-12 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-14 |
| fr | 1.5 | 2010-02-15 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
对此:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
编辑2
有了@里克·詹姆斯的回答,我想到了这个:
SELECT country, FLOOR(300*(TO_DAYS(date) - x.da) /(x.dd - x.da)) as g, date, value
FROM table
JOIN ( SELECT TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) as dd, country as cc
FROM table GROUP BY country
) AS x
ON table.country = x.cc
GROUP BY country, g;
我知道这样更快,但由于我不是Maven,我不知道这是否是我能得到的最好结果/这是否是一种荒谬的方法
2条答案
按热度按时间krugob8w1#
硬编码的
* 4
应为* (sample_size - 1)
演示:https://dbfiddle.uk/llmch7nm
编辑:
当起始数据集大小小于目标样本大小时,上述情况表现不佳。
这个演示有两个不同的fudges来解决这个问题:https://dbfiddle.uk/J0Nb_CNb
xxhby3vn2#
我将开发一个公式,将日期Map为一个数字,这样MIN(日期)Map为0.5,MAX(日期)Map为299.5。
注意,
AVG
将有效地平滑曲线,而不是挑选特定的条目(可能特别高或特别低)。把碎片拼在一起: