I'm trying to write a stored procedure for selecting X amount of well spread points in time from a big table.
I have a table points
:
"Userid" integer
, "Time" timestamp with time zone
, "Value" integer
It contains hundreds of millions of records. And about a million of records per each user.
I want to select X points (lets say 50), which all well spread from time A to time B. The problem is that the points are not spread equally (if one point is in 6:00:00, the next point may be after 15 seconds, 20, or 4 minutes for example).
Selection all the points for an id can take up to 60 seconds (because there are about a million points).
Is there any way to select the exact amount of points I desire, as much well spread as possible, in a fast way?
Sample data:
+--------+---------------------+-------+
| UserId | Time | Value |
+--------+---------------------+-------+
1 | 1 | 2017-04-10 14:00:00 | 1 |
2 | 1 | 2017-04-10 14:00:10 | 10 |
3 | 1 | 2017-04-10 14:00:20 | 32 |
4 | 1 | 2017-04-10 14:00:35 | 80 |
5 | 1 | 2017-04-10 14:00:58 | 101 |
6 | 1 | 2017-04-10 14:01:00 | 203 |
7 | 1 | 2017-04-10 14:01:30 | 204 |
8 | 1 | 2017-04-10 14:01:40 | 205 |
9 | 1 | 2017-04-10 14:02:02 | 32 |
10 | 1 | 2017-04-10 14:02:15 | 7 |
11 | 1 | 2017-04-10 14:02:30 | 900 |
12 | 1 | 2017-04-10 14:02:45 | 22 |
13 | 1 | 2017-04-10 14:03:00 | 34 |
14 | 1 | 2017-04-10 14:03:30 | 54 |
15 | 1 | 2017-04-10 14:04:00 | 54 |
16 | 1 | 2017-04-10 14:06:00 | 60 |
17 | 1 | 2017-04-10 14:07:20 | 654 |
18 | 1 | 2017-04-10 14:07:40 | 32 |
19 | 1 | 2017-04-10 14:08:00 | 33 |
20 | 1 | 2017-04-10 14:08:12 | 32 |
21 | 1 | 2017-04-10 14:10:00 | 8 |
+--------+---------------------+-------+
I want to select 11 "best" points from the list above, for the user with Id 1, from time 2017-04-10 14:00:00 to 2017-04-10 14:10:00.
Currently its done on the server, after selecting all the points for the user. I calculate the "best times" by dividing the difference in times and get a list such as: 14:00:00,14:01:00,....14:10:00 (11 "best times", as the amount of points). Than I look for the closest point for each "best time", that not have been selected yet. The result will be points: 1, 6, 9, 13, 15, 16, 17, 18, 19, 20, 21
Edit:
I'm trying something like this:
SELECT * FROM "points"
WHERE "Userid" = 1 AND
(("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:00:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:01:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
LIMIT 1)))
The problems here are that:
A) It doesn't take in account points that already have been selected.
B) Because of the ORDER BY
, each additional time increases the running time of the query by ~ 1 second, and for 50 points I get back to the 1 minute mark.
2条答案
按热度按时间vq8itlhq1#
There is an optimization problem behind your question that's hard to solve with just SQL.
That said, your attempt of an approximation can be implemented to use an index and show good performance irregardless of table size. You need this index if you don't have it already:
Query:
Most importantly, the rewritten query can use above index and will be very fast, solving problem B).
It also addresses problem A) to some extent as no point is returned more than once. If there is no row between two adjacent points in the grid, you get no row in the result. Using
LEFT JOIN .. ON true
keeps all grid rows and appends NULL in this case. Eliminate those NULL rows by switching toCROSS JOIN
. You may get fewer result rows this way.I am only search ahead of each grid point. You might append a second
LATERAL
join to also search behind each grid point (just another index-scan), and take the closer one of the two results (ignoring NULL). But that introduces two problems:LATERAL
joins with two index scans.You could use a recursive CTE to search 1 minute ahead of the last time actually found, but then the total number of rows returned varies even more.
It all comes down to an exact definition of what you need, and where compromises are allowed.
Related:
cbjzeqam2#
答案使用
generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval)
和join
进行比较。为他人保存数据集时间:
坦白地说,我不明白这个要求。这是我从描述中得到的,结果与OP预期的不同:
我添加了abs列以证明为什么我认为这些行更适合请求