postgresql: selecting well-spread points from a big table

ar5n3qh5  posted on 2022-12-12  in PostgreSQL

I'm trying to write a stored procedure that selects X well-spread points in time from a big table.
I have a table points:

"Userid" integer
, "Time"   timestamp with time zone
, "Value"  integer

It contains hundreds of millions of records, with about a million records per user.
I want to select X points (let's say 50) that are well spread in time from time A to time B. The problem is that the points are not spread evenly: if one point is at 6:00:00, the next may come 15 seconds, 20 seconds, or 4 minutes later, for example.
Selecting all the points for an id can take up to 60 seconds (because there are about a million points).
Is there any way to select the exact number of points I want, as well spread as possible, in a fast way?
Sample data:

+----+--------+---------------------+-------+
| #  | UserId | Time                | Value |
+----+--------+---------------------+-------+
| 1  | 1      | 2017-04-10 14:00:00 | 1     |
| 2  | 1      | 2017-04-10 14:00:10 | 10    |
| 3  | 1      | 2017-04-10 14:00:20 | 32    |
| 4  | 1      | 2017-04-10 14:00:35 | 80    |
| 5  | 1      | 2017-04-10 14:00:58 | 101   |
| 6  | 1      | 2017-04-10 14:01:00 | 203   |
| 7  | 1      | 2017-04-10 14:01:30 | 204   |
| 8  | 1      | 2017-04-10 14:01:40 | 205   |
| 9  | 1      | 2017-04-10 14:02:02 | 32    |
| 10 | 1      | 2017-04-10 14:02:15 | 7     |
| 11 | 1      | 2017-04-10 14:02:30 | 900   |
| 12 | 1      | 2017-04-10 14:02:45 | 22    |
| 13 | 1      | 2017-04-10 14:03:00 | 34    |
| 14 | 1      | 2017-04-10 14:03:30 | 54    |
| 15 | 1      | 2017-04-10 14:04:00 | 54    |
| 16 | 1      | 2017-04-10 14:06:00 | 60    |
| 17 | 1      | 2017-04-10 14:07:20 | 654   |
| 18 | 1      | 2017-04-10 14:07:40 | 32    |
| 19 | 1      | 2017-04-10 14:08:00 | 33    |
| 20 | 1      | 2017-04-10 14:08:12 | 32    |
| 21 | 1      | 2017-04-10 14:10:00 | 8     |
+----+--------+---------------------+-------+

I want to select the 11 "best" points from the list above, for the user with Id 1, from time 2017-04-10 14:00:00 to 2017-04-10 14:10:00.
Currently this is done on the server, after selecting all the points for the user. I calculate the "best times" by dividing the time difference evenly, which gives a list like 14:00:00, 14:01:00, ..., 14:10:00 (11 "best times", one per point). Then, for each "best time", I look for the closest point that has not been selected yet. The result is points 1, 6, 9, 13, 15, 16, 17, 18, 19, 20, 21.
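For reference, the evenly spaced "best times" described above can be generated directly in Postgres with generate_series; a minimal sketch using the sample bounds (for 50 points, divide the span by 49):

-- 11 evenly spaced "best times": step = total span / (number of points - 1)
SELECT generate_series(
         timestamp '2017-04-10 14:00:00'
       , timestamp '2017-04-10 14:10:00'
       , (timestamp '2017-04-10 14:10:00'
        - timestamp '2017-04-10 14:00:00') / 10) AS best_time;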

Edit:

I'm trying something like this:

SELECT *
FROM   "points"
WHERE  "Userid" = 1
AND   ("Time" = (SELECT "Time"
                 FROM   "points"
                 ORDER  BY abs(extract(epoch FROM '2017-04-10 14:00:00' - "Time"))
                 LIMIT  1)
   OR  "Time" = (SELECT "Time"
                 FROM   "points"
                 ORDER  BY abs(extract(epoch FROM '2017-04-10 14:01:00' - "Time"))
                 LIMIT  1)
   OR  "Time" = (SELECT "Time"
                 FROM   "points"
                 ORDER  BY abs(extract(epoch FROM '2017-04-10 14:02:00' - "Time"))
                 LIMIT  1));

The problems here are that:
A) It doesn't take into account points that have already been selected.
B) Because of the ORDER BY, each additional time adds ~1 second to the query's running time, so for 50 points I'm back at the 1-minute mark.


vq8itlhq1#

There is an optimization problem behind your question that's hard to solve with just SQL.
That said, your approximation can be implemented to use an index and perform well regardless of table size. You need this index if you don't have it already:

CREATE INDEX ON points ("Userid", "Time");

Query:

SELECT *
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:09:00+0'  -- 1 min *before* end!
                     , interval    '1 minute') grid(t)
LEFT  JOIN LATERAL (
   SELECT *
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   AND    "Time" <  grid.t + interval '1 minute'  -- same interval
   ORDER  BY "Time"
   LIMIT  1
   ) t ON true;

Most importantly, the rewritten query can use the above index and will be very fast, solving problem B).
It also addresses problem A) to some extent, as no point is returned more than once. If there is no row between two adjacent points in the grid, you get no row in the result. Using LEFT JOIN .. ON true keeps all grid rows and appends NULL in this case. Eliminate those NULL rows by switching to CROSS JOIN; you may get fewer result rows this way.
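A minimal sketch of that CROSS JOIN variant (same index, same interval logic; empty grid slots are simply dropped):

SELECT t.*
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:09:00+0'  -- 1 min *before* end!
                     , interval    '1 minute') grid(t)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   AND    "Time" <  grid.t + interval '1 minute'
   ORDER  BY "Time"
   LIMIT  1
   ) t;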
The original query only searches ahead of each grid point. You might append a second LATERAL join to also search behind each grid point (just another index scan), and take the closer of the two results (ignoring NULL); see the sketch after this list. But that introduces a few problems:

  • If one match is behind and the next is ahead, the gap widens.
  • You need special treatment for the lower and/or upper bound of the outer interval.
  • You need two LATERAL joins with two index scans.
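A minimal sketch of that two-sided variant, assuming the same index (note it can pick the same point for two adjacent grid points, which illustrates the first problem above):

SELECT grid.t
     , CASE
         WHEN behind."Time" IS NULL THEN ahead."Time"
         WHEN ahead."Time"  IS NULL THEN behind."Time"
         WHEN ahead."Time" - grid.t <= grid.t - behind."Time" THEN ahead."Time"
         ELSE behind."Time"
       END AS closest_time
FROM   generate_series(timestamptz '2017-04-10 14:00:00+0'
                     , timestamptz '2017-04-10 14:10:00+0'
                     , interval    '1 minute') grid(t)
LEFT   JOIN LATERAL (  -- nearest point at or after the grid point
   SELECT "Time"
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= grid.t
   ORDER  BY "Time"
   LIMIT  1
   ) ahead ON true
LEFT   JOIN LATERAL (  -- nearest point before the grid point
   SELECT "Time"
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" <  grid.t
   ORDER  BY "Time" DESC
   LIMIT  1
   ) behind ON true;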

You could use a recursive CTE to search 1 minute ahead of the last time actually found, but then the total number of rows returned varies even more.
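A minimal sketch of that recursive idea (take the first point at or after the lower bound, then repeatedly the first point at least 1 minute after the last hit):

WITH RECURSIVE walk AS (
   (  -- first point at or after the lower bound
   SELECT "Time"
   FROM   points
   WHERE  "Userid" = 1
   AND    "Time" >= timestamptz '2017-04-10 14:00:00+0'
   AND    "Time" <= timestamptz '2017-04-10 14:10:00+0'
   ORDER  BY "Time"
   LIMIT  1
   )
   UNION ALL
   SELECT (  -- first point at least 1 minute after the previous hit
      SELECT p."Time"
      FROM   points p
      WHERE  p."Userid" = 1
      AND    p."Time" >= w."Time" + interval '1 minute'
      AND    p."Time" <= timestamptz '2017-04-10 14:10:00+0'
      ORDER  BY p."Time"
      LIMIT  1)
   FROM   walk w
   WHERE  w."Time" IS NOT NULL  -- stop when no further point is found
   )
SELECT "Time" FROM walk WHERE "Time" IS NOT NULL;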
It all comes down to an exact definition of what you need, and where compromises are allowed.
Related:

  • What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
  • Aggregating the most recent joined records per week
  • MySQL/Postgres query 5 minutes interval data
  • Optimize GROUP BY query to retrieve latest row per user

cbjzeqam2#

This answer uses generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) joined to the points for comparison.
To save others time setting up the dataset:

t=# create table points(i int,"UserId" int,"Time" timestamp(0), "Value" int,b text);
CREATE TABLE
Time: 13.728 ms
t=# copy points from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1  | 1      | 2017-04-10 14:00:00 | 1     |
>> 2  | 1      | 2017-04-10 14:00:10 | 10    |
>> 3  | 1      | 2017-04-10 14:00:20 | 32    |
>> 4  | 1      | 2017-04-10 14:00:35 | 80    |
>> 5  | 1      | 2017-04-10 14:00:58 | 101   |
>> 6  | 1      | 2017-04-10 14:01:00 | 203   |
>> 7  | 1      | 2017-04-10 14:01:30 | 204   |
>> 8  | 1      | 2017-04-10 14:01:40 | 205   |
>> 9  | 1      | 2017-04-10 14:02:02 | 32    |
>> 10 | 1      | 2017-04-10 14:02:15 | 7     |
>> 11 | 1      | 2017-04-10 14:02:30 | 900   |
>> 12 | 1      | 2017-04-10 14:02:45 | 22    |
>> 13 | 1      | 2017-04-10 14:03:00 | 34    |
>> 14 | 1      | 2017-04-10 14:03:30 | 54    |
>> 15 | 1      | 2017-04-10 14:04:00 | 54    |
>> 16 | 1      | 2017-04-10 14:06:00 | 60    |
>> 17 | 1      | 2017-04-10 14:07:20 | 654   |
>> 18 | 1      | 2017-04-10 14:07:40 | 32    |
>> 19 | 1      | 2017-04-10 14:08:00 | 33    |
>> 20 | 1      | 2017-04-10 14:08:12 | 32    |
>> 21 | 1      | 2017-04-10 14:10:00 | 8     |
>> \.
COPY 21
Time: 7684.259 ms
t=# alter table points rename column "UserId" to "Userid";
ALTER TABLE
Time: 1.013 ms

Frankly, I don't quite understand the requirement. This is what I got from the description; the result differs from what the OP expects:

t=# with r as (
  with g as (
    select generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) s
  )
  select *,abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
  from g
  join points on g.s = date_trunc('minute',"Time")
  order by abs
  limit 11
)
select i, "Time","Value",abs
from r
order by i;
 i  |        Time         | Value | abs
----+---------------------+-------+-----
  4 | 2017-04-10 14:00:35 |    80 |  85
  5 | 2017-04-10 14:00:58 |   101 |  62
  6 | 2017-04-10 14:01:00 |   203 |  60
  7 | 2017-04-10 14:01:30 |   204 |  30
  8 | 2017-04-10 14:01:40 |   205 |  20
  9 | 2017-04-10 14:02:02 |    32 |   2
 10 | 2017-04-10 14:02:15 |     7 |  15
 11 | 2017-04-10 14:02:30 |   900 |  30
 12 | 2017-04-10 14:02:45 |    22 |  45
 13 | 2017-04-10 14:03:00 |    34 |  60
 14 | 2017-04-10 14:03:30 |    54 |  90
(11 rows)

I added the abs column to show why I believe these rows better fit the request.
