I have a table like the one created by the following query:
with t as (
    select *
    from (
        (
            values ('james', '801xxxxxxx', 'james@gmail.com', 'ca', 'a'),
                   ('robert', '714xxxxxxx', '', 'ca', 'a'),
                   ('william', '', 'william@gmail.com', '', 'a'),
                   ('maria', '1234567890', 'maria@gmail.com', '', 'a'),
                   ('richard', '', 'richard@gmail.com', '', 'a'),
                   ('', '', 'james@gmail.com', '', 'b'),
                   ('maria', '1234567890', '', '', 'b'),
                   ('robert', '', '', 'ca', 'b')
        )
    ) t (first_name, phone, email, state, "type")
), a_t as (
    select *
    from t
    where "type" = 'a'
), b_t as (
    select *
    from t
    where "type" = 'b'
)
select *
from t
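How can I remove any duplicates between the different type values using three sets of rules? A row counts as a duplicate when any of the following holds: email matches; phone and first_name both match; or state and first_name both match.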
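To make the three rules concrete, here is a minimal sketch that lists every pair of rows from different types satisfying at least one rule, with one boolean column per rule. It rests on two assumptions the question does not state: empty strings stand for missing values and never count as a match, and rows are only compared between type a and type b. The column aliases (a_first_name, email_match, and so on) are made up for illustration.

```sql
-- Sample data from the question, written as a CTE with a column list.
with t (first_name, phone, email, state, "type") as (
    values ('james', '801xxxxxxx', 'james@gmail.com', 'ca', 'a'),
           ('robert', '714xxxxxxx', '', 'ca', 'a'),
           ('william', '', 'william@gmail.com', '', 'a'),
           ('maria', '1234567890', 'maria@gmail.com', '', 'a'),
           ('richard', '', 'richard@gmail.com', '', 'a'),
           ('', '', 'james@gmail.com', '', 'b'),
           ('maria', '1234567890', '', '', 'b'),
           ('robert', '', '', 'ca', 'b')
)
select *
from (
    select a.first_name as a_first_name,
           b.first_name as b_first_name,
           b.email      as b_email,
           -- Rule 1: same email (ASSUMPTION: '' means "missing", never a match).
           (a.email <> '' and a.email = b.email) as email_match,
           -- Rule 2: same phone and same first_name, both non-empty.
           (a.phone <> '' and a.first_name <> ''
            and a.phone = b.phone and a.first_name = b.first_name) as phone_name_match,
           -- Rule 3: same state and same first_name, both non-empty.
           (a.state <> '' and a.first_name <> ''
            and a.state = b.state and a.first_name = b.first_name) as state_name_match
    from t a
    cross join t b
    -- ASSUMPTION: only pairs made of one 'a' row and one 'b' row are compared.
    where a."type" = 'a'
      and b."type" = 'b'
) pairs
where email_match or phone_name_match or state_name_match;
```

On the sample data this should pair james (a) with the nameless b row through email, maria (a) with maria (b) through phone and first_name, and robert (a) with robert (b) through state and first_name. If blank values were allowed to match, the blank emails of robert (a), maria (b) and robert (b) would also pair up, which is probably not intended.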
The result should look similar to:
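I have tried several approaches. The most recent was to create two tables, a_t and b_t, where a_t holds the rows with type equal to a and b_t holds the rows with type equal to b, and then to try combining the two tables with different kinds of joins.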
[EDIT 1]
Following @topsail's suggestion, I tried the queries below and a few others; none of them gave the result I need.
with t as (
    select *
    from (
        (
            values ('james', '801xxxxxxx', 'james@gmail.com', 'ca', 'a'),
                   ('robert', '714xxxxxxx', '', 'ca', 'a'),
                   ('william', '', 'william@gmail.com', '', 'a'),
                   ('maria', '1234567890', 'maria@gmail.com', '', 'a'),
                   ('richard', '', 'richard@gmail.com', '', 'a'),
                   ('', '', 'james@gmail.com', '', 'b'),
                   ('maria', '1234567890', '', '', 'b'),
                   ('robert', '', '', 'ca', 'b')
        )
    ) t (first_name, phone, email, state, "type")
),
dedupe_one as (
    select distinct on (email)
        first_name, phone, email, state, "type"
    from t
),
dedupe_two as (
    select distinct on (phone, first_name)
        first_name, phone, email, state, "type"
    from t
),
dedupe_three as (
    select distinct on (state, first_name)
        first_name, phone, email, state, "type"
    from t
),
dedupe_four as (
    select distinct on (email) *
    from t
    union
    select distinct on (phone, first_name) *
    from t
    union
    select distinct on (state, first_name) *
    from t
),
dedupe_five as (
    select distinct on (email) *
    from (
        select distinct on (phone, first_name) *
        from (
            select distinct on (state, first_name) *
            from t
        ) foo2
    ) foo
)
select *
from dedupe_five
1 answer
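I finally figured out a solution.
I first assign an id to each row. Then, following @topsail's suggestion, I remove duplicates separately by email, by phone and first_name, and by state and first_name. At this point I have three tables. Next, whenever a row of the original table is found in the three deduplicated tables, I put a mark on it in the original table. Finally, I output every row of the original table that carries any kind of mark. This makes more sense if you look at the output of each step.
Here is the whole query: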
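dedupe_one:
dedupe_two:
dedupe_three:
marked_table:
final_table:
I don't think my solution is the most efficient or the most elegant way to solve the problem. If anyone has a better solution, please let me know.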
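The steps described in this answer (give each row an id, flag rows per rule, then filter on the flags) could be sketched roughly as follows. This is not the answer's original query: it swaps the distinct on step for exists checks, and the names numbered, flagged, dup_email, dup_phone_name and dup_state_name are hypothetical. It also assumes that empty strings are missing values that never match, and that when two rows of different types match, the row with the lower type (a) is the one to keep.

```sql
with t (first_name, phone, email, state, "type") as (
    values ('james', '801xxxxxxx', 'james@gmail.com', 'ca', 'a'),
           ('robert', '714xxxxxxx', '', 'ca', 'a'),
           ('william', '', 'william@gmail.com', '', 'a'),
           ('maria', '1234567890', 'maria@gmail.com', '', 'a'),
           ('richard', '', 'richard@gmail.com', '', 'a'),
           ('', '', 'james@gmail.com', '', 'b'),
           ('maria', '1234567890', '', '', 'b'),
           ('robert', '', '', 'ca', 'b')
), numbered as (
    -- Give every row an id; the numbering order is arbitrary and only
    -- serves to tell rows apart.
    select row_number() over () as id, t.*
    from t
), flagged as (
    select n.*,
           -- ASSUMPTION: a row is a duplicate only if a row with a lower
           -- type ('a' before 'b') matches it, and '' never matches.
           exists (select 1 from numbered o
                   where o."type" < n."type"
                     and n.email <> ''
                     and o.email = n.email) as dup_email,
           exists (select 1 from numbered o
                   where o."type" < n."type"
                     and n.phone <> '' and n.first_name <> ''
                     and o.phone = n.phone
                     and o.first_name = n.first_name) as dup_phone_name,
           exists (select 1 from numbered o
                   where o."type" < n."type"
                     and n.state <> '' and n.first_name <> ''
                     and o.state = n.state
                     and o.first_name = n.first_name) as dup_state_name
    from numbered n
)
-- Keep only the rows that no rule flagged as a duplicate.
select id, first_name, phone, email, state, "type"
from flagged
where not (dup_email or dup_phone_name or dup_state_name)
order by id;
```

Under these assumptions only the five type a rows remain for the sample data, since each type b row is flagged by exactly one rule. Keeping the three flags as separate columns makes it possible to inspect which rule removed a row before applying the final filter.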