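I am running the query below on Spark, but it does not work. When it reaches stage 13, it hangs. Disk usage keeps growing while the job sits in that same stage apparently doing nothing, until the disk is full. Is there something wrong with the query? Can you see any problem with this Spark query?
First, I create a view in Hive: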
create view q2_min_ps_supplycost as
select
p_partkey as min_p_partkey,
min(ps_supplycost) as min_ps_supplycost
from
part,
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
group by
p_partkey;
Then this is the query I run in Spark with HiveContext:
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region,
q2_min_ps_supplycost
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 37
and p_type like '%COPPER'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost = min_ps_supplycost
and p_partkey = min_p_partkey
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
1 Answer
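You can split the query into several smaller queries so that each one joins only two tables, with the last query producing the same result as the original. This keeps the intermediate files small and should avoid the hang.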
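As a rough sketch of that idea, the original query could be staged as follows. The intermediate table names (q2_eu_nation, q2_eu_supplier, q2_part_ps, q2_min_part_ps) are hypothetical and do not come from the question. The sketch assumes that CREATE TABLE ... AS SELECT is available through your HiveContext and that the view q2_min_ps_supplycost already exists as defined above. It has not been run against the asker's data.

```sql
-- Step 1: nations located in EUROPE (nation joined with region only).
create table q2_eu_nation as
select n_nationkey, n_name
from nation
join region on n_regionkey = r_regionkey
where r_name = 'EUROPE';

-- Step 2: suppliers in those nations (supplier joined with step 1 only).
create table q2_eu_supplier as
select s_suppkey, s_acctbal, s_name, s_address, s_phone, s_comment, n_name
from supplier
join q2_eu_nation on s_nationkey = n_nationkey;

-- Step 3: parts matching the size/type filter with their supply costs
-- (part joined with partsupp only).
create table q2_part_ps as
select p_partkey, p_mfgr, ps_suppkey, ps_supplycost
from part
join partsupp on p_partkey = ps_partkey
where p_size = 37
  and p_type like '%COPPER';

-- Step 4: keep only the rows at the minimum supply cost per part
-- (step 3 joined with the existing view only).
create table q2_min_part_ps as
select p_partkey, p_mfgr, ps_suppkey
from q2_part_ps
join q2_min_ps_supplycost
  on p_partkey = min_p_partkey
 and ps_supplycost = min_ps_supplycost;

-- Step 5: final result (step 4 joined with step 2 only).
select s_acctbal, s_name, n_name, p_partkey, p_mfgr,
       s_address, s_phone, s_comment
from q2_min_part_ps
join q2_eu_supplier on ps_suppkey = s_suppkey
order by s_acctbal desc, n_name, s_name, p_partkey
limit 100;
```

Each step applies its filters (r_name, p_size, p_type) as early as possible, so every intermediate table should be much smaller than the unfiltered join inputs. Drop the intermediate tables once the final result has been retrieved.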