Spark query fails: processing blocks in one stage and stalls until the disk fills up

muk1a3rh · Posted 2021-05-29 in Hadoop

I am running the query below on Spark, but it does not work. When it reaches stage 13 it blocks: disk usage keeps growing while the stage sits there doing nothing, until the disk is full. Do you see anything wrong with the Spark query?
First, I create a view in Hive:

create view q2_min_ps_supplycost as
select
    p_partkey as min_p_partkey,
    min(ps_supplycost) as min_ps_supplycost
from
    part,
    partsupp,
    supplier,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
group by
    p_partkey;

Then I run the following query in Spark through a HiveContext:

select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
    from
        part,
        supplier,
        partsupp,
        nation,
        region,
        q2_min_ps_supplycost
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 37
        and p_type like '%COPPER'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost = min_ps_supplycost
        and p_partkey = min_p_partkey
    order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey
    limit 100;

omjgkv6w1#

You can split the query into several smaller queries, each joining only two tables; chaining them together produces the same final result. This minimizes the size of the intermediate files and should avoid the blocking.
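A sketch of that staged approach, using intermediate tables. The table names (eu_supplier, copper_part, eu_partsupp) are illustrative and not from the original post; the idea is to apply the selective filters (r_name, p_size, p_type) as early as possible so each join works on small inputs:

```sql
-- Step 1: restrict suppliers to the EUROPE region first.
create table eu_supplier as
select s_suppkey, s_acctbal, s_name, s_address, s_phone, s_comment, n_name
from supplier, nation, region
where s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'EUROPE';

-- Step 2: filter part before joining anything to it.
create table copper_part as
select p_partkey, p_mfgr
from part
where p_size = 37
  and p_type like '%COPPER';

-- Step 3: join partsupp against the two small intermediate tables.
create table eu_partsupp as
select p.p_partkey, p.p_mfgr, s.s_acctbal, s.s_name, s.n_name,
       s.s_address, s.s_phone, s.s_comment, ps.ps_supplycost
from copper_part p
join partsupp ps on p.p_partkey = ps.ps_partkey
join eu_supplier s on s.s_suppkey = ps.ps_suppkey;

-- Step 4: final join against the min-cost view, then sort and limit.
select s_acctbal, s_name, n_name, p_partkey, p_mfgr,
       s_address, s_phone, s_comment
from eu_partsupp, q2_min_ps_supplycost
where p_partkey = min_p_partkey
  and ps_supplycost = min_ps_supplycost
order by s_acctbal desc, n_name, s_name, p_partkey
limit 100;
```

Each step materializes a much smaller intermediate result than the original six-way join, so no single stage has to shuffle the full cross-product to disk.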
