sql—所有属性上null和not null计数的分析值

yrdbyhpb  于 2021-07-13  发布在  Hive
关注(0)|答案(2)|浏览(353)

我有一个表,它在下表中有3个属性,使用hive/sql我有一个由50多个属性组成的用例,这些属性要求我对这个值nullcount,notnull等进行分析,根据当前的用例,我不能使用pyspark/plsql进行任何数据转换,这是我遇到的障碍
为了演示的目的,假设我有3个属性,对于马来西亚和泰国的2个表,它们只需要notnull计数

describe malaysia

ColumnName |  Datatype
custid     |  String
productid  |  String
addressid  |  String

describe thailand

ColumnName |  Datatype
custid     |  String
productid  |  String
addressid  |  String

按照我目前的方法,我就是这样做的,

select COUNT(CASE WHEN custid IS NOT NULL then 1 ELSE NULL END) as "NullCount",
COUNT(CASE WHEN productid IS NOT NULL then 1 ELSE NULL END) as "NullCount",
COUNT(CASE WHEN addressid IS NOT NULL then 1 ELSE NULL END) as "NullCount" 
from malaysia 
union
select COUNT(CASE WHEN custid IS NOT NULL then 1 ELSE NULL END) as "NullCount",
COUNT(CASE WHEN productid IS NOT NULL then 1 ELSE NULL END) as "NullCount",
COUNT(CASE WHEN addressid IS NOT NULL then 1 ELSE NULL END) as "NullCount" 
from thailand

因为我知道每个表有50多个属性,所以查询会很长,我觉得这不是正确的方法,因为维护会很混乱和困难。我想知道是否有更好的方法产生同样的结果。

6yoyoihd

6yoyoihd1#

修正你的数据模型!您应该将所有数据放在一个表中,表中有一列 country . 然后你可以简单地做:

select country,
       count(custid) as custid_not_null,
       count(productid) as productid_not_null,
       count(addressid) as addressid_not_null
from all_countries
group by country ;

请注意 count(<expression>) 统计非- NULL 表达式的值。这个 case 表达完全是多余的。
如果没有,您可以创建一个视图:

create view all_counties as
    select 'malaysia' as country, t.*
    from malaysia m
    union all
    select 'thailand' as country, t.*
    from thailand t
    union all
    . . .;

但我建议你把所有的数据放在一张表里。

ztigrdn8

ztigrdn82#

您需要稍微更改一下sql。用sum代替count。为了更好的分析,我增加了总数。我没有任何简单的方法,但你可以把列列表放在xl中,并从中创建一个公式,自动创建这个字符串 SUM(CASE WHEN custid IS NOT NULL then 1 ELSE 0 END) as "Nullcustid" .

select 
COUNT(*) "TotalCount",
SUM(CASE WHEN custid IS NOT NULL then 1 ELSE 0 END) as "NullCust",
SUM(CASE WHEN productid IS NOT NULL then 1 ELSE  0 END) as "Nullprod",
SUM(CASE WHEN addressid IS NOT NULL then 1 ELSE 0 END) as "Nulladdress" 
from malaysia 
union
select 
COUNT(*) "TotalCount",
SUM(CASE WHEN custid IS NOT NULL then 1 ELSE 0 END) as "NullCust",
SUM(CASE WHEN productid IS NOT NULL then 1 ELSE  0 END) as "Nullprod",
SUM(CASE WHEN addressid IS NOT NULL then 1 ELSE 0 END) as "Nulladdress" 
from thailand

相关问题