Data count in Athena is not right

Posted by sshcrbum on 2021-06-25 in Hive

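I have a table as shown in the attached image. Its data comes from Firehose (max buffer: 128 MB or 900 seconds).
When I run a simple count, it returns an odd number: 296! Yet the amount of data scanned is fairly large, 12 GB, and each record in this dataset is only 5 KB, so the scan should cover on the order of millions of records, not hundreds.
When I load and process the same dataset in a Glue job, it returns the expected count: 1778072.
I don't know whether this is because of the field request_query, which has type array<string>. The job is used for the actual workflow, but sometimes I just want to query basic fields such as ip, http_user_agent, ... For those tasks this schema is sufficient, and I would rather not write another job script and wait for it to succeed.
I hope there is a way to fix this.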

Edit

I ran the queries from the Athena console. Here are some sample queries:

SELECT count(case when request_api = 'collections' then 1 end)
FROM "request_events"
where event_day = '2020-03-01'
and tenant_id = 'devsite.com'

SELECT request_api, count(*)
FROM "request_events"
where event_day = '2020-03-01'
and tenant_id = 'devsite.com'
group by request_api

Edit 2
Attached is a sample of 12 records that I tested with; the query returns 1 record.
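
One possible cause worth checking, not confirmed anywhere in this post, is that the files hold records written back to back with no newline between them, so a reader that treats each line as one row would see many records as a single row. The sketch below assumes the data is JSON (the post does not state the format) and uses a made-up sample string; the function names are hypothetical. It compares the number of lines with the number of top-level JSON objects in the same text.

```python
import json


def count_lines(text: str) -> int:
    """Count non-empty lines, i.e. what a line-per-record reader would see."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_json_objects(text: str) -> int:
    """Count top-level JSON values, whether or not newlines separate them."""
    decoder = json.JSONDecoder()
    pos, count, n = 0, 0, len(text)
    while pos < n:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        _, pos = decoder.raw_decode(text, pos)
        count += 1
    return count


# Hypothetical sample: 12 records concatenated without any newline delimiter.
sample = "".join(
    json.dumps({"ip": f"10.0.0.{i}", "request_api": "collections"})
    for i in range(12)
)

print("lines:", count_lines(sample))
print("json objects:", count_json_objects(sample))
```

If the two numbers differ for a real file downloaded from the bucket, the low count may be coming from the file layout rather than from the request_query column type.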

Hive aws-glue amazon-athena aws-glue-data-catalog amazon-kinesis-firehose

Source: https://stackoverflow.com/questions/60647185/data-count-in-athena-is-not-right