AWS Athena query on Parquet files: using columns in the WHERE clause

Posted by mrzz3bfm on 2021-06-24 in Hive


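We are planning to use Athena as a backend service for our data, which is stored in S3 as partitioned Parquet files.
We are interested in how adding extra columns to a query's WHERE clause affects query run time. For example, we have 10 million records in one Hive partition (the table is partitioned on the 'date' column).
All of the queries below return the same number of rows: 10 million. Since Parquet is a columnar format, do all of these queries take the same time, or does adding extra columns to the WHERE clause reduce the run time? I tried to test this, but the results were inconsistent, I guess because there is some queuing time as well.

select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'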

Hive parquet amazon-web-services amazon-athena

Source: https://stackoverflow.com/questions/62870653/aws-athena-query-on-parquet-file-using-columns-in-where-clause

1 Answer


piok6c0g1#

Parquet files contain page "indexes" (min, max, and bloom filters). If the data was sorted by the columns in question during insert, for example:

insert overwrite table mytable partition (dt)
select col1, --some columns
       type,
       subtype,
       dt
  from source_table --hypothetical source table name; the original omitted the from clause
distribute by dt
       sort by type, subtype

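then these indexes can work efficiently: rows with the same type and subtype end up in the same pages, and the indexes are used to select which data pages to read. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Enabling predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html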
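
To make the effect of sorting concrete, here is a rough, self-contained Python sketch. It is not how Athena or a Parquet reader is actually implemented; it only simulates the idea of per-page min/max statistics. The page size, the four type values, and the helper names build_pages and pages_to_read are all assumptions made up for illustration.

```python
import random

PAGE_SIZE = 1000  # rows per page (illustrative; real Parquet pages are sized in bytes)


def build_pages(values, page_size=PAGE_SIZE):
    """Split a column into pages and record each page's (min, max) statistics."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append((min(chunk), max(chunk)))
    return pages


def pages_to_read(pages, wanted):
    """Count pages whose [min, max] range could contain `wanted`."""
    return sum(1 for lo, hi in pages if lo <= wanted <= hi)


random.seed(0)
types = ["AAA", "MMM", "XXX", "ZZZ"]  # hypothetical values of the `type` column
column = [random.choice(types) for _ in range(100_000)]

# Same data, written once in arrival order and once sorted by `type`.
unsorted_pages = build_pages(column)
sorted_pages = build_pages(sorted(column))

unsorted_hits = pages_to_read(unsorted_pages, "XXX")
sorted_hits = pages_to_read(sorted_pages, "XXX")

print(f"unsorted: read {unsorted_hits} of {len(unsorted_pages)} pages")
print(f"sorted:   read {sorted_hits} of {len(sorted_pages)} pages")
```

In the unsorted layout nearly every page spans the full range of values, so min/max statistics cannot rule anything out for type='XXX'. In the sorted layout only the pages that actually hold 'XXX' rows (plus the boundary pages) remain candidates, which is why the extra WHERE predicates can reduce the work only when the data was written sorted by those columns.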

Posted 2021-06-24
