Impala: how to query against multiple Parquet files with different schemata

Posted by zfycwa2u on 2021-05-29 in Hadoop


In Spark 2.1 I often use

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of Parquet files, even when they have different schemas. I then run SQL queries against the DataFrame using Spark SQL.
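To make "different schemas" concrete, here is a minimal pure-Python sketch of what merging schemas by column name amounts to. It is an illustration only, not Spark's or Impala's implementation; the `Schema` alias, the `merge_schemas` function, and the sample column names are all hypothetical. In Spark itself, merging is requested through the Parquet `mergeSchema` read option.

```python
from typing import Dict, Iterable

# Hypothetical simplified schema: column name -> type name.
Schema = Dict[str, str]


def merge_schemas(schemas: Iterable[Schema]) -> Schema:
    """Union the columns of several schemas, keeping first-seen column order."""
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            # The same column with two different types cannot be merged here.
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"conflicting types for {column!r}: {merged[column]} vs {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# Two made-up file schemas that share only the "id" column.
file_a = {"id": "bigint", "name": "string"}
file_b = {"id": "bigint", "price": "double"}

print(merge_schemas([file_a, file_b]))
# {'id': 'bigint', 'name': 'string', 'price': 'double'}
```

Rows from a file that lacks a merged column would then carry a null in that column.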
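Now I would like to try Impala, because I read the Wikipedia article, which contains these sentences:
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.
So it sounds like it could also fit my use case (and maybe perform a bit faster).
But when I try something like this: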

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';

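I get an exception:
AnalysisException: Cannot infer schema, path is not a file
So my questions are: Is it possible to read a folder containing multiple Parquet files with Impala? Will Impala perform schema merging like Spark does? What query do I need to do this? I could not find any information about it on Google (always a bad sign...).
Thanks!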

hadoop impala apache-spark-sql parquet

Source: https://stackoverflow.com/questions/48343036/impala-how-to-query-against-multiple-parquet-files-with-different-schemata

1 answer


insrf1ej1#

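As I understand it, you have some Parquet files and you want to view them through an Impala table? My explanation follows.
You can create an external table and set its location to the directory containing the Parquet files, like this: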

CREATE EXTERNAL TABLE ingest_parquet_files (col1 string, col2 string) STORED AS PARQUET LOCATION "/path/to/my/files/";

After creating the table, you also have another option for loading the Parquet files:

LOAD DATA INPATH "Your/HDFS/PATH" INTO TABLE schema.ingest_parquet_files;

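What you are trying will also work, but you have to remove the wildcard character: LIKE PARQUET expects a path after it and looks for the files at that location.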

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/'
STORED AS PARQUET
LOCATION '/tmp';
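As a small sketch of the "no wildcard" rule, the hypothetical Python helper below assembles the statement above and refuses glob patterns before anything is sent to Impala. The function name `like_parquet_ddl` and its parameters are made up for illustration. Note that the error message in the question and the placeholder `hdfs_path_of_parquet_file` in the Cloudera template both suggest the LIKE PARQUET path may need to point at a single Parquet file; that is not verified here.

```python
def like_parquet_ddl(table: str, parquet_path: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE ... LIKE PARQUET statement as a string.

    Hypothetical helper: values are interpolated without escaping,
    so this is for illustration only.
    """
    # Reject shell-style glob characters such as the '*' used in the question.
    if any(ch in parquet_path for ch in "*?["):
        raise ValueError("LIKE PARQUET needs one concrete path, not a glob pattern")
    return (
        f"CREATE EXTERNAL TABLE {table} LIKE PARQUET '{parquet_path}'\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


# Reproduces the statement from the answer.
print(like_parquet_ddl("ingest_parquet_files", "/path/to/my/files/", "/tmp"))

# The path from the question is rejected because of the wildcard.
try:
    like_parquet_ddl("ingest_parquet_files", "/path/to/my/files/*.parquet", "/tmp")
except ValueError as err:
    print(f"rejected: {err}")
```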

Below is a template you can refer to, taken from the Cloudera Impala documentation.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE PARQUET 'hdfs_path_of_parquet_file'
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
data_type:
    primitive_type
  | array_type
  | map_type
  | struct_type

Note that the user you run as should have read and write access to any path you give to Impala. You can achieve this with the following steps:


# Login as hive superuser to perform the below steps

create role <role_name_x>;

# For granting to database

grant all on database <database_name> to role <role_name_x>;

# For granting to HDFS path

grant all on URI '/hdfs/path' to role <role_name_x>;

# Granting the role to the group of the user you will use to run the impala job

grant role <role_name_x> to group <your_user_name>;

# After you perform the above steps you can validate with the commands below

# show grant role should list the URI or database access when you run it on the role name, as below

show grant role <role_name_x>;

# Now to validate if the user has access to the role

show role grant group <your_user_name>;

More information on roles and permissions

Answered on 2021-05-29
