Reading Parquet files into Hive on Spark

smdncfj3 · posted 2021-05-29 in Hadoop

I'm trying to read Parquet files into Hive on Spark, and I found that I should do something like this:

  CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='/files/events/avro_events_scheme.avsc');
  CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION '/files/events/parquet_events/';

My Avro schema is:

  {
    "type" : "parquet_file",
    "namespace" : "events",
    "name" : "events",
    "fields" : [
      { "name" : "category" , "type" : "string" },
      { "name" : "duration" , "type" : "long" },
      { "name" : "name" , "type" : "string" },
      { "name" : "user_id" , "type" : "string" },
      { "name" : "value" , "type" : "long" }
    ]
  }
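A side note on the schema above: the Avro specification only allows "record" as the top-level complex type name here; "parquet_file" is not a valid Avro type, so the AvroSerDe would reject this schema even once the DDL parses. A corrected version, with the field list unchanged:

```json
{
  "type" : "record",
  "namespace" : "events",
  "name" : "events",
  "fields" : [
    { "name" : "category" , "type" : "string" },
    { "name" : "duration" , "type" : "long" },
    { "name" : "name" , "type" : "string" },
    { "name" : "user_id" , "type" : "string" },
    { "name" : "value" , "type" : "long" }
  ]
}
```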

As a result, I get this error:

  org.apache.spark.sql.catalyst.parser.ParseException:
  Operation not allowed: ROW FORMAT SERDE is incompatible with format 'avro',
  which also specifies a serde (line 1, pos 0)
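The error message points at the DDL rather than the schema: in Spark SQL, `STORED AS AVRO` already implies the Avro serde, so combining it with an explicit `ROW FORMAT SERDE` clause is rejected. A minimal sketch of the first statement with the redundant clause dropped (same schema URL as above):

```sql
CREATE TABLE avro_test
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='/files/events/avro_events_scheme.avsc');
```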

50pmv0ei1#

I think we have to add the INPUTFORMAT and OUTPUTFORMAT classes:

  CREATE TABLE parquet_test
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
  'avro.schema.url'='/hadoop/avro_events_scheme.avsc');

I hope the above works.
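Note that this DDL still creates an Avro-backed table, despite the `parquet_test` name. If the goal is a table over the existing Parquet files, an alternative sketch is to declare the columns explicitly and let Hive use its native Parquet serde; the column list below is inferred from the question's Avro schema and the LOCATION is the questioner's path, so adjust both to your data:

```sql
CREATE EXTERNAL TABLE parquet_test (
  category STRING,
  duration BIGINT,
  name     STRING,
  user_id  STRING,
  value    BIGINT
)
STORED AS PARQUET
LOCATION '/files/events/parquet_events/';
```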
