DecimalType fields in Avro files created with Spark

edqdpe6u | posted 2021-06-24 in Hive

I created Avro data files using Spark 2 and then defined a Hive table pointing at those Avro data files.

import org.apache.spark.sql.types.{DataTypes, IntegerType, LongType}

val trades = spark.read
  .option("compression", "gzip")
  .csv("file:///data/nyse_all/nyse_data")
  .select(
    $"_c0".as("stockticker"),
    $"_c1".as("tradedate").cast(IntegerType),
    $"_c2".as("openprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c3".as("highprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c4".as("lowprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c5".as("closeprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c6".as("volume").cast(LongType))

trades.repartition(4, $"tradedate", $"volume")
  .sortWithinPartitions($"tradedate".asc, $"volume".desc)
  .write.format("com.databricks.spark.avro")
  .save("/user/pawinder/spark_practice/problem6/data/nyse_data_avro")
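
For reference, the DataFrame itself still carries the decimal type at this point; a quick printSchema check (output abbreviated) shows:

trades.printSchema()
// root
//  |-- stockticker: string (nullable = true)
//  |-- tradedate: integer (nullable = true)
//  |-- openprice: decimal(10,2) (nullable = true)
//  ... (highprice, lowprice, closeprice likewise decimal(10,2))
//  |-- volume: long (nullable = true)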

spark.sql("create external table pawinder.nyse_data_avro(stockticker string, tradedate int, openprice decimal(10,2) , highprice decimal(10,2), lowprice decimal(10,2), closeprice decimal(10,2), volume bigint) ROW FORMAT SERDE  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'  STORED AS INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' location '/user/pawinder/spark_practice/problem6/data/nyse_data_avro'")

Querying the Hive table fails with the error below:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable@178270b2
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable@178270b2
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:563)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    ... 8 more
Caused by: org.apache.avro.AvroTypeException: Found string, expecting union
With some debugging, I found that in the Avro data files the fields defined as decimal(10,2) are tagged as string:

[pawinder@gw02 ~]$ hdfs dfs -cat /user/pawinder/spark_practice/problem6/data/nyse_data_avro/part-00003-f1ca3b0a-f0b4-4aa8-bc26-ca50a0a16fe3-c000.avro | more
    Objavro.schema{"type":"record","name":"topLevelRecord","fields":[
      {"name":"stockticker","type":["string","null"]},
      {"name":"tradedate","type":["int","null"]},
      {"name":"openprice","type":["string","null"]},
      {"name":"highprice","type":["string","null"]},
      {"name":"lowprice","type":["string","null"]},
      {"name":"closeprice","type":["string","null"]},
      {"name":"volume","type":["long","null"]}]}

I can query the same Hive table from the spark-shell. Is the Avro SerDe unable to recognize the Spark SQL DecimalType? I am using Spark 2.3.
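
One thing I may try, sketched below but untested: the Databricks spark-avro library documents DecimalType as being written out as an Avro string, while the built-in avro source added in Spark 2.4 writes decimals with the Avro decimal logical type, so this assumes upgrading from 2.3 (the nyse_data_avro_v2 path is just a placeholder):

// rewrite the data with the built-in Avro source (Spark 2.4+), which keeps
// decimal columns as Avro decimal logical types instead of plain strings
trades.repartition(4, $"tradedate", $"volume")
  .sortWithinPartitions($"tradedate".asc, $"volume".desc)
  .write.format("avro")
  .save("/user/pawinder/spark_practice/problem6/data/nyse_data_avro_v2")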

No answers yet!

