Can't query Spark DataFrame from Hive after saveAsTable - Spark SQL specific format, which is NOT compatible with Hive

v64noz0r posted on 2021-05-29 in Hadoop

I am trying to save a DataFrame as an external table that will be queried both with Spark and Hive, but somehow I cannot query or see any data from Hive. It works fine in Spark.
Here is how to reproduce the problem:

scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]

scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]                                      
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Here is how I query it in Hive:

Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col  |
+------------------+
+------------------+
No rows selected (0.213 seconds)

But it works fine in Spark:

scala> spark.sql("SELECT * FROM hive_random").show
+---+------+                                                                    
| id|random|
+---+------+
|  3|  13.0|
| 15|  13.0|
...
|  8|  46.0|
|  9|  65.0|
+---+------+

qnakjoqk1#

The semantics of bucketed tables are different in Spark and Hive.
The documentation describes the semantic differences in detail. It says:

Data is written to bucketed tables but the output does not adhere with expected  
bucketing spec. This leads to incorrect results when one tries to consume the  
Spark written bucketed table from Hive.

Workaround: if you need reads from both engines, the writes need to be done from Hive.
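
A minimal sketch of that workaround, assuming Spark only writes a plain, non-bucketed staging table (which Spark persists in a Hive-compatible format) and the bucketed target table is then created and filled from the Hive side; the staging and target table names and the staging path below are illustrative, not from the original post:

// Spark side: no bucketBy/sortBy, so the resulting ORC table stays readable from Hive.
df.write
  .mode("overwrite")
  .option("path", "s3a://company-bucket/dev/hive_confs_staging/")
  .format("orc")
  .saveAsTable("hive_random_staging")

// Hive side (run in Beeline, not in Spark), so Hive applies its own bucketing semantics:
//   CREATE TABLE hive_random_bucketed (id BIGINT, random DOUBLE)
//     CLUSTERED BY (random) SORTED BY (random) INTO 10 BUCKETS STORED AS ORC;
//   SET hive.enforce.bucketing=true;
//   INSERT OVERWRITE TABLE hive_random_bucketed SELECT id, random FROM hive_random_staging;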


j13ufse22#

I would suggest a couple of things. First, try setting the Hive execution engine to use Spark: set hive.execution.engine=spark; Second, try creating the external table in the metastore first, and then save the data into that table.
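
A rough sketch of the second suggestion, assuming the external table is pre-created with plain Hive DDL through spark.sql and then filled with insertInto; the DDL deliberately leaves bucketing out, since the excerpt quoted in the first answer says Spark does not produce Hive-compatible bucketed output (the first suggestion, set hive.execution.engine=spark, is a Hive-side setting issued in Beeline):

// Create the metastore entry with Hive DDL so it is a plain Hive table,
// not a Spark-specific data source table.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")

// Write into the pre-created table instead of letting saveAsTable define its own layout.
// insertInto matches columns by position, so keep the select order aligned with the DDL.
df.select("id", "random").write.mode("overwrite").insertInto("hive_random")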


inb24sb23#

There is a warning right after the saveAsTable call. That is where the hint is:
'Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that saveAsTable creates RDD partitions rather than Hive partitions; the workaround is to create the table through HQL before calling DataFrame.saveAsTable.
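
To see what that warning actually persisted, one way to check from the Spark side is to describe the table; for a table stored in the Spark SQL specific format, the detailed section shows a data source provider and Spark-side bucket columns rather than a Hive bucketing spec (a hedged illustration, not part of the original answer):

// Inspect the metastore entry Spark created for hive_random.
// The "# Detailed Table Information" block lists the provider and the bucket
// columns, which Hive does not know how to interpret.
spark.sql("DESCRIBE FORMATTED hive_random").show(100, truncate = false)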
