I am trying to save a dataframe as an external table that will be queried both with Spark and Hive, but somehow I cannot query or see any data from Hive. It works in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how it is queried in Hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col |
+------------------+
+------------------+
No rows selected (0.213 seconds)
But it works fine in Spark:
scala> spark.sql("SELECT * FROM hive_random").show
+---+------+
| id|random|
+---+------+
| 3| 13.0|
| 15| 13.0|
...
| 8| 46.0|
| 9| 65.0|
+---+------+
3 Answers
Answer #1
The semantics of bucketed tables are different in Spark and Hive.
The documentation has the details of the semantic differences.
It says:
Workaround: if reads from both engines are needed, the writes need to happen from Hive.
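A minimal sketch of that workaround, run from beeline/Hive rather than the spark-shell; the table name hive_random_bucketed and the staging table hive_random_stage are illustrative assumptions, not part of the answer:
-- run in beeline / Hive, not in the spark-shell
-- (only needed on Hive 1.x; Hive 2.x always enforces bucketing)
SET hive.enforce.bucketing = true;

CREATE TABLE hive_random_bucketed (id BIGINT, random DOUBLE)
CLUSTERED BY (random) SORTED BY (random) INTO 10 BUCKETS
STORED AS ORC;

-- Hive lays out the bucket files itself, so both engines can read the result;
-- hive_random_stage is a hypothetical staging table holding the raw rows
INSERT OVERWRITE TABLE hive_random_bucketed
SELECT id, random FROM hive_random_stage;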
Answer #2
I would suggest trying a couple of things. First, try setting the Hive execution engine to use Spark.
set hive.execution.engine=spark;
Second, try creating the external table in the metastore first, and then save the data into that table.
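A hedged sketch of that second suggestion, as Hive DDL run from beeline; the column list and S3 path come from the question, while the table name hive_random_ext is an assumption:
-- define a Hive-native external table over the S3 path first
CREATE EXTERNAL TABLE hive_random_ext (id BIGINT, random DOUBLE)
STORED AS ORC
LOCATION 's3a://company-bucket/dev/hive_confs/';
Data can then be written into that pre-defined table from Spark (for example with insertInto, as in the sketch after the next answer) instead of letting saveAsTable persist a Spark-specific table definition.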
Answer #3
There is a warning after the saveAsTable call. That is where the hint lies -
'Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that 'saveAsTable' creates RDD partitions but not Hive partitions; the workaround is to create the table through HQL before calling DataFrame.saveAsTable.
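A minimal spark-shell sketch of that workaround, assuming a plain (non-bucketed) Hive ORC table is acceptable; the table name hive_random_hql is hypothetical, and insertInto is used here instead of saveAsTable so the pre-created Hive definition is kept - Spark 2.x cannot write bucketed output in Hive's own bucket layout, so bucketBy is dropped in this sketch:
// create a Hive-native table definition first, via HQL
spark.sql("""
  CREATE TABLE IF NOT EXISTS hive_random_hql (id BIGINT, random DOUBLE)
  STORED AS ORC
""")

// then load the DataFrame into the pre-defined table instead of letting
// saveAsTable persist a Spark-specific table definition
df.write.mode("overwrite").insertInto("hive_random_hql")

// Hive can now read the rows, e.g. SELECT * FROM hive_random_hql;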