When should REFRESH TABLE my_table be executed in Spark?

w8f9ii69 · asked 2021-06-26 · in Hive

Consider the following code:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

val path = ...
val dataFrame: DataFrame = ...

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
// Register the DataFrame as a temporary view so it can be queried with SQL.
dataFrame.createOrReplaceTempView("my_table")
val results = hiveContext.sql(s"select * from my_table")
// Append the results as ORC files under `path`, partitioned by my_column.
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
hiveContext.sql("REFRESH TABLE my_table")

This code is executed twice with the same path but different DataFrames. The first run succeeds, but the second one fails with:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I tried clearing the cache and also calling hiveContext.dropTempTable("tableName"), and neither had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or some other variant) to fix an error like this?

ippsafx7

You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and this solved it. Spark caches the table's file listing, so when an earlier run has replaced the underlying files, the cached metadata still points at files that no longer exist; refreshing just before the write forces Spark to re-list them.

spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
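
Applied to the code from the question, a minimal sketch could look like the following. It assumes the same path, dataFrame, my_table, and my_column as above, plus a Spark 2.x SparkSession named spark; the placement of the refresh is the point, not the surrounding details.

import org.apache.spark.sql.SaveMode

// Register the incoming DataFrame as a temporary view, as in the question.
dataFrame.createOrReplaceTempView("my_table")
val results = spark.sql("select * from my_table")

// Refresh before the write: this invalidates the cached file listing,
// so the append does not reference files that a previous run has replaced.
spark.catalog.refreshTable("my_table")

results.write
  .mode(SaveMode.Append)
  .partitionBy("my_column")
  .format("orc")
  .save(path)

Calling REFRESH TABLE after the write, as in the original snippet, comes too late for the write itself, which is why the first run succeeds and the second one fails.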
