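I am new to Apache Hudi and am trying to write a DataFrame to a Hudi table using spark-shell. For this first write I have not created any table beforehand and I am writing in overwrite mode, so I expect the write to create the Hudi table.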
spark-shell \
--packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
//Initialize a Spark Session for Hudi
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SparkSession
val spark1 = SparkSession.builder().appName("hudi-datalake").master("local[*]").config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").config("spark.sql.hive.convertMetastoreParquet", "false").getOrCreate()
//Write to a Hudi Dataset
val inputDF = Seq(
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")
val hudiOptions = Map[String,String](
HoodieWriteConfig.TABLE_NAME -> "work.hudi_test",
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "work.hudi_test",
DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName)
// Upsert Data
// Create a new DataFrame from the first row of inputDF with a different creation_date value
val updateDF = inputDF.limit(1).withColumn("creation_date", lit("2014-01-01"))
updateDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).saveAsTable("work.hudi_test")
When I run this write statement, I get the following error message:
java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
Can anyone guide me on how to write this statement correctly?
1 Answer
Below is a working example for your problem in PySpark:
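The answer's code is not preserved in this copy, so what follows is only a minimal PySpark sketch of the same kind of write, not the answerer's original. It assumes the string option keys used by the Hudi Spark datasource, omits the Hive sync options from the question, and uses a hypothetical base path `/tmp/hudi_test` and table name `hudi_test`. The helper names `hudi_options`, `write_hudi`, and `main` are made up for illustration.

```python
# Rows and column names taken from the question.
COLUMNS = ["id", "creation_date", "last_update_time"]
ROWS = [
    ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
    ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
    ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
    ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
    ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
    ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
]


def hudi_options(table_name, operation="upsert"):
    """Write options mirroring the Scala map in the question (Hive sync omitted)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "creation_date",
        "hoodie.datasource.write.precombine.field": "last_update_time",
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, table_name, mode):
    """Write a Spark DataFrame to a Hudi table stored under base_path."""
    (df.write.format("org.apache.hudi")
        .options(**hudi_options(table_name))
        .mode(mode)
        .save(base_path))


def main(spark, base_path="/tmp/hudi_test", table_name="hudi_test"):
    """Create the table on first write; base_path and table_name are assumptions."""
    input_df = spark.createDataFrame(ROWS, COLUMNS)
    write_hudi(input_df, base_path, table_name, "overwrite")
```

In a `pyspark` shell started with a Hudi bundle that matches the installed Spark version, `main(spark)` would perform the initial write. Note that the sketch uses `save(path)` rather than `saveAsTable`, which is the pattern the Hudi quick start follows.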
Output:
The Hudi table on the file system looks like this:
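Note: because you are modifying the partition column (2015-01-01 -> 2014-01-01), the upsert operation actually creates a new partition and performs an insert there instead of updating the existing record. You can see this in the output.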
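The behavior in this note can be pictured with a toy model. Assuming Hudi's default, non-global index, a record is identified by its partition path together with its record key, so the same `id` under a new `creation_date` looks like a brand-new record. The code below is plain Python, not Hudi code; the `upsert` helper and the dict-based table are hypothetical.

```python
def upsert(table, records):
    """Upsert records into `table`, a dict keyed by (partition_path, record_key).

    Returns (inserts, updates) so the effect of each write is visible.
    """
    inserts = updates = 0
    for rec in records:
        key = (rec["creation_date"], rec["id"])
        if key in table:
            updates += 1
        else:
            inserts += 1
        table[key] = rec
    return inserts, updates


table = {}
row = {
    "id": "100",
    "creation_date": "2015-01-01",
    "last_update_time": "2015-01-01T13:51:39.340396Z",
}
print(upsert(table, [row]))    # (1, 0): initial insert

# Same id, different partition value: treated as an insert into a new partition.
moved = dict(row, creation_date="2014-01-01")
print(upsert(table, [moved]))  # (1, 0)
print(sorted(table))           # [('2014-01-01', '100'), ('2015-01-01', '100')]

# Same id, same partition: this is what counts as an update.
print(upsert(table, [dict(row)]))  # (0, 1)
```

Id 100 ends up in both partitions, which matches what the note describes.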
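I have also provided an update example that sets last_update_time to 2016-01-01T13:51:39.340396Z. Because the partition column is left unchanged, this actually updates id 100 in partition 2015-01-01, changing last_update_time from 2015-01-01T13:51:39.340396Z to 2016-01-01T13:51:39.340396Z.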
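A true update of this kind could be sketched as follows. This is an assumption-laden illustration rather than the answerer's code: the helper `upsert_rows`, the base path, and the table name are hypothetical, and the option keys are the string forms used by the Hudi Spark datasource. The key points are that `creation_date` stays the same and that the write uses append mode with the `upsert` operation.

```python
COLUMNS = ["id", "creation_date", "last_update_time"]

# Row for id 100 from the question with only last_update_time changed.
# creation_date (the partition column) is untouched, so the record stays
# in partition 2015-01-01 and is updated in place.
UPDATED_ROW = ("100", "2015-01-01", "2016-01-01T13:51:39.340396Z")


def upsert_rows(spark, rows, base_path, table_name):
    """Upsert rows into an existing Hudi table at base_path (hypothetical helper)."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "creation_date",
        "hoodie.datasource.write.precombine.field": "last_update_time",
        "hoodie.datasource.write.operation": "upsert",
    }
    df = spark.createDataFrame(rows, COLUMNS)
    # "append" adds a new commit to the existing table instead of replacing it.
    (df.write.format("org.apache.hudi")
        .options(**options)
        .mode("append")
        .save(base_path))
```

From a `pyspark` shell, `upsert_rows(spark, [UPDATED_ROW], "/tmp/hudi_test", "hudi_test")` would apply the change, assuming the table was previously written to that path.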
More examples can be found in the Hudi Quick Start Guide.