How to create a Kudu table from a PySpark DataFrame

Posted by e3bfsja2 on 2023-01-16 in Spark


I am trying to write data from PySpark to a non-existing Kudu table in a straightforward way:

df.write.format('org.apache.kudu.spark.kudu') \
        .option('kudu.master', kudu_master) \
        .option('kudu.table', kudu_table) \
        .mode("Append") \
        .save()

But I get an exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o92.save.
: org.apache.kudu.client.NonRecoverableException: the table does not exist: table_name: "kudu_table"

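I expected the table to be created automatically, as it is with other database types. Am I missing something, or do Kudu tables need to be created beforehand?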

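After some searching, I tried calling the underlying functions directly. I can create the KuduContext, but to create the table I have to wrap all the required objects myself, e.g. the schema, the schema columns, and so on. For some reason, there is not much information about this on the internet.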

kc = sc._jvm.org.apache.kudu.spark.kudu.KuduContext(kudu_master, sc._jsc.sc()) # working
print(kc.tableExists("test_table")) #working
kc.createTable("test_table", sc._jvm.org.apache.kudu.Schema(data.schema), sc._jvm.org.apache.kudu.client.CreateTableOptions().addHashPartitions(list("myKey"), 3)) #not working
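
For reference, two things in the last call look suspect: `list("myKey")` splits the string into single characters (`['m', 'y', 'K', 'e', 'y']`), and `data.schema` is a Python `StructType` rather than a JVM object. Below is a minimal sketch of how the call could be made through py4j instead. It assumes the kudu-spark overload `createTable(tableName, schema: StructType, keys: Seq[String], options)` from the Kudu Spark integration docs, and it uses the internal PySpark helpers `df._jdf` and `PythonUtils.toSeq`. The function name `create_kudu_table` and its parameters are hypothetical, and the code has not been verified against a live Kudu cluster.

```python
def create_kudu_table(spark, df, kudu_master, table_name, key_columns, num_buckets=3):
    """Create a Kudu table from df's schema unless it already exists.

    Returns True if the table was created, False if it was already there.
    Hypothetical helper: relies on PySpark internals (_jvm, _jsc, _jdf).
    """
    sc = spark.sparkContext
    jvm = sc._jvm

    # Same constructor call as in the question: KuduContext(master, SparkContext).
    kc = jvm.org.apache.kudu.spark.kudu.KuduContext(kudu_master, sc._jsc.sc())
    if kc.tableExists(table_name):
        return False

    # Build a java.util.List<String> of key column names explicitly.
    java_keys = jvm.java.util.ArrayList()
    for name in key_columns:
        java_keys.add(name)

    # Hash-partition on the key columns, as the question intended.
    options = jvm.org.apache.kudu.client.CreateTableOptions() \
        .addHashPartitions(java_keys, num_buckets)

    # Assumed overload: createTable(String, StructType, Seq[String], CreateTableOptions).
    # The Scala Seq is built with PySpark's own PythonUtils helper.
    scala_keys = jvm.org.apache.spark.api.python.PythonUtils.toSeq(java_keys)
    kc.createTable(table_name, df._jdf.schema(), scala_keys, options)
    return True
```

It would be called as `create_kudu_table(spark, df, kudu_master, "test_table", ["myKey"])` before the `df.write ... .save()` shown earlier. Kudu primary-key columns must be non-nullable, so those columns may need to be marked non-nullable in the DataFrame schema first. On a single-node cluster, `.setNumReplicas(1)` on the options may also be needed.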

pyspark

Source: https://stackoverflow.com/questions/74142204/how-to-create-kudu-table-from-pyspark-dataframe