I'm new to big data development. I have a use case where data is read from HDFS, processed with Spark, and saved to a MySQL database. The reason for saving to MySQL is that the reporting tool points to MySQL. So I came up with the flow below to implement it. Can someone verify it and suggest any optimizations/changes needed?
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("mode","failfast")
.load("hdfs://localhost:9000/user/testuser/samples.csv")
val resultsdf = df.select("Sample","p16","Age","Race").filter($"Anatomy".like("BOT"))
val prop=new java.util.Properties
prop.setProperty("driver", "com.mysql.cj.jdbc.Driver")
prop.setProperty("user", "root")
prop.setProperty("password", "pw")
val url = "jdbc:mysql://localhost:3306/meta"
df.write.mode(SaveMode.Append).jdbc(url,"sample_metrics",prop)
1 Answer
A change is required in this line:
val resultsdf = ...
You are using the column Anatomy for filtering, but you did not select that column in the select clause. Add that column, otherwise you will end up with an error: AnalysisException, unable to resolve column Anatomy.
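A minimal sketch of the corrected line, assuming the column names from the question; pick one of the two variants, either carry Anatomy through the select and drop it afterwards, or filter before selecting:

// Option 1: include Anatomy in the select so the filter can resolve it,
// then drop it once the filter has been applied.
val resultsdf = df
  .select("Sample", "p16", "Age", "Race", "Anatomy")
  .filter($"Anatomy".like("BOT"))
  .drop("Anatomy")

// Option 2: filter on the full DataFrame first, then select only the
// columns needed for reporting.
val resultsdf = df
  .filter($"Anatomy".like("BOT"))
  .select("Sample", "p16", "Age", "Race")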
Optimizations: you can use additional JDBC properties such as numPartitions and batchsize. You can read about these properties in the Spark JDBC data source documentation.
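A minimal sketch of how those options could be passed on the write, reusing the url and prop from the question; the values 4 and 10000 are illustrative assumptions, not tuned recommendations:

resultsdf.write
  .mode(SaveMode.Append)
  .option("numPartitions", "4")    // max parallel JDBC connections used for the write (illustrative)
  .option("batchsize", "10000")    // rows sent per JDBC batch insert (illustrative)
  .jdbc(url, "sample_metrics", prop)

numPartitions caps how many concurrent connections Spark opens to MySQL, and batchsize controls how many rows are grouped into each batch insert, so both directly affect write throughput and the load on the database.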