(New to Apache Spark)
I'm trying to create a small Scala Spark application that reads an Excel file and inserts the data into a database, but I'm getting errors which (I think) are caused by a mismatch between library versions.
Scala v2.12
Spark v3.0
Spark-Excel v0.13.1
The Maven configuration is:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
    <dependency>
        <groupId>com.crealytics</groupId>
        <artifactId>spark-excel_2.12</artifactId>
        <version>0.13.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.11.1</version>
    </dependency>
</dependencies>
main.scala:
val spark = SparkSession
  .builder
  .appName("SparkApp")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
  .getOrCreate()

val path = "file_path"

val excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", path)
  .option("addColorColumns", "false")
  .load()

println(s"excel count is ${excel.count}")
The error is:
Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> false, location -> file_path, useheader -> true, inferschema -> false, addcolorcolumns -> false) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
    at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:38)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:28)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
    at main.scala.Main$.main(Main.scala:42)
    at main.scala.Main.main(Main.scala)
This only happens when I try to read Excel files, i.e. when I use the spark-excel library; CSV and TSV files work fine.
2 Answers
Answer 1
I know this doesn't directly answer your question, but it may still help you solve your problem.
You can sidestep the spark-excel library by using Python's pandas package:
Read the Excel file with pandas in Python
Convert the pandas DataFrame to a Spark DataFrame
Save it as a Parquet file / Hive table from PySpark
Load the data with Scala and Spark (sketched below)
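A minimal Scala sketch of that last step, assuming the Python side has already written the data out as Parquet; the path /tmp/excel_data.parquet is a hypothetical placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("ParquetReader")
  .master("local[*]")
  .getOrCreate()

// Read the Parquet file produced by the pandas/PySpark pipeline.
val df = spark.read.parquet("/tmp/excel_data.parquet") // hypothetical path
df.printSchema()
println(s"row count is ${df.count}")

Parquet keeps the schema with the data, so no format options are needed on the Scala side.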
Answer 2
I think you forgot to pass the file path to load, like spark.read....load("Worktime.xlsx").
Example - see the README for reference.
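Applied to the code in the question, a minimal corrected sketch (assuming spark-excel 0.13.1, which renamed the useHeader option to header; check the README for the options your version supports):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("SparkApp")
  .master("local[*]")
  .getOrCreate()

val path = "file_path" // placeholder path from the question

val excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")                   // 0.13.x name; older releases used useHeader
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(path)                                 // pass the path to load() instead of a "location" option

println(s"excel count is ${excel.count}")

This lines up with the stack trace: the MatchError is thrown from WorkbookReader$.apply, which appears to pattern-match the options map for a path entry; with no argument to load() and no path option, nothing matches.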