I want to load an entire Hive table into Spark memory through a Hive JDBC connection, and I have already added hive-site.xml and hdfs-site.xml to my project. Spark is connected to Hive, because the column names (e.g. role_id) are fetched successfully. But Spark seems to load the column names as the data and throws an exception. Here is my code:
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("jdbc")
  .option("driver", CommonUtils.HIVE_DIRVER)
  .option("url", CommonUtils.HIVE_URL)
  .option("dbtable", "datasource_test.t_leave_map_base")
  .option("header", "true") // note: "header" is a CSV option; the JDBC source ignores it
  .option("user", CommonUtils.HIVE_USER) // was HIVE_PASSWORD in the original, presumably a copy-paste slip
  .option("password", CommonUtils.HIVE_PASSWORD)
  .option("fetchsize", "20")
  .load()
df.createOrReplaceTempView("t_leave_map_base") // registerTempTable is deprecated since Spark 2.0
df.persist(StorageLevel.MEMORY_ONLY)
df.show()
df
I get this error:
java.lang.NumberFormatException: For input string: "t_leave_map_base.role_id"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[na:1.8.0_25]
	at java.lang.Long.parseLong(Long.java:589) ~[na:1.8.0_25]
	at java.lang.Long.valueOf(Long.java:803) ~[na:1.8.0_25]
	at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:366) ~[hive-jdbc-1.1.0-cdh5.12.0.jar:1.1.0-cdh5.12.0]
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:409) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:408) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) ~[na:na]
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) ~[spark-sql_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.scheduler.Task.run(Task.scala:108) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) ~[spark-core_2.11-2.2.0.cloudera2.jar:2.2.0.cloudera2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_25]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_25]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]
Debugging the project shows that every fetched row contains the column names themselves rather than the table data.
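For context, this symptom matches Spark's default JdbcDialect, which quotes identifiers with double quotes, so the pushed-down query becomes something like SELECT "role_id" FROM ..., and HiveQL parses a double-quoted token as a string literal, returning the column name itself for every row. A minimal workaround sketch, assuming Spark 2.x and that the JDBC route is kept (HiveDialect is a name introduced here, not an existing class):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks so HiveQL treats them as columns,
// not as string literals (the default dialect uses double quotes).
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register once, before calling spark.read.format("jdbc")...load()
JdbcDialects.registerDialect(HiveDialect)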
I would like to know: does Spark SQL support loading a Hive table this way?
2 Answers
2vuwiymt 1#
You can try a simple exercise to check whether spark.sql is fetching data from Hive. Generally, as I understand it, JDBC is not the way to connect to Hive from Spark.
Configure the spark-env.sh parameters to make sure Spark talks to Hive using the metastore information.
Open a spark-shell on your machine.
In the spark-shell, run statements like the ones below.
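The statement itself appears to have been lost in formatting; a plausible reconstruction, assuming the table from the question and a metastore-backed session:

spark.sql("show databases").show()
spark.sql("select * from datasource_test.t_leave_map_base limit 10").show()

If these return data, the metastore route works and the JDBC options can be dropped entirely.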
pieyvz9o 2#
I see all sorts of claims mixed together in this question.
Spark does not use JDBC to access Hive; Hive access lives in Spark's built-in Hadoop/HDFS domain.
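For illustration, a minimal sketch of that native route, assuming hive-site.xml is on the application classpath (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

// With Hive support enabled, Spark reads table metadata from the
// metastore (hive-site.xml) and the data directly from HDFS; no JDBC.
val spark = SparkSession.builder()
  .appName("hive-native-read")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("datasource_test.t_leave_map_base")
df.show()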
Spark may use JDBC against Impala to access Kudu tables, because Kudu's security model is too coarse-grained. You could also go through Impala to reach Hive, but why would you do that?