I want to write Scala code with Spark that fetches a DataFrame from a Hive server -
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import scala.util.Properties
import org.apache.spark.sql.SparkSession
val configuration = new Configuration
configuration.set("hadoop.security.authentication", "Kerberos")
Properties.setProp("java.security.krb5.conf", krb5LocationInMySystem)
UserGroupInformation.setConfiguration(configuration)
UserGroupInformation.loginUserFromKeytab(principal,keytabLocation)
val spSession = SparkSession.builder()
  .config("spark.master", "local")
  .config("spark.sql.warehouse.dir", "file:/Users/username/IdeaProjects/project_name/spark-warehouse/")
  .enableHiveSupport()
  .getOrCreate()
spSession.read.format("jdbc")
  .option("url", "jdbc:hive2://host:port/default;principal=hive/host@realm.com")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "tablename")
  .load()
  .show()
I get output like this:
column1|column2|column3....
(that is all the output, nothing else)
When it runs, the program first waits for a while after printing:
Will try to open client transport with JDBC Uri:(url)
Code generated in 159.970292 ms
then a few more lines... and then:
will try to open client transport with JDBC Uri:(url)
INFO JDBCRDD: closed connection
and it gives me an empty table.
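As a sanity check, the same query could be run over plain JDBC outside Spark to see whether HiveServer2 itself returns any rows; a minimal sketch, reusing the placeholder URL and table name from above and assuming the Kerberos login shown earlier has already happened:

import java.sql.DriverManager

// Plain-JDBC check: does HiveServer2 itself return rows for this table?
// URL, host, port and table name are the same placeholders as above.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://host:port/default;principal=hive/host@realm.com")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM tablename LIMIT 10")
while (rs.next()) {
  println(rs.getString(1)) // first column of each returned row
}
rs.close()
stmt.close()
conn.close()

If this returns rows, the problem is specific to the Spark JDBC path; if it is also empty, the table itself (or the connection's database) is the issue.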
I have already looked at -
Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"
Hive creates an empty table even though there are many files
Hive table returns an empty result set for all queries
but either they do not address what I am after, or I could not understand what they are saying. For the second link, I tried it, but could not find out how to use setInputPathFilter in Scala.
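For reference, a minimal sketch of what setInputPathFilter can look like in Scala, using the old mapred API; the filter class here is hypothetical and simply skips "hidden" files (names starting with "." or "_"):

import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Hypothetical filter: accept only files that are not "hidden"
// (names starting with "." or "_" are skipped).
class VisibleFilesOnly extends PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    !name.startsWith(".") && !name.startsWith("_")
  }
}

val jobConf = new JobConf(configuration) // the Hadoop Configuration created above
FileInputFormat.setInputPathFilter(jobConf, classOf[VisibleFilesOnly])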
Dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>