这个问题在这里已经有答案了:
如何在没有hive-site.xml的情况下将spark sql连接到远程hive元存储(通过thrift协议)(8个答案)
10个月前关门了。
我有一个在独立的clouderavm上运行的简单程序。我已经在hive中创建了一个托管表,我想在apachespark中读取它,但是没有建立到hive的初始连接。请告知。
我在intellij中运行这个程序,我已经将hive-site.xml从/etc/hive/conf复制到/etc/spark/conf,即使spark作业没有连接到hive metastore
public static void main(String[] args) throws AnalysisException {
String master = "local[*]";
SparkSession sparkSession = SparkSession
.builder().appName(ConnectToHive.class.getName())
.config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
.enableHiveSupport()
.master(master).getOrCreate();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
SQLContext sqlCtx = sparkSession.sqlContext();
HiveContext hiveContext = new HiveContext(sparkSession);
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");
hiveContext.sql("SHOW DATABASES").show();
hiveContext.sql("SHOW TABLES").show();
sparkSession.close();
}
输出如下图所示,在这里可以看到“employee table”,以便查询。因为我是在单机上运行的,所以hivemetastore在本地mysql服务器上。
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql:/127.0.0.1/metastore?createdatabaseifnotexist=true是配置单元metastore的配置
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name data_type comment
id string
firstname string
lastname string
addresses array<struct<street:string,city:string,state:string>>
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Tue Jul 25 06:33:01 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1500989581
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
添加Spark日志
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to:
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to:
17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: default
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: global_temp
17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Process finished with exit code 0
更新
/usr/lib/hive/conf/hive-site.xml不在类路径中,因此它没有读取表,在将它添加到类路径中之后,它工作正常。。。因为我是从intellij跑过来的,所以我有这个问题。。在生产中,spark conf文件夹将链接到hive-site.xml。。。
1条答案
按热度按时间mcdcgff01#
这是一个提示,说明您没有连接到远程配置单元元存储(您已将其设置为mysql),并且xml文件在类路径上不正确。
在进行sparksession之前,可以不使用xml以编程方式完成
如何在sparksql中以编程方式连接到配置单元元存储?