I am trying to deploy Nutch 2.3 + Elasticsearch 1.4 + HBase 0.94 on Ubuntu 14.04. When I try to start crawling, I run:
$NUTCH_ROOT/runtime/local/bin/nutch inject urls
and I get:
InjectorJob: starting at 2017-10-12 19:27:48
InjectorJob: Injecting urlDir: urls
The process then stays like this for hours.
How can I find out what is happening?
Configuration files:
nutch-site.xml
<configuration>
<property>
<name>http.agent.name</name>
<value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
</property>
<property>
<name>http.robots.agents</name>
<value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>plugin.includes</name>
<!-- do **NOT** enable the parse-html plugin if you want proper HTML parsing. Use something like parse-tika! -->
<value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value> <!-- do not leave the seeded domains (optional) -->
</property>
<property>
<name>elastic.host</name>
<value>localhost</value> <!-- where is ElasticSearch listening -->
</property>
</configuration>
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>/home/kike/RIWS/hbase-0.94.14/</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
</configuration>
Log files:
HBase master log
2017-10-12 19:27:49,593 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47778
2017-10-12 19:27:49,596 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47778
2017-10-12 19:27:49,609 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0017 with negotiated timeout 40000 for client /127.0.0.1:47778
2017-10-12 19:31:11,092 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.99 MB, free=239.7 MB, max=241.69 MB, blocks=2, accesses=18, hits=16, hitRatio=88,88%, , cachingAccesses=18, cachingHits=16, cachingHitsRatio=88,88%, , evictions=0, evicted=0, evictedPerRun=NaN
2017-10-12 19:31:24,623 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@1646b7c
2017-10-12 19:31:24,630 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 0 catalog row(s) and gc'd 0 unreferenced parent region(s)
2017-10-12 19:32:13,832 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x15f11684f3f0017
2017-10-12 19:32:13,849 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:47778 which had sessionid 0x15f11684f3f0017
2017-10-12 19:32:14,852 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47817
2017-10-12 19:32:14,853 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47817
2017-10-12 19:32:14,880 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0018 with negotiated timeout 40000 for client /127.0.0.1:47817
Hadoop log
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: starting at 2017-10-12 19:27:48
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
Edit:
After a while, the Hadoop log shows:
2017-10-12 20:34:59,333 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:133)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:139)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
... 9 more
But if I run jps, I can see that HMaster is running:
31672 Jps
20553 HMaster
19739 Elasticsearch
1 Answer
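The error log shows org.apache.hadoop.hbase.MasterNotRunningException, so HBase needs to be set up properly.
Open ~/Desktop/Nutch/hbase/conf/hbase-site.xml and add two nodes: one that tells HBase its rootdir, and one that specifies the data directory for ZooKeeper.
Next, tell Gora to use HBase as its default data store: add or uncomment the gora-hbase dependency in ivy.xml (probably around line 118).
Then test HBase.
A few further checks should also be followed:
First, check version compatibility.
Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set.
Compile Nutch with ant (ant runtime).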
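The answer does not show the two hbase-site.xml nodes it refers to, so the fragment below is only a sketch of what they might look like for a standalone HBase. The property names hbase.rootdir and hbase.zookeeper.property.dataDir are standard HBase settings; both directory paths are hypothetical placeholders and should be replaced with directories that exist and are writable on your machine.

```xml
<configuration>
  <!-- Where HBase stores its data. The file:// path is a hypothetical placeholder. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/kike/RIWS/hbase-data</value>
  </property>
  <!-- Where the ZooKeeper instance managed by HBase keeps its data. Hypothetical placeholder path. -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/kike/RIWS/zookeeper-data</value>
  </property>
  <!-- Standalone mode, as in the question's existing configuration. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
```

After changing this file, HBase would need to be restarted before running the inject command again.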