Nutch 2.1 with Cassandra backend: error on generate

xeufq47z, posted on 2021-06-14 in Cassandra

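I chose Cassandra as the backend and started playing with Nutch.
With a small subset of the DMOZ URLs (~50k), everything (inject, generate, fetch) ran fine.
However, after I injected the whole DMOZ URL set (~3.5M) and tried to generate a fetch list, I got the following error, which is reproducible on another system: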

~/software/nutch_dmoz/local$ ./bin/nutch generate -topN 1000
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topN: 1000
GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:241)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:249)

logs/hadoop.log:

2013-04-25 17:58:07,986 INFO  crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: starting
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: filtering: true
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: topN: 1000
2013-04-25 17:58:08,570 INFO  connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-04-25 17:58:08,660 INFO  service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-04-25 17:58:09,029 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 INFO  plugin.PluginRepository - Plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - Registered Plugins:
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository - Registered Extension-Points:
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Parse Filter (org.apache.nutch.parse.ParseFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-04-25 17:58:09,582 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-04-25 17:58:11,046 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-04-25 18:01:02,936 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-25 18:01:02,936 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.ArrayIndexOutOfBoundsException
2013-04-25 18:01:03,412 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

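As far as I can tell, I have not run out of disk space: the /tmp partition has 250 GB free, and the partition holding Cassandra has 2.5 TB free. Is there a way to increase the verbosity? I am also puzzled that the ArrayIndexOutOfBoundsException does not report the index it tried to access; it carries no message at all. The keyspace webpage exists, and I can access it with cassandra-cli. Here is the output of readdb -stats: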

~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable: 
min score:  55.0
retry 0:    3576393
jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score:  1.0
TOTAL urls: 3576393
status 0 (null):    3576393
avg score:  1.0
WebTable statistics: done
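
The bare `java.lang.ArrayIndexOutOfBoundsException` line in hadoop.log is what Java prints for an exception object that has no detail message. The following is a minimal sketch of that behaviour only; the class name `EmptyMessageDemo` and the helper `describe` are hypothetical and are not taken from Nutch, Gora, or Hadoop. It does not identify where the exception is thrown in the generate job. One possible explanation, which is an assumption and not confirmed by the log, is that the HotSpot JVM replaced a frequently thrown exception with a preallocated one that has neither message nor stack trace; running the JVM with `-XX:-OmitStackTraceInFastThrow` disables that optimization.

```java
public class EmptyMessageDemo {

    // Hypothetical helper: renders a throwable the way Throwable.toString() does,
    // i.e. the class name, followed by ": <message>" only when a message exists.
    static String describe(Throwable t) {
        String message = t.getMessage();
        return message == null
                ? t.getClass().getName()
                : t.getClass().getName() + ": " + message;
    }

    public static void main(String[] args) {
        // Constructed without arguments: no detail message, so only the class name is shown.
        Throwable bare = new ArrayIndexOutOfBoundsException();

        // Constructed with an explicit message: the message follows the class name.
        Throwable detailed = new ArrayIndexOutOfBoundsException("index 7, length 3");

        System.out.println(describe(bare));
        System.out.println(describe(detailed));
    }
}
```

If the missing message really comes from this JVM optimization, rerunning the job with the flag above should produce a log entry with a full stack trace; if it does not, the exception was most likely created without a message by the code that threw it.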
