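I started playing with Nutch, using Cassandra as the backend.
With a small subset of the DMOZ URLs (~50k), everything (inject, generate, fetch) works fine.
However, after injecting the whole DMOZ URL set (~3.5m) and trying to generate a fetchlist, I get the following error, which I can reproduce on another system: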
~/software/nutch_dmoz/local$ ./bin/nutch generate -topN 1000
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topN: 1000
GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:241)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:249)
logs/hadoop.log:
2013-04-25 17:58:07,986 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: starting
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: filtering: true
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: topN: 1000
2013-04-25 17:58:08,570 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-04-25 17:58:08,660 INFO service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-04-25 17:58:09,029 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 INFO plugin.PluginRepository - Plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Registered Plugins:
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Registered Extension-Points:
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Parse Filter (org.apache.nutch.parse.ParseFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-04-25 17:58:09,582 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-04-25 17:58:09,582 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-04-25 17:58:09,582 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-04-25 17:58:11,046 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-04-25 18:01:02,936 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-25 18:01:02,936 WARN mapred.LocalJobRunner - job_local_0001
java.lang.ArrayIndexOutOfBoundsException
2013-04-25 18:01:03,412 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
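
When digging through a long hadoop.log like this one, it can help to pull out only the WARN/ERROR entries together with any continuation lines (exception names, stack frames) that follow them. Below is a minimal Python sketch of that idea. It assumes the log4j layout seen in the excerpt (date, time with milliseconds, level, logger, message). The function name extract_problems and the regular expression are illustrative only and are not part of Nutch or Hadoop.

```python
import re

# Start of a log entry as laid out in the excerpt, e.g.
# "2013-04-25 18:01:02,936 WARN mapred.LocalJobRunner - job_local_0001".
# This pattern is an assumption based on that excerpt, not a Nutch constant.
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (?P<level>[A-Z]+) "
)


def extract_problems(lines, levels=("WARN", "ERROR")):
    """Return a list of entries; each entry is a list of lines.

    The first line of an entry is the WARN/ERROR log line itself, and the
    remaining lines are the non-timestamped lines that followed it.
    """
    problems = []
    keep = False  # True while we are inside a WARN/ERROR entry
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            # A new timestamped entry starts; decide whether to keep it.
            keep = match.group("level") in levels
            if keep:
                problems.append([line])
        elif keep and line.strip():
            # Continuation line (exception name, stack frame, wrapped text).
            problems[-1].append(line)
    return problems
```

Calling extract_problems(open("logs/hadoop.log")) on the file above would group the bare java.lang.ArrayIndexOutOfBoundsException line with the LocalJobRunner warning that precedes it, which makes it easier to see that no stack trace was logged for it.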
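As far as I can tell, I am not running out of disk space: the /tmp partition has 250G free, and the partition Cassandra lives on has 2.5T free. Is there a way to increase the verbosity? I am also puzzled that the ArrayIndexOutOfBoundsException does not report the index it tried to access; the log shows no message and no stack trace at all. The webpage keyspace exists, and I can access it with cassandra-cli. Here is the output of readdb -stats: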
~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
min score: 55.0
retry 0: 3576393
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score: 1.0
TOTAL urls: 3576393
status 0 (null): 3576393
avg score: 1.0
WebTable statistics: done
min score: 55.0
retry 0: 3576393
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score: 1.0
TOTAL urls: 3576393
status 0 (null): 3576393
avg score: 1.0