I have been trying to parse PDF documents with Nutch and Tika (PDFBox 1.8.3).
I have tried 5 PDFs, each with a command like this:
danny@Ubuntu-64:~/Nutch$ ./bin/nutch parsechecker file:///home/danny/Documents/DOC-443.pdf
The only output I get is:
fetching: file:///home/danny/Documents/DOC-443.pdf
parsing: file:///home/danny/Documents/DOC-443.pdf
contentType: application/pdf
signature: 662453bc32a42af13cb4d5844d978cfc
---------
Url
---------------
file:///home/danny/Documents/DOC-443.pdf
---------
Metadata
---------
xmpTPg:NPages : 0
Content-Type : application/pdf
My hadoop.log shows:
2013-12-20 11:29:41,646 INFO parse.ParserChecker - fetching: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,174 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-12-20 11:29:42,209 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
2013-12-20 11:29:42,518 WARN pdfparser.PDFParser - Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2d4b2312
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:604)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1224)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1189)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:116)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
2013-12-20 11:29:42,521 WARN pdfparser.XrefTrailerResolver - Did not found XRef object at specified startxref position 0
2013-12-20 11:29:42,611 INFO parse.ParserChecker - parsing: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,611 INFO parse.ParserChecker - contentType: application/pdf
2013-12-20 11:29:42,611 INFO parse.ParserChecker - signature: 662453bc32a42af13cb4d5844d978cfc
2013-12-20 11:29:42,611 INFO parse.ParserChecker - ---------
Url
---------------
2013-12-20 11:29:42,612 INFO parse.ParserChecker - ---------
Metadata
---------
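Can anyone spot what is wrong? I have spent two days trying to fix this: upgrading and downgrading PDFBox, rebuilding Nutch, and so on. Nothing seems to resolve it.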
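
One thing the warnings above may point to is that the parser received incomplete content: `expected='endstream' actual=''` reads like the data ended in the middle of a stream, and `startxref position 0` suggests the trailer was never found. The log does not confirm this, so the following is only a minimal sketch for testing that hypothesis. It checks whether the `%%EOF` marker is present in the full file and whether it would survive cutting the content at a size limit. The class name `PdfTailCheck`, the helper `truncate`, and the 65536-byte limit are assumptions made for illustration, not values taken from the log.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class PdfTailCheck {

    /** True if the last `window` bytes of `data` contain the PDF end marker "%%EOF". */
    static boolean hasEofMarker(byte[] data, int window) {
        int start = Math.max(0, data.length - window);
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String tail = new String(data, start, data.length - start, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /** Hypothetical helper: keeps at most `limit` bytes, mimicking a fetch-size cap. */
    static byte[] truncate(byte[] data, int limit) {
        return Arrays.copyOf(data, Math.min(limit, data.length));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PdfTailCheck <file.pdf> [limitBytes]");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        // Assumed limit for illustration only; pass the real value as the second argument.
        int limit = args.length > 1 ? Integer.parseInt(args[1]) : 65536;

        System.out.println("size on disk:           " + pdf.length + " bytes");
        System.out.println("%%EOF in full file:     " + hasEofMarker(pdf, 1024));
        System.out.println("%%EOF after truncation: " + hasEofMarker(truncate(pdf, limit), 1024));
    }
}
```

If the full file has the marker but the truncated copy does not, a content length limit in the Nutch configuration would be worth checking (for the `file://` protocol the property is `file.content.limit`). If the marker is missing even in the full file, the PDF itself is probably damaged.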