Nutch 2.2.1 PDF parsing

Posted by 2hh7jdfx on 2021-06-03 in Hadoop

I have been trying to parse PDF documents with Nutch and Tika (PDFBox 1.8.3).
I have tried 5 PDFs, each with a command like this:

danny@Ubuntu-64:~/Nutch$ ./bin/nutch parsechecker file:///home/danny/Documents/DOC-443.pdf

The only output I get is:

fetching: file:///home/danny/Documents/DOC-443.pdf
parsing: file:///home/danny/Documents/DOC-443.pdf
contentType: application/pdf
signature: 662453bc32a42af13cb4d5844d978cfc
---------
Url
---------------
file:///home/danny/Documents/DOC-443.pdf
---------
Metadata
---------
xmpTPg:NPages :     0
Content-Type :  application/pdf

My hadoop.log shows:

2013-12-20 11:29:41,646 INFO  parse.ParserChecker - fetching: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,174 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-12-20 11:29:42,209 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it  in the parse-plugins.xml file
2013-12-20 11:29:42,518 WARN  pdfparser.PDFParser - Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2d4b2312
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:604)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1224)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1189)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:116)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
2013-12-20 11:29:42,521 WARN  pdfparser.XrefTrailerResolver - Did not found XRef object at specified startxref position 0
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - parsing: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - contentType: application/pdf
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - signature: 662453bc32a42af13cb4d5844d978cfc
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - ---------
Url
---------------
2013-12-20 11:29:42,612 INFO  parse.ParserChecker - ---------
Metadata
---------

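Can anyone spot what is wrong? I have spent the last two days trying to fix this: upgrading and downgrading PDFBox, rebuilding Nutch, and so on. Nothing seems to solve it.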
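
In case the two warnings in the log (the missing `endstream` and the `startxref` position 0) mean the files themselves are truncated or damaged, here is a minimal sketch of a sanity check that does not involve Nutch, Tika, or PDFBox at all. The function names `inspect_pdf_bytes` and `inspect_pdf_file` are made up for illustration, and the checks are byte-level heuristics rather than a real PDF parser, so they can only hint at a problem, not prove one.

```python
import re
from pathlib import Path


def inspect_pdf_bytes(data: bytes) -> dict:
    """Collect a few structural facts about raw PDF bytes (heuristic only)."""
    # Count keywords that open a stream versus "endstream" keywords.
    # The negative lookbehind keeps "endstream" from counting as an opener.
    stream_opens = len(re.findall(rb"(?<!end)stream\r?\n", data))
    stream_closes = len(re.findall(rb"endstream", data))

    # The last "startxref" entry gives the byte offset of the xref section.
    offsets = re.findall(rb"startxref\s+(\d+)", data)
    startxref = int(offsets[-1]) if offsets else None

    # That offset should land on an "xref" table or on an "N N obj" header
    # (a cross-reference stream). Offset 0 can never be valid, because the
    # file has to start with the %PDF- header.
    points_at_xref = False
    if startxref is not None and 0 < startxref < len(data):
        target = data[startxref:startxref + 32]
        points_at_xref = (
            target.startswith(b"xref")
            or re.match(rb"\d+\s+\d+\s+obj", target) is not None
        )

    return {
        "has_header": b"%PDF-" in data[:1024],
        "has_eof_marker": b"%%EOF" in data[-1024:],
        "stream_opens": stream_opens,
        "stream_closes": stream_closes,
        "startxref": startxref,
        "startxref_points_at_xref": points_at_xref,
    }


def inspect_pdf_file(path: str) -> dict:
    """Read a file from disk and run the same heuristic checks on it."""
    return inspect_pdf_bytes(Path(path).read_bytes())
```

Calling `inspect_pdf_file("/home/danny/Documents/DOC-443.pdf")` returns a dictionary of these checks. If `stream_opens` and `stream_closes` differ, or `startxref_points_at_xref` is false, the file may have been damaged before it ever reached the parser. If every check passes, the problem is more likely on the Nutch/Tika side.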
