Nutch: extract only PDF files

Posted by kxxlusnw on 2021-06-03 in Hadoop


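Is there a way to apply one urlfilter at crawl depths 1-5 and a different urlfilter from depth 5 onward? I need to extract PDF files, which will only appear beyond a given depth (this is just an experiment).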
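One way to picture the depth-dependent filtering asked about here is sketched below. Nutch's `URLFilter` interface receives only the URL string, so the `depth` argument, the `DepthAwareFilter` class, and the two regular expressions are all hypothetical and not part of Nutch; the sketch only borrows the convention of returning the URL when it is accepted and `null` when it is rejected. It also assumes the switch happens at depth 5 itself, which the question leaves ambiguous.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): picks a rule set based on crawl depth. */
class DepthAwareFilter {
    private final int switchDepth;   // first depth at which the "deep" rule applies
    private final Pattern shallow;   // rule used below switchDepth
    private final Pattern deep;      // rule used at switchDepth and beyond

    DepthAwareFilter(int switchDepth, String shallowRegex, String deepRegex) {
        this.switchDepth = switchDepth;
        this.shallow = Pattern.compile(shallowRegex);
        this.deep = Pattern.compile(deepRegex);
    }

    /** Returns the URL if it is accepted at this depth, or null if it is rejected. */
    String filter(String url, int depth) {
        Pattern rule = depth < switchDepth ? shallow : deep;
        return rule.matcher(url).find() ? url : null;
    }

    public static void main(String[] args) {
        // Assumed rules: follow any http(s) link at first, then keep only PDFs.
        DepthAwareFilter f = new DepthAwareFilter(5, "^https?://", "(?i)\\.pdf$");
        System.out.println(f.filter("http://example.com/index.html", 2));
        System.out.println(f.filter("http://example.com/index.html", 6));
        System.out.println(f.filter("http://example.com/report.pdf", 6));
    }
}
```

In a real crawl the depth would have to come from somewhere else, for example by running each crawl round separately and swapping the filter rules between rounds; that is an assumption and has not been tested here.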
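The PDF files are stored in binary form in the crawl/segments folder. I would like to extract these PDF files and store them all in one folder. I have been able to write a Java program that identifies the PDF files, but I cannot figure out how to produce a PDF file whose contents keep the same fonts, pages, images, and so on.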
1. Perform the crawl
2. Merge the segment data
3. Run makepdf.java
This only identifies the PDF files:

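import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class makepdf {
  public static void main(String[] args) throws IOException {
    // Merged segment produced by the crawl
    String uri = "/usr/local/nutch/framework/apache-nutch-1.6/merged572/20130407131335";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    // Raw fetched content is stored under <segment>/content/part-00000/data
    Path path = new Path(uri, Content.DIR_NAME + "/part-00000/data");

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();            // URL of the fetched document
      Content content = new Content();  // raw bytes plus metadata
      while (reader.next(key, content)) {
        // Constant-first comparison avoids an NPE if the content type is missing
        if ("application/pdf".equalsIgnoreCase(content.getContentType())) {
          //System.out.write( content.getContent(), 0, content.getContent().length );
          System.out.println(key);
        }
      }
    } finally {
      // Close the reader even if reading fails, then release the file system
      IOUtils.closeStream(reader);
      fs.close();
    }
  }
}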
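
Since `content.getContent()` already holds the fetched bytes, one possible way to get the PDFs out is to write those bytes to disk unchanged; a PDF copied byte for byte keeps its fonts, pages, and images, so nothing has to be rebuilt. The sketch below is a minimal, self-contained illustration of that idea. The class `PdfDump`, its methods, and the file-naming scheme are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: saves raw PDF bytes into a single output folder. */
class PdfDump {

    /** Derives a file name from the URL: a hash prefix plus the last path segment. */
    static String fileNameFor(String url) {
        String last = url.substring(url.lastIndexOf('/') + 1);
        // Drop any query string or fragment, then replace characters unsafe in file names.
        last = last.replaceAll("[?#].*$", "").replaceAll("[^A-Za-z0-9._-]", "_");
        if (last.isEmpty()) {
            last = "document";
        }
        if (!last.toLowerCase().endsWith(".pdf")) {
            last += ".pdf";
        }
        // The hash prefix keeps same-named files from different URLs apart.
        return Integer.toHexString(url.hashCode()) + "-" + last;
    }

    /** Writes the bytes unchanged and returns the path of the created file. */
    static Path save(Path outDir, String url, byte[] pdfBytes) throws IOException {
        Files.createDirectories(outDir);
        return Files.write(outDir.resolve(fileNameFor(url)), pdfBytes);
    }
}
```

Inside the loop above, the call would go where the key is printed, for example `PdfDump.save(outDir, key.toString(), content.getContent())`, with `outDir` being whatever `java.nio.file.Path` the files should land in. Note that this `Path` is `java.nio.file.Path`, not Hadoop's `org.apache.hadoop.fs.Path`, so one of the two would need to be written with its full package name in the same source file.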

hadoop apache nutch web-crawler search-engine

Source: https://stackoverflow.com/questions/15853628/nutch-to-extract-only-pdf-files