Java: How do I find blank pages in a PDF using PDFBox?

olqngx59 posted on 2023-01-11 in Java

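Here is the challenge I'm currently facing.
I have a lot of PDFs, and I need to remove the blank pages from them so that only pages with content (text or images) remain.
The problem is that these PDFs are scanned documents.
So the blank pages carry some dirt left behind by the scanner.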


w8ntj3qf1#

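I did some research and ended up with this code, which checks whether 99% of the page's pixels are white or light gray. I need the gray tolerance because scanned documents are sometimes not pure white.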

private static boolean isBlank(PDPage pdfPage) throws IOException {
    // PDFBox 1.x: render the page to an image
    BufferedImage bufferedImage = pdfPage.convertToImage();
    long count = 0;
    int height = bufferedImage.getHeight();
    int width = bufferedImage.getWidth();
    // the page counts as blank when at least 99% of its pixels are white or light gray
    double areaFactor = (width * height) * 0.99;

    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            Color c = new Color(bufferedImage.getRGB(x, y));
            // count pure gray pixels (R == G == B) that are light gray or white
            if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue()
                    && c.getRed() >= 248) {
                count++;
            }
        }
    }

    return count >= areaFactor;
}
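
One thing to watch with the check above: it only counts pixels where red, green and blue are exactly equal, so a scan with a slight color cast (yellowish paper, for example) may never reach 99%. Below is a minimal sketch of a variant that compares each channel against the threshold separately. The class name `BlankPageHeuristic`, the method names and the parameters are made up for illustration, and it works on a plain `BufferedImage`, so it does not depend on any PDFBox version.

```java
import java.awt.image.BufferedImage;

/** Sketch: fraction-based blank check that tolerates a slight color cast. */
final class BlankPageHeuristic {

    /** Fraction of pixels whose red, green and blue channels are all >= minChannel. */
    static double lightPixelRatio(BufferedImage image, int minChannel) {
        int width = image.getWidth();
        int height = image.getHeight();
        long light = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // near-white, but the three channels do not have to be equal
                if (r >= minChannel && g >= minChannel && b >= minChannel) {
                    light++;
                }
            }
        }
        return (double) light / ((long) width * height);
    }

    /** Blank when at least minRatio of the pixels are light (e.g. 248 and 0.99). */
    static boolean isBlank(BufferedImage image, int minChannel, double minRatio) {
        return lightPixelRatio(image, minChannel) >= minRatio;
    }
}
```

Calling `BlankPageHeuristic.isBlank(image, 248, 0.99)` uses the same two numbers as the answer. Both are guesses that probably need tuning against your own scans.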

kb5ga3dv2#

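@Shoyo's code works for PDFBox versions < 2.0. Not much has changed, but for future readers, here is the code for **PDFBox 2.0+** to make your life easier.
In the main method (main here means wherever you load the PDF into a PDDocument):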

try (PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/tetml_ct_access/C.pdf"))) {
    PDFRenderer renderedDoc = new PDFRenderer(document);
    for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
        if (isBlank(renderedDoc.renderImage(pageNumber))) {
            // the parentheses matter: without them, string concatenation prints "01", "11", ...
            System.out.println("Blank Page Number : " + (pageNumber + 1));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

The isBlank method now simply takes a BufferedImage:

private static boolean isBlank(BufferedImage bufferedImage) {
    long count = 0;
    int height = bufferedImage.getHeight();
    int width = bufferedImage.getWidth();
    // the page counts as blank when at least 99% of its pixels are white or light gray
    double areaFactor = (width * height) * 0.99;

    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            Color c = new Color(bufferedImage.getRGB(x, y));
            // count pure gray pixels (R == G == B) that are light gray or white
            if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue() && c.getRed() >= 248) {
                count++;
            }
        }
    }
    return count >= areaFactor;
}
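
The pixel scan always visits every pixel, even on pages that are clearly not blank. If speed matters, one option is to count the dark pixels instead and stop as soon as they exceed the allowed budget. This is only a sketch: `EarlyExitBlankCheck` and its parameters are hypothetical names, and it treats a pixel as dark when any channel falls below the threshold, which is slightly looser than the exact-gray test above.

```java
import java.awt.image.BufferedImage;

/** Sketch: same idea as the 99% check, but bails out early on non-blank pages. */
final class EarlyExitBlankCheck {

    /**
     * Returns true when at most maxDarkRatio of the pixels (e.g. 0.01)
     * have a channel below minChannel (e.g. 248).
     */
    static boolean isBlank(BufferedImage image, int minChannel, double maxDarkRatio) {
        int width = image.getWidth();
        int height = image.getHeight();
        // number of dark pixels we tolerate before calling the page non-blank
        long darkBudget = (long) (maxDarkRatio * width * height);
        long dark = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < minChannel || g < minChannel || b < minChannel) {
                    if (++dark > darkBudget) {
                        return false; // budget exceeded, no need to scan the rest
                    }
                }
            }
        }
        return true;
    }
}
```

On a page full of text the loop usually returns after the first few rows. A truly blank page still costs a full scan.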

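All credit goes to @Shoyo.
Update:

Some PDF files contain pages that say **"This Page was Intentionally Left Blank"**, and the code above treats those as blank. If that is what you need, feel free to use the code above. My requirement, however, was only to filter out pages that are completely blank (no images and no fonts at all). So I ended up using the following code (it also runs faster :P):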

public static void main(String[] args) {
    try (PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/CTP2040.pdf"))) {
        PDPageTree allPages = document.getPages();
        int pageNumber = 1;
        for (PDPage page : allPages) {
            PDResources resources = page.getResources();
            // a page without a resource dictionary has neither XObjects nor fonts
            boolean noXObjects = resources == null || !resources.getXObjectNames().iterator().hasNext();
            boolean noFonts = resources == null || !resources.getFontNames().iterator().hasNext();
            if (noXObjects && noFonts) {
                System.out.println(pageNumber);
            }
            pageNumber++;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

This prints the page numbers of the pages that are completely blank.

Hope this helps someone! :)

kd3sttzy3#

http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;

public class RemoveBlankPageFromPDF {

    // content size (in bytes) below which a page is considered blank;
    // can be much higher or lower depending on what is considered a blank page
    public static final int BLANK_THRESHOLD = 160;

    public static void removeBlankPdfPages(String source, String destination)
        throws IOException, DocumentException
    {
        PdfReader r = null;
        RandomAccessSourceFactory rasf = null;
        RandomAccessFileOrArray raf = null;
        Document document = null;
        PdfCopy writer = null;

        try {
            r = new PdfReader(source);
            // deprecated
            //    RandomAccessFileOrArray raf
            //           = new RandomAccessFileOrArray(pdfSourceFile);
            // itext 5.4.1
            rasf = new RandomAccessSourceFactory();
            raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
            document = new Document(r.getPageSizeWithRotation(1));
            writer = new PdfCopy(document, new FileOutputStream(destination));
            document.open();
            PdfImportedPage page = null;

            for (int i=1; i<=r.getNumberOfPages(); i++) {
                // first check, examine the resource dictionary for /Font or
                // /XObject keys.  If either are present -> not blank.
                PdfDictionary pageDict = r.getPageN(i);
                PdfDictionary resDict = (PdfDictionary) pageDict.get( PdfName.RESOURCES );
                boolean noFontsOrImages = true;
                if (resDict != null) {
                  noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
                                    resDict.get( PdfName.XOBJECT ) == null;
                }
                System.out.println(i + " noFontsOrImages " + noFontsOrImages);

                if (!noFontsOrImages) {
                    byte bContent [] = r.getPageContent(i,raf);
                    ByteArrayOutputStream bs = new ByteArrayOutputStream();
                    bs.write(bContent);
                    System.out.println
                      (i + " " + bs.size() + " > BLANK_THRESHOLD " +  (bs.size() > BLANK_THRESHOLD));
                    if (bs.size() > BLANK_THRESHOLD) {
                        page = writer.getImportedPage(r, i);
                        writer.addPage(page);
                    }
                }
            }
        }
        finally {
            if (document != null) document.close();
            if (writer != null) writer.close();
            if (raf != null) raf.close();
            if (r != null) r.close();
        }
    }

    public static void main (String ... args) throws Exception {
        removeBlankPdfPages
            ("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
    }
}

ffdz8vbo4#

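@Pramesh Bajracharya, your solution for finding blank pages in a PDF document is spot on!
If the requirement is to remove the blank pages, the same code can be extended as follows: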

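List<Integer> blankPageList = new ArrayList<Integer>();
for (PDPage page : allPages) {
    Iterable<COSName> xObjects = page.getResources().getXObjectNames();
    Iterable<COSName> fonts = page.getResources().getFontNames();
    // condition to determine whether the page is blank
    if (xObjects.spliterator().getExactSizeIfKnown() == 0 && fonts.spliterator().getExactSizeIfKnown() == 0) {
        blankPageList.add(pageNumber);
    }
    pageNumber++;
}

// remove the blank pages from the pdf document using the blank page numbers list;
// pageNumber is 1-based while removePage(int) takes a 0-based index, and removing
// from the end keeps the remaining indices valid
for (int k = blankPageList.size() - 1; k >= 0; k--) {
    document.removePage(blankPageList.get(k) - 1);
}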
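
A note on removing by page number: every removal shifts the pages behind it, so the collected numbers only stay valid if you remove from the highest number down. The numbers collected here are also 1-based, while index-based removal is 0-based. Here is a small sketch of that bookkeeping, using a plain `List` as a stand-in for the page tree. `RemoveByPageNumber` and `removePages` are made-up names, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch: remove items by 1-based page number without index drift. */
final class RemoveByPageNumber {

    /** Removes the given 1-based page numbers from pages, highest number first. */
    static <T> void removePages(List<T> pages, List<Integer> oneBasedPageNumbers) {
        List<Integer> sorted = new ArrayList<>(oneBasedPageNumbers);
        sorted.sort(Collections.reverseOrder());
        for (int pageNumber : sorted) {
            // int argument selects List.remove(int index); convert to 0-based
            pages.remove(pageNumber - 1);
        }
    }
}
```

With pages `[p1, blank2, p3, blank4, p5]` and the numbers `[2, 4]`, this removes page 4 first and then page 2. Going in ascending order would remove page 2 first, and "page 4" would then point at what used to be page 5.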
