Extracting page sizes from PDF in Python

Posted by j8yoct9x on 2023-06-20 in Python

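I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.
I am currently trying out pyPdf, and it does everything I need except provide a way to get page sizes. I understand that I will probably have to iterate, since page sizes can vary within a PDF document. Is there another library or method I can use?
I tried using PIL; some online recipes even use d = Image(imagefilename), but it never reads any of my PDFs. It reads everything else I throw at it, even some things I didn't know PIL could handle.
Any guidance is appreciated. I'm on Windows 7 64-bit with Python 2.5 (because I also do GAE work), but I'm happy to do this on Linux or with a more modern Python.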

python

Source: https://stackoverflow.com/questions/6230752/extracting-page-sizes-from-pdf-in-python

9 Answers

neskvpey1#

This can be done with pypdf:

>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> box = reader.pages[0].mediabox
>>> box
RectangleObject([0, 0, 612, 792])
>>> box.width
Decimal('612')
>>> box.height
Decimal('792')

(pypdf was formerly known as pyPdf / PyPDF2.)

pod7payv2#

Update 2021-07-22: the original answer was not always correct, so I have updated it.
With PyMuPDF:

>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc[0]
>>> print(page.rect.width, page.rect.height)
842.0 595.0
>>> print(page.mediabox.width, page.mediabox.height)
595.0 842.0

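Both mediabox and rect return a Rect, which has the attributes "width" and "height". One difference between them is that mediabox is the same as /MediaBox in the document and does not change when the page is rotated, whereas rect is affected by rotation. In the example above the page is rotated, which is why rect reports 842 x 595 while mediabox reports 595 x 842. For more information about the different boxes in PyMuPDF, see the glossary in its documentation.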
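
The relationship between the two outputs can be modelled in a few lines of plain Python. This is only a sketch of the rule, not PyMuPDF's implementation: the function name displayed_size is made up for illustration, and it assumes the page has no CropBox that differs from the MediaBox.

```python
from typing import Tuple


def displayed_size(media_width: float, media_height: float, rotation: int) -> Tuple[float, float]:
    """Return (width, height) as seen after applying the page rotation.

    `rotation` is the /Rotate value in degrees; PDF expects a multiple of 90.
    """
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Odd multiples of 90 (90, 270, -90, ...) swap the two sides.
    if (rotation // 90) % 2:
        return media_height, media_width
    return media_width, media_height


print(displayed_size(595.0, 842.0, 90))   # (842.0, 595.0)
print(displayed_size(595.0, 842.0, 180))  # (595.0, 842.0)
```

With the A4 values from the session above, a 90 degree rotation reproduces the swapped numbers that rect reports.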

velaa5lx3#

With pdfrw:

>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']

The lengths are given in points (1 pt = 1/72 inch). The format is [x0, y0, x1, y1] (thanks, mara004!).
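
Because the box comes back as four strings, a small helper is needed to turn it into a width and a height. A minimal sketch, assuming the [x0, y0, x1, y1] layout described above; the name box_size is hypothetical and not part of pdfrw:

```python
from typing import Sequence, Tuple


def box_size(box: Sequence) -> Tuple[float, float]:
    """Turn an [x0, y0, x1, y1] box into (width, height) in points."""
    # float() accepts both numbers and numeric strings such as '595.2756'.
    x0, y0, x1, y1 = (float(value) for value in box)
    return abs(x1 - x0), abs(y1 - y0)


width, height = box_size(['0', '0', '595.2756', '841.8898'])
print(width, height)  # 595.2756 841.8898
```

Subtracting the lower-left corner matters when a box does not start at the origin.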

e4eetjau4#

With pdfminer for Python 3.x (pdfminer.six); not tried on Python 2.7:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

pdfPath = "example.pdf"  # placeholder path, replace with your own file
pageSizesList = []  # will end up holding one [x0, y0, x1, y1] box per page
with open(pdfPath, "rb") as fp:
    doc = PDFDocument(PDFParser(fp))
    for page in PDFPage.create_pages(doc):
        # the media box is the page size, given as four numbers x0 y0 x1 y1
        print(page.mediabox)
        pageSizesList.append(page.mediabox)
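
Since the question mentions that page sizes can vary within one document, the collected boxes can be summarized to see which sizes occur and how often. The sketch below is plain Python with no PDF library involved; size_summary is a hypothetical helper, the sample boxes are invented, and rounding to whole points is an assumption made so that near-identical sizes group together.

```python
from collections import Counter
from typing import Iterable, Sequence


def size_summary(boxes: Iterable[Sequence[float]]) -> Counter:
    """Count how many pages share each (width, height), rounded to whole points."""
    counts: Counter = Counter()
    for x0, y0, x1, y1 in boxes:
        counts[(round(abs(x1 - x0)), round(abs(y1 - y0)))] += 1
    return counts


# Invented three-page document: two US Letter pages and one A4 page.
example_boxes = [[0, 0, 612, 792], [0, 0, 612, 792], [0, 0, 595.28, 841.89]]
summary = size_summary(example_boxes)
print(summary.most_common())  # [((612, 792), 2), ((595, 842), 1)]
print(len(summary) > 1)       # True: the document mixes page sizes
```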

bn31dyow5#

Using pikepdf:

import pikepdf

# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]

if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]

# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers 
# disregard this option anyway

# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)

# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]

# obtain the dimensions of the box
width  = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])

rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate

# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to 
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width

# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)
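
The unit conversion mentioned in the closing comments can be written out as follows. This is a small sketch using only the 1/72 inch and 2.54 cm figures given above; the function and constant names are made up for illustration.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_inches(points: float) -> float:
    """Convert PDF points to inches (1 pt = 1/72 in)."""
    return points / POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    """Convert PDF points to millimetres (1 in = 25.4 mm)."""
    return points / POINTS_PER_INCH * MM_PER_INCH


print(points_to_inches(612), points_to_inches(792))                  # 8.5 11.0
print(round(points_to_mm(595.2756)), round(points_to_mm(841.8898)))  # 210 297
```

The two example lines correspond to US Letter (8.5 x 11 in) and A4 (210 x 297 mm).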

zf2sa74q6#

Disclaimer: I am the author of borb, the library used in this answer.

#!chapter_005/src/snippet_002.py
import typing
from borb.pdf import Document
from borb.pdf import PDF

def main():

    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # get the width/height
    w = doc.get_page(0).get_page_info().get_width()
    h = doc.get_page(0).get_page_info().get_height()

    # do something with these dimensions
    # TODO

if __name__ == "__main__":
    main()

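The code starts by loading the PDF with PDF.loads. We then get a Page (you can change this code to print the dimensions of every Page rather than just Page 0). From the Page we get its PageInfo, which contains the width and height.
You can install borb using pip: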

pip install borb

You can also download it from source.
If you need more examples, check out the examples repository.

9rnv2umw7#

Using pypdfium2:

import pypdfium2 as pdfium

PAGEINDEX = 0  # the first page
FILEPATH = "/path/to/file.pdf"
pdf = pdfium.PdfDocument(FILEPATH)

# option 1
width, height = pdf.get_page_size(PAGEINDEX)

# option 2
page = pdf[PAGEINDEX]
width, height = page.get_size()

# len(pdf) provides the number of pages, so you can iterate through the document

vnzz0bqm8#

Another way is to use popplerqt4:

import popplerqt4

doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()  # size of the first page
h = qsizedoc.height()  # given in pt, 1 pt = 1/72 in
w = qsizedoc.width()

w6mmgewl9#

Working code for Python 3.9 with the PyPDF2 library:

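# Older PyPDF2 releases spell these PdfFileReader, mediaBox, getWidth() and
# getHeight(); those names were removed in PyPDF2 3.0.
from PyPDF2 import PdfReader

reader = PdfReader(r'C:\MyFolder\111.pdf')
box = reader.pages[0].mediabox  # media box of the first page
print(box.width)
print(box.height)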

For all pages:

from PyPDF2 import PdfReader  # named PdfFileReader in older PyPDF2 releases

reader = PdfReader(r'C:\MyFolder\111.pdf')

# enumerate() supplies the zero-based index i for each page
for i, p in enumerate(reader.pages):
    box = p.mediabox
    print(f"i:{i}   page:{i+1}   Width = {box.width}   Height = {box.height}")

input("Press Enter to continue...")
