Python: how to get the page number of a bookmark

omhiaaxx posted on 2024-01-05 in Python (4 answers, 173 views)

    from typing import List

    from PyPDF2 import PdfFileReader
    from PyPDF2.generic import Destination


    def get_outlines(pdf_filepath: str) -> List[Destination]:
        """Get the bookmarks of a PDF file."""
        with open(pdf_filepath, "rb") as fp:
            pdf_file_reader = PdfFileReader(fp)
            outlines = pdf_file_reader.getOutlines()
        return outlines


    print(get_outlines("PDF-export-example.pdf"))

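pyPdf.pdf.Destination has many attributes, but I can't find the page number that the bookmark refers to. How can I get the page number of a bookmark?
For example, outlines[1].page.idnum returns a number roughly three times the page number referenced in the PDF document. I assume it refers to an object smaller than a page: running .page.idnum over the whole document outline returns an array of numbers that is not even linearly related to the "real" page-number targets in the PDF, only roughly three times larger.
Update: this question is the same as "split a pdf based on outline", although I don't understand what the author did in his self-answer. It seems too complicated for me to use.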
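
To illustrate why .page.idnum does not match the page number: idnum is the PDF object number of the page object, and object numbers are handed out to every object in the file (content streams, fonts, and so on), not only to pages. Below is a minimal sketch with made-up object numbers, assuming each page object happens to be followed by two other objects. Real files can be laid out quite differently, which is why a lookup table is needed rather than a division by three.

```python
# Made-up object numbers: each page object is followed by two other
# objects (say, a content stream and a font), so page ids go 4, 7, 10, ...
page_object_ids = [4, 7, 10, 13]

# Page numbers come from the position in the page list, not from the id.
id_to_page_number = {
    object_id: position + 1
    for position, object_id in enumerate(page_object_ids)
}

bookmark_target_id = 10  # what a bookmark's .page.idnum might hold
print(id_to_page_number[bookmark_target_id])  # prints 3
```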

7fyelxc5 #1

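As @theta pointed out, "split a pdf based on outline" has the code needed to extract page numbers. If you find it complicated, I copied the part that maps page ids to page numbers and turned it into a function. Here is a working example that prints the page number of bookmark o[0]: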

    from PyPDF2 import PdfFileReader


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        """Walk the page tree and map each node's object id to a 0-based page index."""
        if _result is None:
            _result = {}
        if pages is None:
            # First call: start from the root of the page tree.
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            # Intermediate node: record each kid, then descend into it.
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            # Leaf node: one more page has been seen.
            _num_pages.append(1)
        return _result


    # main
    f = open('document.pdf', 'rb')
    p = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(p)
    o = p.getOutlines()
    pg_num = pg_id_num_map[o[0].page.idnum] + 1  # the map is 0-based
    print(pg_num)
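
The traversal above can be tried without a PDF. Here is a simplified sketch, assuming the page tree is modelled with plain dicts. The `type`, `id`, and `kids` keys are made up for illustration and are not PyPDF2 names. Unlike the function above, it records only leaf pages and skips the intermediate nodes.

```python
def map_page_ids(node, result=None, counter=None):
    """Map the id of every leaf page to its 0-based position in the tree."""
    if result is None:
        result, counter = {}, [0]
    if node["type"] == "Pages":
        # Intermediate node: visit the children in document order.
        for kid in node["kids"]:
            map_page_ids(kid, result, counter)
    else:
        # Leaf page: assign the next page index.
        result[node["id"]] = counter[0]
        counter[0] += 1
    return result


# Hypothetical page tree with one nested /Pages-like node.
tree = {"type": "Pages", "id": 1, "kids": [
    {"type": "Page", "id": 4},
    {"type": "Pages", "id": 2, "kids": [
        {"type": "Page", "id": 9},
        {"type": "Page", "id": 12},
    ]},
    {"type": "Page", "id": 15},
]}

page_index_by_id = map_page_ids(tree)
print(page_index_by_id)
```

Adding 1 to a looked-up value gives the page number a reader would see, as in the `+ 1` of the example above.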

Probably too late for @theta, but it may help others :) By the way, this is my first post on Stack Overflow, so please forgive me if I haven't followed the usual format.

**To expand further:** if you are looking for the exact position of a bookmark on the page, this will make your job easier:

    from PyPDF2 import PdfFileReader
    import PyPDF2 as pyPdf


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        if _result is None:
            _result = {}
        if pages is None:
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            _num_pages.append(1)
        return _result


    def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
        if result is None:
            result = dict()
        if type(outlines) == list:
            # Nested list of bookmarks: recurse into each entry.
            for outline in outlines:
                result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
        elif type(outlines) == pyPdf.pdf.Destination:
            # '/Top' and '/Left' exist only for destination types that define them.
            title = outlines['/Title']
            result[title.split()[0]] = dict(
                title=outlines['/Title'],
                top=outlines['/Top'],
                left=outlines['/Left'],
                page=(pg_id_num_map[outlines.page.idnum] + 1),
            )
        return result


    # main
    pdf_name = 'document.pdf'
    f = open(pdf_name, 'rb')
    pdf = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(pdf)
    outlines = pdf.getOutlines()
    bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
    print(bookmarks_info)

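Note: my bookmarks are section numbers (e.g. "1.1 Introduction"), and I map the bookmark information to the section number. If your bookmarks are different, modify this part of the code: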

        elif type(outlines) == pyPdf.pdf.Destination:
            title = outlines['/Title']
            result[title.split()[0]] = dict(title=outlines['/Title'], top=outlines['/Top'], \
                left=outlines['/Left'], page=(pg_id_num_map[outlines.page.idnum]+1))
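
To see why the key matters: `title.split()[0]` keeps only the first whitespace-separated token, which is unique for numbered sections but collides for titles that share a first word. Here is a small sketch with made-up titles. The helper names `key_by_first_token` and `key_by_full_title` are hypothetical, and keying by the full title is just one possible alternative.

```python
def key_by_first_token(title):
    """Key used in the answer: the first whitespace-separated token."""
    return title.split()[0]


def key_by_full_title(title):
    """Alternative key: the whole title, trimmed."""
    return title.strip()


# Made-up bookmark titles; the last two share their first word.
titles = ["1.1 Introduction", "1.2 Scope", "Appendix A", "Appendix B"]

by_token = {key_by_first_token(t): t for t in titles}
by_title = {key_by_full_title(t): t for t in titles}

# "Appendix A" is overwritten by "Appendix B" under the first-token key.
print(len(by_token), len(by_title))  # prints 3 4
```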

vjrehmav #2

This handles the bookmarks recursively, following the suggestions from vjayky and Giulio D.
PyPDF2 >= v1.25

    from PyPDF2 import PdfFileReader


    def printBookmarksPageNumbers(pdf):
        def reviewAndPrintBookmarks(bookmarks, indent=0):
            for b in bookmarks:
                if type(b) == list:
                    reviewAndPrintBookmarks(b, indent + 4)
                    continue
                pg_num = pdf.getDestinationPageNumber(b) + 1  # page count starts from 0
                print("%s%s: Page %s" % (" " * indent, b.title, pg_num))

        reviewAndPrintBookmarks(pdf.getOutlines())


    with open('document.pdf', "rb") as f:
        pdf = PdfFileReader(f)
        printBookmarksPageNumbers(pdf)

PyPDF2 < v1.25

    from PyPDF2 import PdfFileReader


    def printBookmarksPageNumbers(pdf):
        # Map page ids to page numbers
        pg_id_to_num = {}
        for pg_num in range(0, pdf.getNumPages()):
            pg_id_to_num[pdf.getPage(pg_num).indirectRef.idnum] = pg_num

        def reviewAndPrintBookmarks(bookmarks, indent=0):
            for b in bookmarks:
                if type(b) == list:
                    reviewAndPrintBookmarks(b, indent + 4)
                    continue
                pg_num = pg_id_to_num[b.page.idnum] + 1  # page count starts from 0
                print("%s%s: Page %s" % (" " * indent, b.title, pg_num))

        reviewAndPrintBookmarks(pdf.getOutlines())


    with open('document.pdf', "rb") as f:
        pdf = PdfFileReader(f)
        printBookmarksPageNumbers(pdf)


Sample output (both methods):

    Bookmark 1: Page 1
        Bookmark 1.1: Page 2
        Bookmark 1.2: Page 3
    Bookmark 2: Page 4
    Bookmark 3: Page 5
        Bookmark 3.1: Page 6

iaqfqrcu #3

As of 2019, for those interested in a quicker way, you can use:

    from PyPDF2 import PdfFileReader


    def printPageNumberFrom(filename):
        with open(filename, "rb") as f:
            pdf = PdfFileReader(f)
            bookmarks = pdf.getOutlines()
            # Top-level bookmarks only: nested lists of child bookmarks are not handled.
            for b in bookmarks:
                print(pdf.getDestinationPageNumber(b) + 1)  # page count starts from 0
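
The loop above stops at top-level bookmarks. Here is a hedged sketch of a recursive variant, written against the snake_case names used by the newer pypdf releases (`reader.outline` and `reader.get_destination_page_number`, assumed here to stand in for `getOutlines()` and `getDestinationPageNumber()`). `FakeBookmark` and `FakeReader` are hypothetical stand-ins so the sketch runs without a PDF, and the `None` handling reflects an assumption about how recent releases report a missing page.

```python
from collections import namedtuple


def bookmark_pages(reader, outline=None, depth=0):
    """Yield (depth, title, 1-based page number) for every bookmark."""
    if outline is None:
        outline = reader.outline
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding bookmark.
            yield from bookmark_pages(reader, item, depth + 1)
            continue
        index = reader.get_destination_page_number(item)  # 0-based
        # Assumption: None means the target page was not found.
        yield depth, item.title, None if index is None else index + 1


# Hypothetical stand-ins that mimic only the attributes used above.
FakeBookmark = namedtuple("FakeBookmark", "title page_index")


class FakeReader:
    outline = [
        FakeBookmark("Intro", 0),
        [FakeBookmark("Scope", 1)],
        FakeBookmark("Usage", 3),
    ]

    def get_destination_page_number(self, destination):
        return destination.page_index


demo = list(bookmark_pages(FakeReader()))
print(demo)
```

With pypdf installed, the same generator would presumably be called as `bookmark_pages(PdfReader("document.pdf"))`.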

ulmd4ohb #4

I'm not sure, but according to the documentation for pypdf.Destination, the page number of a bookmark is simply Destination.page.
