Extract specific text from PDF using Python

Posted by nbysray5 on 2023-04-28 in Python

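How can I extract specific text from a PDF using Python?
For example, the PDF contains (Name: Python, Color: Blue). In this case, I want to extract whatever text comes after "Name:", but nothing after the "," between "Python" and "Color".
Any help is appreciated.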

import PyPDF2

# Open in binary mode; PdfReader expects a byte stream
pdf = open("C:\\Users\\ME\\Desktop\\test.pdf", "rb")

reader = PyPDF2.PdfReader(pdf)

# Only the first page is read here
page = reader.pages[0]

print(page.extract_text())

This extracts the entire text of the page, not just the part I need.

python

Source: https://stackoverflow.com/questions/76110821/extract-specific-text-from-pdf-using-python

2 Answers

nr9pn0ug #1

If your library returns a string, you can use a regular expression to find the output you want:

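import re

text = "Name: Python , Color: Blue"
# Capture everything after "Name:" up to, but not including, the first comma.
# re.search is used so the label does not have to sit at the start of the string.
match = re.search(r"Name:\s*([^,]*)", text)
if match:
    print(match.group(1).strip())  # Python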
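
To generalize the idea, here is a minimal sketch that is not part of the original answer. It wraps the pattern in a helper so any label can be looked up and every occurrence is returned. The name `extract_field` and its parameters are hypothetical, and `text` stands in for whatever string your PDF library returns (for example from `page.extract_text()`).

```python
import re


def extract_field(text, label, stop=","):
    """Return every value that follows `label:` up to the first `stop` character.

    Hypothetical helper for illustration; `text` is assumed to be a plain
    string such as the one returned by page.extract_text().
    """
    # re.escape guards against labels or stop characters that are regex metacharacters.
    # The character class also excludes newlines so a value never spans two lines.
    pattern = re.escape(label) + r":\s*([^" + re.escape(stop) + r"\n]*)"
    return [value.strip() for value in re.findall(pattern, text)]


sample = "Name: Python , Color: Blue\nName: Rust, Color: Orange"
print(extract_field(sample, "Name"))   # ['Python', 'Rust']
print(extract_field(sample, "Color"))  # ['Blue', 'Orange']
```

Returning a list rather than a single string is a design choice for pages where the label appears more than once; take the first element if only one value is expected.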

2023-04-28

hpcdzsge #2

Try this using the PyMuPDF package.

import fitz  # PyMuPDF
doc = fitz.open("test.pdf")
page = doc[0]

blocks = page.get_text("blocks")  # extract text separated by paragraphs

# a block is a tuple starting with 4 floats followed by lines in paragraph
for b in blocks:
    lines = b[4].splitlines()  # lines in the paragraph
    for line in lines:  # look for lines having 'Name:' and 'Color:'
        p1 = line.find("Name:")
        if p1 < 0:
            continue
        p2 = line.find("Color:", p1)
        if p2 < 0:
            continue
        text = line[p1+5:p2]  # all text in between
        p3 = text.find(",")  # find any comma
        if p3 >= 0:  # there, shorten text accordingly
            text = text[:p3]
        print(text.strip())  # finished: show the extracted value without padding
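
If several labelled fields are needed from the same line, a small parser can return them all at once. The following is a minimal sketch, not part of the original answer: `parse_fields` is a hypothetical helper, and it assumes the fields are separated by commas and each has the form `Label: value`. The `line` argument would be one of the lines obtained from `b[4].splitlines()` in the loop above.

```python
def parse_fields(line, sep=",", delim=":"):
    """Split a line like 'Name: Python , Color: Blue' into a dict.

    Hypothetical helper; assumes `sep`-separated 'Label: value' pairs.
    """
    fields = {}
    for part in line.split(sep):
        # partition splits on the first delimiter only, so values may contain ':'
        label, found, value = part.partition(delim)
        if found:  # skip fragments that have no 'Label:' prefix
            fields[label.strip()] = value.strip()
    return fields


fields = parse_fields("Name: Python , Color: Blue")
print(fields["Name"])   # Python
print(fields["Color"])  # Blue
```

This only uses string methods, so it works on text from any extraction library. It will split values that themselves contain commas, which is the same limitation as the loop above.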

2023-04-28
