Regex and PyMuPDF, Python [closed]

Posted by g6ll5ycj on 2022-12-27 in Python


**Closed.** This question needs debugging details. It is not currently accepting answers.

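How can I find a pattern and then check whether the first element between one occurrence of the pattern and the next is text or an image, extracting the text if it is text and the image if it is an image?
The regular expression is: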

(r'QUESTÃO \d+')

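I tried the split() function, but it does not show whether the PDF contains an image. I am trying to identify whether the first element is text or an image, for example the element between "QUESTÃO 01" and "QUESTÃO 02".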

regex

Source: https://stackoverflow.com/questions/74868979/regex-and-pymupdf-python

1 Answer


t2a7ltrp1#

Forget regular expressions entirely!
PyMuPDF needs only one method to extract both the text and the images of a page:

import fitz  # PyMuPDF

doc = fitz.open("your.pdf")  # open your file
page = doc[0]  # example: first page
blocks = page.get_text("dict")["blocks"]  # combined text / image extraction

for block in blocks:  # each block is either text or an image
    if block["type"] == 1:  # this is an image
        pass  # do something with the image, e.g. use the bytes in block["image"]
    else:  # this is a text block
        for line in block["lines"]:
            for span in line["spans"]:
                print(span["text"])  # do something with the piece of text

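An ***image block*** is a dictionary holding the complete image meta information, the image's position on the page, and the binary image data. This can be used, for example, to save the image to a regular image file.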
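
Saving image blocks to files could look roughly like the sketch below. It relies on the `"image"` (raw bytes) and `"ext"` (file extension) keys of an image block. The function names, the file naming scheme, and the `out_dir` parameter are illustrative assumptions and not part of PyMuPDF.

```python
from pathlib import Path


def image_filename(page_number: int, index: int, block: dict) -> str:
    """Build a name such as 'page1-img0.png' from the block's 'ext' key."""
    return f"page{page_number}-img{index}.{block['ext']}"


def save_image_blocks(blocks: list, page_number: int, out_dir: str = ".") -> list:
    """Write every image block (type 1) to its own file and return the paths."""
    saved = []
    images = [b for b in blocks if b["type"] == 1]
    for index, block in enumerate(images):
        target = Path(out_dir) / image_filename(page_number, index, block)
        target.write_bytes(block["image"])  # bytes are in the format named by 'ext'
        saved.append(target)
    return saved


def save_page_images(pdf_path: str, page_index: int = 0, out_dir: str = ".") -> list:
    """Hypothetical convenience wrapper: extract and save one page's images."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    blocks = doc[page_index].get_text("dict")["blocks"]
    return save_image_blocks(blocks, page_index + 1, out_dir)
```

Calling `save_page_images("your.pdf")` would then write the images of the first page into the current directory.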

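A text block can be thought of as a paragraph of text. It consists of lines, and each line consists of one or more text "spans". All of these pieces carry position information, writing direction, font information, and text color.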
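
Applied to the original question, the block structure can be walked once to find each "QUESTÃO n" label and report what follows it. The sketch below is one possible approach, not the answer author's code. It keeps the regular expression only to recognize the labels, and it assumes that the blocks arrive in reading order and that a block contains at most one label. The function names are hypothetical.

```python
import re

QUESTION = re.compile(r"QUESTÃO \d+")  # pattern taken from the question


def block_text(block: dict) -> str:
    """Join the span texts of a text block; image blocks give an empty string."""
    if block["type"] != 0:
        return ""
    return "\n".join(
        "".join(span["text"] for span in line["spans"])
        for line in block["lines"]
    )


def first_element_after_each_question(blocks: list) -> dict:
    """Map each question label to ('text', str) or ('image', bytes)."""
    result = {}
    pending = None  # label still waiting for its first element
    for block in blocks:
        text = block_text(block)
        match = QUESTION.search(text)
        if match:
            pending = match.group()
            rest = text[match.end():].strip()
            if rest:  # text follows the label inside the same block
                result[pending] = ("text", rest)
                pending = None
        elif pending is not None:
            if block["type"] == 1:
                result[pending] = ("image", block["image"])
                pending = None
            elif text.strip():  # skip whitespace-only text blocks
                result[pending] = ("text", text.strip())
                pending = None
    return result
```

Here `blocks` is the list obtained from `page.get_text("dict")["blocks"]`. If the order stored in the PDF differs from the visual order, sorting the list by the top-left corner of each block's `"bbox"` before the call may help.
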
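The get_text() method is very flexible and supports more output formats than just "dict". Other options include HTML / XML output or information detailed down to each character. You can also deselect image extraction, restrict extraction to part of the page rectangle, and more.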
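
As an illustration of restricting extraction and deselecting images, the sketch below limits `get_text("dict")` to the strip of the page between two labels. It assumes that both labels sit on the same page, that `page.search_for()` finds each of them, and that page coordinates grow downward. `region_between` and `blocks_between` are hypothetical helper names.

```python
def region_between(page_rect, upper_bbox, lower_bbox):
    """Return (x0, y0, x1, y1) spanning the page width between two markers.

    The region starts at the bottom edge of the upper marker and ends at the
    top edge of the lower marker. All arguments are (x0, y0, x1, y1) tuples.
    """
    x0, _, x1, _ = page_rect
    top = upper_bbox[3]
    bottom = lower_bbox[1]
    if bottom < top:
        raise ValueError("lower marker lies above upper marker")
    return (x0, top, x1, bottom)


def blocks_between(pdf_path, first_label, second_label,
                   with_images=True, page_index=0):
    """Extract only the blocks located between two labels on one page."""
    import fitz  # PyMuPDF; imported lazily so region_between has no dependency

    page = fitz.open(pdf_path)[page_index]
    upper = page.search_for(first_label)[0]  # first hit; IndexError if absent
    lower = page.search_for(second_label)[0]
    clip = fitz.Rect(region_between(tuple(page.rect), tuple(upper), tuple(lower)))
    flags = fitz.TEXTFLAGS_DICT  # default flag set for "dict" output
    if not with_images:
        flags &= ~fitz.TEXT_PRESERVE_IMAGES  # deselect image extraction
    return page.get_text("dict", clip=clip, flags=flags)["blocks"]
```

For example, `blocks_between("your.pdf", "QUESTÃO 01", "QUESTÃO 02")` would return the blocks of the first question only, and passing `with_images=False` would leave out image blocks.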

2022-12-27
