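I am writing a script that extracts all the text from a PDF file and then processes it. As a first step I am trying to clean up the document by removing a recurring piece of text, but when I do this with Python's re.sub, it only seems to work up to a certain line.
Can you help me?
This is the PDF file: https://fastupload.io/en/jzVkEoqzROsdLGs/file
This is the code: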
import re
from pypdf import PdfReader
from sys import exit

# Open the PDF file
reader = PdfReader("archivo.pdf")
texto_completo = ""
for page in reader.pages:
    texto_completo += page.extract_text() + "\n"
print(texto_completo)

# Remove the text "IT Certification Guaranteed, The Easy Way!" (page number)
texto_completo = re.sub(r'(?s)(?=IT Certification Guaranteed, The Easy Way!)(.*?)(\d+)', r"", texto_completo, re.MULTILINE)
print(texto_completo)

# Result
with open('resultado.txt', 'w') as res:
    res.write(texto_completo)
exit()
1 Answer
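Solved. I found the solution in the following question: Bug in Python Regex? (re.sub with re.MULTILINE). The fourth positional argument of re.sub is count, not flags, so re.MULTILINE (whose integer value is 8) was being treated as count=8 and only the first 8 matches were replaced. Passing it by keyword, flags=re.MULTILINE, fixes it.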
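For anyone hitting the same thing, here is a minimal sketch of the difference. It does not read the original PDF: the `sample` text and the `strip_footers` helper are made up for illustration, and I am assuming each page's footer is followed by its page number as in the question.

```python
import re

# Same pattern as in the question: match from the footer text up to the
# first run of digits after it (the page number).
PATTERN = r'(?s)(?=IT Certification Guaranteed, The Easy Way!)(.*?)(\d+)'

def strip_footers(text: str) -> str:
    """Hypothetical helper: remove every footer, passing flags by keyword."""
    return re.sub(PATTERN, "", text, flags=re.MULTILINE)

# Made-up stand-in for the extracted PDF text: 12 "pages", each with a footer.
sample = "".join(
    f"Question {i}\nIT Certification Guaranteed, The Easy Way!\n{i}\n"
    for i in range(1, 13)
)

# re.sub(pattern, repl, string, count=0, flags=0): a flag passed as the
# fourth positional argument lands in count. The call in the question is
# therefore equivalent to this one, which stops after 8 replacements.
buggy = re.sub(PATTERN, "", sample, count=int(re.MULTILINE))

fixed = strip_footers(sample)

print(buggy.count("IT Certification"))  # footers left behind by the buggy call
print(fixed.count("IT Certification"))  # footers left after the keyword fix
```

Since the pattern uses neither ^ nor $, re.MULTILINE has no effect on what it matches, so dropping the argument entirely would work just as well here.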