Python regex: cleaning text: remove everything up to a certain line

Posted by e5nszbig on 2022-12-25 in Python


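I have a text file containing the tragedy of Macbeth. I want to clean it, and the first step is to remove everything before the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.
I tried: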

import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
    with open('removed_intro_file', 'w') as output:
        removed = re.sub(title, '', removed_intro)
        print(removed)
        output.write(removed)

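The print statement doesn't print anything, so the pattern doesn't seem to match anything. How can I use a regex across several lines? Should I instead use pointers to the start and end of the lines to be removed? I would also be glad to know whether there is a better way to solve this, perhaps without regex.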

python

Source: https://stackoverflow.com/questions/74906747/regex-cleaning-text-remove-everything-upto-a-certain-line

2 Answers


kmbjn2e3 #1

We can read the file line by line until we reach the target line, then write that line and all subsequent lines to the output file.

filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file, open('removed_intro_file', 'w') as output:
    for line in file:                 # discard all lines before the Macbeth title
        if line.strip() == title:     # strip the trailing newline before comparing
            output.write(line)        # keep the title line itself
            break
    output.writelines(file)           # copy all remaining lines unchanged

This approach is probably faster and more efficient than the regex approach.
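For text that is already in memory, the same line-by-line idea can be sketched without a regex. The helper name `drop_before_line` and the sample text are hypothetical, and returning an empty string when the title is missing is an assumption, not something the question specifies.

```python
def drop_before_line(text, title):
    """Return text starting at the first line equal to title (hypothetical helper)."""
    lines = text.splitlines(keepends=True)   # keep line endings so the output is unchanged
    for i, line in enumerate(lines):
        if line.strip() == title:            # compare without the trailing newline
            return ''.join(lines[i:])        # title line plus everything after it
    return ''                                # assumption: keep nothing if the title never appears

sample = "Project Gutenberg header\nlicense text\nThe Tragedie of Macbeth\nActus Primus.\n"
result = drop_before_line(sample, 'The Tragedie of Macbeth')
```

Comparing whole lines, rather than searching for the title as a substring, avoids cutting at an earlier mention of the title inside a header or table of contents.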

Answered 2022-12-25

nxowjjhe #2

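Your regex only replaces title with ''. You want to remove the title and all the text before it, so search for all characters (including newlines) from the start of the string up to the title. This should work (I only tested it on a sample file I wrote):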

removed = re.sub(r'(?s)^.*'+re.escape(title), '', removed_intro)
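Two details of that pattern are worth checking on a small sample: `.*` is greedy, so it removes everything through the last occurrence of the title, and the title itself is removed too. The sketch below uses a made-up sample string and shows a variant (an assumption about what the question wants, not part of the answer) that stops at the first line consisting only of the title and keeps that line.

```python
import re

title = 'The Tragedie of Macbeth'
# Made-up sample: the title also appears inside an earlier "Contents" line.
sample = ("header\nContents: The Tragedie of Macbeth\nmore header\n"
          "The Tragedie of Macbeth\nActus Primus.\n")

# Pattern from the answer: greedy, removes through the last occurrence of the title.
greedy = re.sub(r'(?s)^.*' + re.escape(title), '', sample)

# Variant: match lazily from the start of the string up to the first line that is
# exactly the title, and use a lookahead so the title line itself is kept.
pattern = r'(?ms)\A.*?(?=^' + re.escape(title) + r'$)'
kept = re.sub(pattern, '', sample, count=1)
```

Here `greedy` starts with the newline that followed the title, while `kept` starts with the title line. If the file has no line equal to the title, the variant leaves the text unchanged.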

Answered 2022-12-25
