regex: How to extract sentences containing a citation mark from a text file

fumotvh3  posted on 2023-06-25  in Other

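For example, I have three sentences, shown below, and the middle one contains a citation mark (Warren and Pereira, 1982). The citation is always in parentheses and follows this format: (string, comma, space, number)
He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.
I am using regex to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this:
The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).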


atmip9wb  #1

This does the trick. Two sample texts represent the cases of interest:

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, match the case where the citation is at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

Then match the case where the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

Combine the two cases with the '|' regex operator:

import re

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")

Run it:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

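In both cases you get the sentence containing the citation. Each result is a tuple with one slot per alternative, so the slot for the alternative that did not match is an empty string.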
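The two alternatives differ only in the optional text between the closing parenthesis and the final period, so they could probably be folded into a single pattern. The following is a minimal sketch, not the answer's own code: it assumes every sentence ends with a period and contains no other periods (so names like "et al." would break it), and the names CITATION_SENTENCE and find_citation_sentences are made up for illustration.

```python
import re

# A sentence here is a run of non-period characters that contains
# "(<text>, <digits>)" and is terminated by a single period.
# Assumption: no periods occur inside a sentence.
CITATION_SENTENCE = re.compile(r"[^.]*\([^()]+, \d+\)[^.]*\.")


def find_citation_sentences(text):
    """Return every sentence in text that contains a parenthesised citation."""
    # strip() removes the space left over from the previous sentence boundary.
    return [m.group().strip() for m in CITATION_SENTENCE.finditer(text)]
```

Because the pattern has no capture groups, each match is the whole sentence, and there are no empty tuple slots to filter out.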
A good resource is the Python regular expression documentation and the accompanying Regex HOWTO page.
Cheers


gcuhipw9  #2

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

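You can split the text into a list of sentences and then select the ones that end with ")". Note that this only finds citations that sit right before the period.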

sentences = text.split(".")[:-1]

for sentence in sentences:
    # endswith() is safe on empty strings, unlike sentence[-1]
    if sentence.endswith(")"):
        print(sentence)
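To also catch a citation in the middle of a sentence, the split-and-filter idea can be paired with a small citation pattern. This is only a sketch under stated assumptions: a sentence boundary is taken to be a period followed by whitespace, the citation follows the "(string, number)" format from the question, and the names CITATION and sentences_with_citation are hypothetical.

```python
import re

# "(<anything without parentheses>, <digits>)", e.g. "(Warren and Pereira, 1982)"
CITATION = re.compile(r"\([^()]+, \d+\)")


def sentences_with_citation(text):
    """Split text into sentences and keep those that contain a citation."""
    # Split on whitespace that follows a period; the lookbehind keeps the
    # period attached to the sentence it ends.
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    return [s for s in sentences if CITATION.search(s)]
```

Splitting first keeps the sentence-boundary rule separate from the citation rule, so either one can be tightened later without touching the other.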
