How to use regular expressions in Python to split articles based on punctuation

Posted by 8iwquhpp on 2023-01-27 in Python


I need to split an article into sentences at punctuation marks, and I am using the following regular expression:

re.split(r'[,|.|?|!]', strContent)

It does work, but there is one problem: it also splits abbreviated Latin names such as G. lucidum, which should not be split:

Many studies to date have described the anticancer properties of G. lucidum,

The abbreviated Latin name is a single uppercase letter followed by a period and a space, so I tried changing the regular expression above to:

re.split(r'[,|(?:[^A-Z].)|?|!]', strContent)

However, I get the following error:

re.error: unbalanced parenthesis

How should I modify this regular expression?

regex

Source: https://stackoverflow.com/questions/75253180/how-to-use-regular-expressions-in-python-to-split-articles-based-on-punctuation

2 answers


kknvjkwl1#

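You should use a negative lookbehind and put it *before* the character set that matches the sentence-ending punctuation.
The negative lookbehind should match a word that consists of a single uppercase letter, which can be done by matching a word boundary with \b in front of the letter.
The | is also unnecessary inside a character set. It is meant for alternation between patterns; inside [...] it is just a literal pipe character.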

re.split(r'(?<!\b[A-Z])[,.?!]', strContent)
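As a quick check of the points above, here is a minimal sketch using only the standard re module. The sample sentence extends the one from the question with made-up text, and the names sample, pattern and parts are just for illustration. It shows that | inside a character set is a literal, why the attempted pattern raises the error (the set ends at the first ], leaving a ) with no matching opening parenthesis), and what the lookbehind version returns.

```python
import re

# '|' inside [...] is a literal pipe, so the original pattern also splits on it.
pipe_split = re.split(r'[,|.|?|!]', "a|b,c")
print(pipe_split)

# The attempted fix does not compile: the set ends at the first ']',
# so the ')' that follows has no matching '('.
try:
    re.compile(r'[,|(?:[^A-Z].)|?|!]')
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print("re.error:", compile_error)

# Negative lookbehind: do not split when the punctuation follows a
# single uppercase letter that starts a word (e.g. the "G" in "G. lucidum").
pattern = re.compile(r'(?<!\b[A-Z])[,.?!]')

# Made-up sample text built around the sentence from the question.
sample = ("Many studies to date have described the anticancer properties "
          "of G. lucidum, and the studies are vast. Really? Yes!")

# Drop surrounding whitespace and the empty string after the final '!'.
parts = [p.strip() for p in pattern.split(sample) if p.strip()]
print(parts)

# Known limitation: a sentence that ends in a single capital letter is not split.
limitation = pattern.split("We studied vitamin C. It helps.")
print(limitation)
```

Note the trade-off in the last call: because the lookbehind blocks every split after a lone capital letter, "vitamin C." is treated like an abbreviation and the two sentences stay joined. Whether that is acceptable depends on the text being processed.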

2023-01-27

rsaldnfx2#

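Finding complete sentences with regular expressions alone is hard because of edge cases such as the abbreviations you have already run into. You should use an NLP library such as NLTK.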

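# Requires the Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
from nltk.tokenize import sent_tokenize
text = "Many studies to date have described the anticancer properties of G. lucidum.  The studies are vast."
print(sent_tokenize(text))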

# ['Many studies to date have described the anticancer properties of G. lucidum.', 'The studies are vast.']

2023-01-27
