Regex: split by punctuation (. ! ? ; :) but exclude abbreviations

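I want to create a function that splits a string containing multiple sentences on dots, while also handling abbreviations. For example, it should not split after "Univ." and "Dept.". It's a bit hard to explain, but I'll show the test cases. I have looked at this post (Split string with "." (dot) while handling abbreviations), but the answer there removes the non-terminal dots (U.S.A. becomes USA), and I want to keep the dots.
Here is my function: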

import re


def split_string_by_punctuation(line: str) -> list[str]:
    """
    Splits a given string into a list of strings using terminal punctuation marks (., !, ?, or :) as delimiters.

    This function utilizes regular expression patterns to ensure that abbreviations, honorifics,
    and certain special cases are not considered as sentence delimiters.

    Args:
        line (str): The input string to be split into sentences.

    Returns:
        list: A list of strings representing the sentences obtained after splitting the input string.

    Notes:
        - Negative lookbehind is used to exclude abbreviations (e.g., "e.g.", "i.e.", "U.S.A."),
          which might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude honorifics (e.g., "Mr.", "Mrs.", "Dr.")
          that might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude some abbreviations (e.g., "Dept.", "Univ.", "et al.")
          that might have a period but are not the end of a sentence.
        - Positive lookbehind is used to match a whitespace character following a terminal
          punctuation mark (., !, ?, or :).
    """
    punct_regex = re.compile(r"(?<=[.!?;:])(?:(?<!Prof\.)|(?<!Dept\.)|(?<!Univ\.)|(?<!et\sal\.))(?<!\w\.\w.)(?<![A-Z][a-z]\.)\s")

    return re.split(punct_regex, line)

These are my test cases:

class TestSplitStringByPunctuation(object):
    def test_split_string_by_punctuation_1(self):
        # Test case 1
        text1 = "I am studying at Univ. of California, Dept. of Computer Science. The research team includes " \
                "Prof. Smith, Dr. Johnson, and Ms. Adams et al. so we are working on a new project."
        result1 = split_string_by_punctuation(text1)
        assert result1 == ['I am studying at Univ. of California, Dept. of Computer Science.',
                           'The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. '
                           'so we are working on a new project.'], "Test case 1 failed"

    def test_split_string_by_punctuation_2(self):
        # Test case 2
        text2 = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one? " \
                "Finally, here's the last one:"
        result2 = split_string_by_punctuation(text2)
        assert result2 == ['This is a city in U.S.A..', 'This is i.e. one!', 'What about this e.g. one?',
                           "Finally, here's the last one:"], "Test case 2 failed"

    def test_split_string_by_punctuation_3(self):
        # Test case 3
        text3 = "This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list"
        result3 = split_string_by_punctuation(text3)
        assert result3 == [
            'This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return '
            'as a single element list'], "Test case 3 failed"

For example, the result for test case 1 is ['I am studying at Univ.', 'of California, Dept.', 'of Computer Science.', 'The research team includes Prof.', 'Smith, Dr. Johnson, and Ms. Adams et al.', 'so we are working on a new project.']: it splits the string after "Univ.", "Dept.", "Prof." and "et al.".

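I would suggest using findall to capture the sentences, rather than split to identify the breaks between sentences.
Some other remarks:

Calling re.compile and then passing the resulting regex object to re.split (or any other module-level re function) gains nothing. Call the method on the regex object instead, as in punct_regex.split(line). However, since this regex is used only once, you could just as well skip the call to compile: the re function compiles the pattern when it is called.
Listing all possible abbreviations is a never-ending task! Unless you are sure you have caught them all, I would suggest a heuristic: if a dot is not followed by whitespace and a capital letter, the preceding word is an abbreviation. If a word starts with a capital letter, has at most four letters, and is followed by a dot, it is also an abbreviation. In all other cases the dot is interpreted as ending a sentence.
There are some errors in your test cases.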
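
To illustrate the first remark, here is a minimal sketch of the two ways of calling split. The pattern and the sample line are simplified, hypothetical stand-ins (not the pattern from the question), chosen only to show that both calls return the same list.

```python
import re

line = "One. Two! Three"

# Hypothetical, simplified pattern (not the one from the question):
# a whitespace character preceded by terminal punctuation.
punct_regex = re.compile(r"(?<=[.!?;:])\s")

# Call the method on the compiled object ...
via_method = punct_regex.split(line)

# ... or skip compile and pass the pattern string to the module-level function.
via_function = re.split(r"(?<=[.!?;:])\s", line)
```

Either form is fine for a pattern that is used once; the compiled object mainly pays off when the same pattern is reused many times.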

After fixing the test cases, this function passes the tests:

import re


def split_string_by_punctuation(line):
    punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
    return re.findall(punct_regex, line)
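
As a quick check, the function can be run on sample texts taken from the question. This is only a sketch: the function is repeated so the snippet is self-contained, and the last input is a made-up sentence (not from the question) showing a limitation of the heuristic, namely that a short capitalized word at the end of a sentence is mistaken for an abbreviation.

```python
import re


def split_string_by_punctuation(line):
    # Same pattern as in the function above.
    punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
    return re.findall(punct_regex, line)


text1 = ("I am studying at Univ. of California, Dept. of Computer Science. "
         "The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams "
         "et al. so we are working on a new project.")
sentences1 = split_string_by_punctuation(text1)

text2 = "This is i.e. one! What about this e.g. one? Finally, here's the last one:"
sentences2 = split_string_by_punctuation(text2)

# Hypothetical input: "Bob." looks like an abbreviation to the heuristic,
# so the two sentences are returned as a single element.
limitation = split_string_by_punctuation("I met Bob. He left early.")
```

If such false positives matter for your data, the length limit in the first alternative would have to be tuned or replaced by an explicit list of abbreviations.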

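Explanation:

(?=\S): asserts that the first character of any match is not whitespace.
(?: | | )*: a non-capturing group with three alternative patterns, which can repeat zero or more times.
[A-Z][a-z]{0,3}\.: one of the alternatives: a capital letter, followed by at most three lowercase letters, and then a dot.
[^.?!;:]: one of the alternatives: a character that is not one of .?!;:.
\.(?!\s+[A-Z]): one of the alternatives: a dot that is not followed by whitespace and a capital letter.
.?: any character, if there is still one left. If there is one, we know it is one of .?!;: (otherwise the second alternative above would still have consumed it). If there is none, we are at the end of the input.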
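
The breakdown above can also be written into the pattern itself. The sketch below restates the same pattern with the re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern; the names SENTENCE, COMPACT and sample are illustrative only.

```python
import re

# The same pattern, written with re.VERBOSE so each part can carry a comment.
SENTENCE = re.compile(r"""
    (?=\S)                    # a match must start on a non-whitespace character
    (?:
        [A-Z][a-z]{0,3}\.     # capital + up to three lowercase letters + dot
      | [^.?!;:]              # any character that is not terminal punctuation
      | \.(?!\s+[A-Z])        # a dot not followed by whitespace and a capital
    )*
    .?                        # the terminal punctuation mark, if any is left
""", re.VERBOSE)

# The compact one-line form, for comparison.
COMPACT = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"

sample = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one?"
verbose_result = SENTENCE.findall(sample)
compact_result = re.findall(COMPACT, sample)
```

Both forms should produce the same list of sentences for the sample text.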

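Note: a non-capturing group still *matches* text; it just cannot be back-referenced. The term "capturing" refers to creating a group for it, not to "matching".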
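
The distinction matters for findall in particular: when the pattern contains a capturing group, findall returns what the group captured rather than the whole match. A small sketch with a made-up pattern and text:

```python
import re

text = "ab ab ab, cd"

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"(?:ab\s?)+", text)

# Capturing group: findall returns what the group captured,
# which is only the last repetition of the group.
capturing = re.findall(r"(ab\s?)+", text)
```

This is why the pattern in the function uses (?: ... ) for its alternatives: with a plain ( ... ) group, findall would no longer return full sentences.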
