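I want to create a function that splits a string containing multiple sentences on periods, while also handling abbreviations. For example, it should not split after "Univ." or "Dept.". This is a bit hard to explain, but I will show the test cases. I have seen this post (Split string with "." (dot) while handling abbreviations), but the answers there remove the non-terminal dots (U.S.A. becomes USA), and I want to keep the dots.
Here is my function: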
import re


def split_string_by_punctuation(line: str) -> list[str]:
    """
    Splits a given string into a list of strings using terminal punctuation marks (., !, ?, or :) as delimiters.
    This function utilizes regular expression patterns to ensure that abbreviations, honorifics,
    and certain special cases are not considered as sentence delimiters.

    Args:
        line (str): The input string to be split into sentences.

    Returns:
        list: A list of strings representing the sentences obtained after splitting the input string.

    Notes:
        - Negative lookbehind is used to exclude abbreviations (e.g., "e.g.", "i.e.", "U.S.A."),
          which might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude honorifics (e.g., "Mr.", "Mrs.", "Dr.")
          that might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude some abbreviations (e.g., "Dept.", "Univ.", "et al.")
          that might have a period but are not the end of a sentence.
        - Positive lookbehind is used to match a whitespace character following a terminal
          punctuation mark (., !, ?, or :).
    """
    punct_regex = re.compile(r"(?<=[.!?;:])(?:(?<!Prof\.)|(?<!Dept\.)|(?<!Univ\.)|(?<!et\sal\.))(?<!\w\.\w.)(?<![A-Z][a-z]\.)\s")
    return re.split(punct_regex, line)
These are my test cases:
class TestSplitStringByPunctuation(object):
    def test_split_string_by_punctuation_1(self):
        # Test case 1
        text1 = "I am studying at Univ. of California, Dept. of Computer Science. The research team includes " \
                "Prof. Smith, Dr. Johnson, and Ms. Adams et al. so we are working on a new project."
        result1 = split_string_by_punctuation(text1)
        assert result1 == ['I am studying at Univ. of California, Dept. of Computer Science.',
                           'The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. '
                           'so we are working on a new project.'], "Test case 1 failed"

    def test_split_string_by_punctuation_2(self):
        # Test case 2
        text2 = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one? " \
                "Finally, here's the last one:"
        result2 = split_string_by_punctuation(text2)
        assert result2 == ['This is a city in U.S.A..', 'This is i.e. one!', 'What about this e.g. one?',
                           "Finally, here's the last one:"], "Test case 2 failed"

    def test_split_string_by_punctuation_3(self):
        # Test case 3
        text3 = "This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list"
        result3 = split_string_by_punctuation(text3)
        assert result3 == [
            'This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return '
            'as a single element list'], "Test case 3 failed"
For example, the result for test case 1 is ['I am studying at Univ.', 'of California, Dept.', 'of Computer Science.', 'The research team includes Prof.', 'Smith, Dr. Johnson, and Ms. Adams et al.', 'so we are working on a new project.']: the string is split after "Univ.", "Dept.", "Prof." and "et al.".
1 Answer
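I would suggest using findall to capture the sentences, rather than split to identify the sentence breaks.
Another remark: there is little point in calling re.compile and then passing the result to re.split (or any other module-level re function). The idiomatic way is to call the method on the regex object itself, as in punct_regex.split(line). However, since this regex is used only once, you could just as well skip the call to compile; the module-level re functions compile the pattern when they are called.
After fixing the test cases, this function passes the tests: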
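The code listing itself is missing from this copy of the answer. What follows is a minimal sketch reconstructed from the explanation below, assuming the three alternatives are joined in the order they are described and that the function keeps the name used in the question. The constant name SENTENCE_REGEX is made up for illustration, and the original listing may have differed in detail.

```python
import re

# Reconstructed from the explanation (an assumption, not the verbatim answer):
#   (?=\S)       start each match on a non-whitespace character
#   (?:A|B|C)*   consume abbreviations, ordinary characters, or non-terminal dots
#   .?           finally take the terminal punctuation mark, if one is left
SENTENCE_REGEX = re.compile(
    r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
)


def split_string_by_punctuation(line: str) -> list[str]:
    """Return the sentences found in line, keeping their punctuation."""
    # The pattern has no capturing groups, so findall returns whole matches.
    # The method is called on the compiled pattern object itself.
    return SENTENCE_REGEX.findall(line)


if __name__ == "__main__":
    sample = ("I am studying at Univ. of California, Dept. of Computer Science. "
              "The research team includes Prof. Smith et al. so we are busy.")
    for sentence in split_string_by_punctuation(sample):
        print(sentence)
```

This is a heuristic, not a general sentence splitter: any capitalized word of up to four letters followed by a dot is treated as an abbreviation, so an input such as "I met Tom. He left." comes back as a single sentence.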
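Explanation:
(?=\S): asserts that the first character of any match is not whitespace.
(?: | | )*: a non-capturing group with three alternative patterns. It can repeat zero or more times.
[A-Z][a-z]{0,3}\.: one of the alternatives: a capital letter, up to three lowercase letters, then a dot.
[^.?!;:]: one of the alternatives: a character that is not one of .?!;:
\.(?!\s+[A-Z]): one of the alternatives: a dot that is not followed by whitespace and a capital letter.
.?: any character, if there is still one. If there is one, we know it is one of .?!;: (otherwise the second alternative above would still have been used). If there is none, we are at the end of the input.
Note: a non-capturing group still *matches* text; it just cannot be back-referenced. The term "capturing" refers to creating a group for it, not to "matching".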
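The distinction matters for findall in particular: when a pattern contains a capturing group, re.findall returns the text of that group instead of the whole match. A small illustration, using a made-up pattern and input that are not part of the answer:

```python
import re

text = "ababc"

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:ab)+c", text)

# Capturing group: findall returns only what the group captured
# (its last repetition), not the whole match.
captured = re.findall(r"(ab)+c", text)

print(whole)     # ['ababc']
print(captured)  # ['ab']
```

With a capturing group around the alternatives, a findall-based splitter would return fragments of each sentence rather than the sentences themselves.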