Regex: split on "." but not inside substrings like "J.K. Rowling"

Posted by cfh9epnr on 2023-01-27 in Other


I am looking for the names of books and their authors in a bunch of text, such as:

my_text = """
    My favorites books of all time are:
    Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""

Right now I use the code below to split the text on the separator, like this:

import re

pattern = r" *(.+) by ((?: ?\w+)+)"

# findall returns one (title, author) tuple per match
matches = re.findall(pattern, my_text)

res = []
for match in matches:
    res.append((match[0], match[1]))

print(res)
# Output as reported in the question:
# [('Harry potter', 'J'), ('K. Rowling, Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery '), ("I read it many times). That's it", 'the way')]

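There are some false positives (such as "That's it by the way"), but my main problem is that the author gets cut off whenever the name is written with initials, which is very common.
I can't figure out how to allow initials like "J. K. Rowling" (or initials with no surrounding spaces, like "J.K.Rowling").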

regex

Source: https://stackoverflow.com/questions/75196453/regex-split-on-but-not-in-substrings-like-j-k-rowling

1 Answer


y53ybaqx1#

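Change the pattern to the following:

pattern = r" *(.+) by ((?: ?[A-Z]\.?)+ ?(?:[A-Z][a-z]+)+)"

To allow initials in the author's name, the pattern needs a few changes. First, each initial is matched by the character class "[A-Z]" (any uppercase letter) followed by an optional dot: "\." is a literal dot and "?" makes it optional. The dot has to be escaped, because an unescaped "." matches any character. Next, an optional space " ?" is allowed between initials. Finally, "+" repeats this group so that several initials can appear in a row, and the trailing "(?:[A-Z][a-z]+)+" matches the capitalized surname.
When I tried your code with this pattern, I got: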

('Harry potter', 'J. K. Rowling')

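It seems to ignore the rest of the authors, but it works for authors written with initials. Let me know if you want me to work out how to make it handle both names with initials and names without them, if that makes sense.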
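To see why the other authors are skipped, it can help to test the author part of the pattern on its own. The sketch below does that with `re.fullmatch`, writing the dot as a literal `\.`; the list of sample names is my own assumption, picked to mirror the text in the question.

```python
import re

# Author sub-pattern from the answer above, with the dot written as a literal "\."
AUTHOR = r"(?: ?[A-Z]\.?)+ ?(?:[A-Z][a-z]+)+"

# Illustrative sample names (assumptions, chosen to mirror the question's text)
names = ["J. K. Rowling", "J.K.Rowling", "Frank Herbert", "Antoine de Saint Exupery"]

# fullmatch succeeds only if the whole name fits the sub-pattern
results = {name: re.fullmatch(AUTHOR, name) is not None for name in names}

for name, ok in results.items():
    print(f"{name!r}: {'match' if ok else 'no match'}")
```

The two names with initials match, while "Frank Herbert" and "Antoine de Saint Exupery" do not: the initials group has to consume at least one capital letter, and whatever follows it must again start with a capital, so a plain first name like "Frank" can never fit.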
It took me a while, but here is a version that handles both:

import re

pattern = r" *(?:and )?(.+?) by ([A-Z](?:\.|\w)+(?: [A-Z](?:\.|\w)+)*)"
matches = re.finditer(pattern, my_text)

result = []
for match in matches:
    book_title = match.group(1)
    author = match.group(2)
    result.append((book_title, author))

print(result)

This gives:

[('Harry potter', 'J. K. Rowling'), (', Dune (first book)', 'Frank Herbert'), ('Le Petit Prince', 'Antoine')]
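Two rough edges remain in that result: the second title keeps a leading ", " and "Antoine de Saint Exupery" is cut at the lowercase "de". One possible refinement, offered only as a sketch, is to let the pattern skip leading separators and to allow a few lowercase name particles between capitalized words. The `PARTICLES` list and the `NAME` and `AUTHOR` helper names are assumptions of mine, not part of the original answer, and the list is far from complete.

```python
import re

my_text = """
    My favorites books of all time are:
    Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""

# Hypothetical, incomplete list of lowercase particles allowed inside a name
PARTICLES = r"(?:de|du|la|van|von|der)"
# One name word: a capital followed by letters and/or dots ("J.", "Rowling", "J.K.Rowling")
NAME = r"[A-Z](?:\.|\w)+"
# Name words separated by single spaces, with optional particles in between
AUTHOR = rf"{NAME}(?: (?:{PARTICLES} )*{NAME})*"

# Skip leading spaces, commas and semicolons, plus an optional "and "
pattern = re.compile(rf"[ ,;]*(?:and )?(.+?) by ({AUTHOR})")

result = [(m.group(1), m.group(2)) for m in pattern.finditer(my_text)]
print(result)
# [('Harry potter', 'J. K. Rowling'), ('Dune (first book)', 'Frank Herbert'), ('Le Petit Prince', 'Antoine de Saint Exupery')]
```

Because the author still has to start with a capital letter, "That's it by the way" is not picked up as a book. Names that rely on particles outside the list, or titles that themselves contain " by ", would still need extra handling.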

