Regex: extracting strings based on sentences that start with capital letters?

axr492tv posted on 2023-08-08 in Other

I have the following string:

JOHN SMITH, YOUTUBE: 
I'm having a great day today 
JANE DOE, GOOGLE:
I'm going to the gym later 
STEVE SMITH, FACEBOOK:
Time for people to speak 
SCHMEFF SCHMEZOS, JUNGLE:
Buy something from my online shop. You might like it

You can create the string here:

string = """JOHN SMITH, YOUTUBE: 
I'm having a great day today 
JANE DOE, GOOGLE:
I'm going to the gym later 
STEVE SMITH, FACEBOOK:
Time for people to speak 
SCHMEFF SCHMEZOS, JUNGLE:
Buy something from my online shop. You might like it"""


Load the Python package: import re
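I am trying to find a regex that splits the text by speaker and by what each speaker says. For example, I am trying to get the following excerpts: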

string1: JOHN SMITH, YOUTUBE>> 
I'm having a great day today 

string2: JANE DOE, GOOGLE>> 
I'm going to the gym later 

string3: STEVE SMITH, FACEBOOK>>
Time for people to speak 

string4: SCHMEFF SCHMEZOS, JUNGLE>> 
Buy something from my online shop. You might like it


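The strings can span multiple lines, so I am trying to capture two groups: the speaker, whose name is always followed by a colon (sometimes with a space before it, hence \s), and their speech, which can run over several lines.
I am trying to capture everything up to the next speaker. My current regex is: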

(^[A-Z].*):\s*\n*(?=(?:[A-Z]|$))


Names are always in capital letters and appear whenever a new speaker starts talking. Any help is appreciated.
I am using Python 3.9.
New sample string:

JOHN SMITH, GLOBAL HEAD OF YOUTUBE : Good morning, good 
afternoon, everyone . Before I hand over to facebook, I want to give a quick reminder of the reporting 
changes that have taken effect this filming of a tv show.  
 

 
BOBBY DUDE, GROUP FROM FACEBOOK:     Thanks, john smith lets talk about movies and films we watch when we are bored parents.

Answer #1, by bbuxkriu

One regex option here is re.findall:

import re

inp = """JOHN SMITH, YOUTUBE: 
I'm having a great day today 
JANE DOE, GOOGLE:
I'm going to the gym later 
STEVE SMITH, FACEBOOK:
Time for people to speak 
SCHMEFF SCHMEZOS, JUNGLE:
Buy something from my online shop. You might like it"""
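# Allow optional whitespace after the colon (\s* rather than \n): the first
# header in the sample has a trailing space, so a bare \n would skip JOHN SMITH.
matches = re.findall(r'([A-Z]+(?: [A-Z]+)+, [A-Z]+):\s*(.*?)(?=[A-Z]{2,}|$)', inp, flags=re.S)
print(matches)

[('JOHN SMITH, YOUTUBE', "I'm having a great day today \n"),
 ('JANE DOE, GOOGLE', "I'm going to the gym later \n"),
 ('STEVE SMITH, FACEBOOK', 'Time for people to speak \n'),
 ('SCHMEFF SCHMEZOS, JUNGLE', 'Buy something from my online shop. You might like it')]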
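The pattern above expects a single-word affiliation directly followed by the colon, so it would not match headers from the new sample string such as "JOHN SMITH, GLOBAL HEAD OF YOUTUBE :". One possible generalisation is sketched below. The names HEADER, PATTERN and extract_turns are made up for this illustration, and the sketch assumes that every header starts at the beginning of a line and contains only capital letters, spaces and commas before the colon.

```python
import re

# Assumed header shape: starts with a capital letter, then only capitals,
# spaces and commas (lazy, so trailing spaces before the colon are left out).
HEADER = r'[A-Z][A-Z ,]*?'

# Header at the start of a line, optional whitespace around the colon, then
# the speech (lazy) up to the next header at a line start or the end of text.
PATTERN = re.compile(
    rf'^({HEADER})\s*:\s*(.*?)(?=^{HEADER}\s*:|\Z)',
    flags=re.M | re.S,
)

def extract_turns(text):
    """Return a list of (speaker, speech) tuples with outer whitespace removed."""
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]

sample = (
    "JOHN SMITH, GLOBAL HEAD OF YOUTUBE : Good morning, good \n"
    "afternoon, everyone .  \n"
    " \n"
    "BOBBY DUDE, GROUP FROM FACEBOOK:     Thanks, john smith \n"
    "JANE DOE, GOOGLE:\n"
    "I'm going to the gym later \n"
)

# Print each turn in the layout requested in the question.
for number, (speaker, speech) in enumerate(extract_turns(sample), start=1):
    print(f"string{number}: {speaker}>>\n{speech}\n")
```

A caveat of this sketch: a line of speech that itself begins with all-caps text followed by a colon (for example "NOTE: ...") would be treated as a new speaker.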


Answer #2, by cnwbcb6i

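Use re.split and split on a run of whitespace that ends with a newline and is followed by a line that starts with an uppercase letter and contains no lowercase letters up to the colon.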

import re

test = '''JOHN SMITH, YOUTUBE: 
I'm having a great day today 
JANE DOE, GOOGLE:
I'm going to the gym later 
STEVE SMITH, FACEBOOK:
Time for people to speak 
SCHMEFF SCHMEZOS, JUNGLE:
Buy something from my online shop. You might like it
JOHN SMITH, GLOBAL HEAD OF YOUTUBE : Good morning, good 
afternoon, everyone . Before I hand over to facebook, I want to give a quick reminder of the reporting 
changes that have taken effect this filming of a tv show.  
 
BOBBY DUDE, GROUP FROM FACEBOOK:     Thanks, john smith lets talk about movies and films we watch when we are bored parents.  '''

result = re.split(r'(?m)\s+(?=^[A-Z][^a-z:]*:)', test)
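Each element produced by this split still holds the header and the speech in one string. A possible follow-up step, sketched here with a hypothetical helper named split_speakers and a shortened sample, is to cut each chunk at its first colon to get (speaker, speech) pairs. It assumes the speaker header never contains a colon itself, and it collapses line breaks inside the speech into single spaces.

```python
import re

def split_speakers(text):
    """Split a transcript into (speaker, speech) pairs.

    Uses the same split pattern as above, then cuts each chunk at its
    first colon, which is assumed to be the one that ends the header.
    """
    chunks = re.split(r'(?m)\s+(?=^[A-Z][^a-z:]*:)', text.strip())
    pairs = []
    for chunk in chunks:
        speaker, _, speech = chunk.partition(':')
        # Normalise whitespace: drop padding and join wrapped lines.
        pairs.append((speaker.strip(), ' '.join(speech.split())))
    return pairs

sample = (
    "JOHN SMITH, YOUTUBE: \n"
    "I'm having a great day today \n"
    "JANE DOE, GOOGLE:\n"
    "I'm going to the gym later \n"
    "BOBBY DUDE, GROUP FROM FACEBOOK:     Thanks, john smith \n"
    "lets talk about movies.  "
)

for speaker, speech in split_speakers(sample):
    print(f"{speaker}>> {speech}")
```

Because str.partition only cuts at the first colon, any later colon inside the speech stays part of the speech.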
