regex: Why is part of the original string lost when a string is split into a list of substrings without removing the separators?

import re
from itertools import chain

def identification_of_nominal_complements(input_text):

    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output

    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)

    return input_text

# example 1: works as expected
input_text = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande. He ((VERB)visto) que la maceta de la señora de rojo y a ((PERS)Lucila) ((VERB)es) grande."

# example 2: fails with re.error (unbalanced parenthesis)
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"

input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

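Why does the function strip the ((PERS) part from some elements of substrings_with_nouns_and_their_modifiers_list in example 2, but not in example 1? Because of this, the resulting elements have unbalanced parentheses, which later raises re.error: unbalanced parenthesis, specifically on the line that calls re.compile().
For example 1 the output is correct: no ((PERS) is stripped unnecessarily, so there is no unbalanced parenthesis error.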

['la maceta de la señora de rojo', 'la maceta de la señora de rojo', 'a ((PERS)Lucila)']

'He ((VERB)visto) ((PERS)la maceta de la señora de rojo) ((VERB)es) grande. He ((VERB)visto) que ((PERS)la maceta de la señora de rojo) y a ((PERS)Lucila) ((VERB)es) grande.'

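In example 2 this is where the problem shows up. The function that processes the string is the same, but for some reason the substring ((PERS) is removed from some elements of substrings_with_nouns_and_their_modifiers_list. That triggers the unbalanced parenthesis error in re.compile(), because some substrings now contain a ) without a matching (, the ((PERS) having been removed.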

['los viejos gabinetes)', 'los viejos gabinetes)', 'los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', 'los candelabros) son brillantes los candelabros', 'los candelabros)']

Traceback (most recent call last):
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 56

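If identification_of_nominal_complements() worked correctly, these are the outputs I should get when passing it the example 2 string. Keeping every ((PERS) intact would avoid the unbalanced parenthesis error in re.compile().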

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'

What should I change in identification_of_nominal_complements() so that the example 2 string no longer raises the unbalanced parenthesis error and produces the correct output?

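"Why does the function strip the ((PERS) part from some elements in example 2 ..." Because the pattern can only start matching at the article (los, las, lo, la), so a ((PERS) attached directly in front of the article is left outside the captured group. The prefix is already missing after re.findall(), before re.split() runs. Adding [^\s]* at the beginning of the pattern lets the match include it: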

pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
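
To see where the prefix disappears, here is a minimal sketch that runs both patterns on shortened sentences taken from the examples. The variable names (original, widened, found_original, found_widened, split_parts) are made up for this illustration; the patterns and the separator are the ones from the question and from the line above.

```python
import re

text = "hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo"

original = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
widened = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

# The original pattern can only begin at the article itself, so the
# "((PERS)" glued to the front of "los" never becomes part of the capture.
found_original = re.findall(original, text)
print(found_original)  # ['los viejos gabinetes)']

# [^\s]* allows the match to begin at the start of the whitespace-delimited
# token that contains the article.
found_widened = re.findall(widened, text)
print(found_widened)  # ['((PERS)los viejos gabinetes)']

# re.split() only consumes the separator. Because the separator contains a
# capturing group, the group's text ('' here) is also returned, which is
# what the filter() step in the function removes afterwards.
separator_elements = r"\s*(?:,|(,|)\s*y)\s*"
split_parts = re.split(separator_elements,
                       "la maceta de la señora de rojo y a ((PERS)Lucila)")
print(split_parts)  # ['la maceta de la señora de rojo', '', 'a ((PERS)Lucila)']
```

Example 1 is unaffected only because its articles are not directly preceded by ((PERS). Note also that the separator has no word boundary around y, which seems to be why "ya que" ends up as "a que" in the list for example 2.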

With this change, the result is:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
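
A related point, sketched here as a side note rather than as part of the fix above: the list elements are joined into a regular expression without escaping, so their parentheses are read as regex syntax. The names fragment, balanced, compiled_ok and literal are hypothetical and exist only in this example.

```python
import re

# An element as produced by the original pattern: a ")" with no matching "(".
fragment = "los viejos gabinetes)"

try:
    re.compile(rf"({fragment})")
    compiled_ok = True
except re.error as exc:
    # The stray ")" closes a group that was never opened.
    compiled_ok = False
    print("re.error:", exc)

# With balanced parentheses the pattern compiles, but "((PERS)" is parsed
# as groups, so it does not match the literal tagged text.
balanced = "((PERS)los candelabros)"
print(re.search(balanced, "((PERS)los candelabros)"))  # None

# re.escape() turns the element into a literal pattern.
literal = re.compile(re.escape(balanced))
print(literal.search("x ((PERS)los candelabros) y").group(0))
# ((PERS)los candelabros)
```

This may be why the output above is identical to the input: the unescaped alternatives compile but probably match nothing. Whether to escape them inside the full function is a separate decision, since literal matching would make the already tagged phrases match and they could be wrapped a second time.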
