Regex: match multiple strings from lists and create a new row for each match

nzk0hqpo  posted on 2023-10-22  in  Other

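I have a DataFrame with text in one column, and I use a regex-formatted string to check whether it contains any matches from three lists. However, when there are multiple matches from list_1, I want to create a duplicate row for each match. Note that the matches must be consecutive, and the elements from list_2 and list_3 are optional.
Below is an example together with the desired output.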

    list_1 = ['chest', 'test', 'west', 'nest']
    list_2 = ['mike', 'bike', 'like', 'pike']
    list_3 = ['hay', 'day', 'may', 'say']

Sample DF:

| text | match_1 | match_2 | match_3 |
| -- | -- | -- | -- |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |

Desired output:

| text | match_1 | match_2 | match_3 |
| -- | -- | -- | -- |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | test | mike | NaN |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | west | NaN | NaN |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | bike | may |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | NaN | NaN |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | like | NaN |

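I hope my description above is not too confusing. My current approach cannot capture multiple matches from list_1 (as in the example above) together with the optional, consecutive matches from list_2 and list_3.
Thanks for all your efforts!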

yhived7q  #1

You can build a regex programmatically from your word lists, using nested optional parts so that the second, third, etc. matches may be missing:

    list_1 = ['chest', 'test', 'west', 'nest']
    list_2 = ['mike', 'bike', 'like', 'pike']
    list_3 = ['hay', 'day', 'may', 'say']
    word_list = [list_1, list_2, list_3]
    # One named group per list, each later group nested inside an optional (?:...)? part.
    pattern = (
        r'\b'
        + r'(?:\b\s+'.join(fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list))
        + r'\b'
        + ''.join(')?' for _ in range(1, len(word_list)))
    )

For your sample data, this gives:

    \b(?P<match_1>chest|test|west|nest)(?:\b\s+(?P<match_2>mike|bike|like|pike)(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?

You can see this in action on regex101.
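
To check the generated pattern outside pandas, a minimal sketch with the standard `re` module can list what each match captures. The sample string is the first row of the question's data; groups that do not take part in a match come back as `None` here (pandas reports them as NaN).

```python
import re

# The pattern exactly as generated for the three sample lists.
pattern = (
    r"\b(?P<match_1>chest|test|west|nest)"
    r"(?:\b\s+(?P<match_2>mike|bike|like|pike)"
    r"(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?"
)

text = " zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz "

# One tuple per non-overlapping match, in group order.
rows = [
    (m["match_1"], m["match_2"], m["match_3"])
    for m in re.finditer(pattern, text)
]

for row in rows:
    print(row)
# ('chest', 'bike', 'day')
# ('test', 'mike', None)
# ('west', None, None)
```
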
You can then use that regex with extractall to find all the matches in each text value, and join the result back to the original column.

    out = df[['text']].join(
        df['text'].str.extractall(pattern)
        .droplevel(1)
    ).reset_index(drop=True)

For your sample data, it gives the following result:

                                                    text match_1 match_2 match_3
    0  zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
    1  zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
    2  zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
    3  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
    4  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
    5  ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
    6  ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
    7  ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN
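
To reproduce this table end to end, here is a self-contained sketch. It assumes pandas is installed and that the DataFrame is called `df` with a single `text` column; the sample strings are the ones used elsewhere in this thread.

```python
import pandas as pd

list_1 = ["chest", "test", "west", "nest"]
list_2 = ["mike", "bike", "like", "pike"]
list_3 = ["hay", "day", "may", "say"]
word_list = [list_1, list_2, list_3]

# Nested optional groups: match_2 only right after match_1, match_3 only right after match_2.
pattern = (
    r"\b"
    + r"(?:\b\s+".join(
        fr"(?P<match_{i + 1}>{'|'.join(w)})" for i, w in enumerate(word_list)
    )
    + r"\b"
    + ")?" * (len(word_list) - 1)
)

df = pd.DataFrame({"text": [
    " zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz ",
    " aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa ",
    " ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like ",
]})

# extractall returns one row per match; droplevel(1) keeps only the original row index
# so the join repeats each text once per match.
out = df[["text"]].join(
    df["text"].str.extractall(pattern).droplevel(1)
).reset_index(drop=True)

print(out[["match_1", "match_2", "match_3"]])
```
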

Note that using variables such as list_1, list_2, ... is not good programming practice; you should use a list of lists instead (like word_list above).
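
Following that advice, the pattern construction can be wrapped in a small function that takes the list of lists directly. This is only a sketch: `build_pattern` is a hypothetical helper name, and the `re.escape` call is an addition that is not in the answer above; it makes words containing regex metacharacters match literally.

```python
import re

def build_pattern(word_lists):
    """Build the nested-optional pattern from a list of word lists (hypothetical helper)."""
    groups = [
        "(?P<match_{}>{})".format(i, "|".join(re.escape(w) for w in words))
        for i, words in enumerate(word_lists, start=1)
    ]
    # Start from the innermost (last) group and wrap outward,
    # so every group after the first is optional.
    pattern = groups[-1] + r"\b"
    for group in reversed(groups[:-1]):
        pattern = group + r"(?:\b\s+" + pattern + ")?"
    return r"\b" + pattern

word_list = [
    ["chest", "test", "west", "nest"],
    ["mike", "bike", "like", "pike"],
    ["hay", "day", "may", "say"],
]
pattern = build_pattern(word_list)
print(pattern)
```

For the three sample lists this produces the same pattern string as shown earlier, and it keeps working if a fourth list is appended to `word_list`.
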

hof1towb  #2

Example

    import pandas as pd

    data1 = {'text': [' zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz ',
                      ' aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa ',
                      ' ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like ']}
    df1 = pd.DataFrame(data1)

df1

                                                    text
    0  zzz zzz zz chest bike day zzzz z test mike zz...
    1  aaa aa aaa a nest aa aaaa aaa nest bike may a...
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...

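Step 1

First, build the patterns.

    # Assumes list_1, list_2 and list_3 are defined as module-level (global) variables.
    pat_list = ['(?P<match_{}>{})'.format(i, '|'.join(globals()["list_%i" % i])) for i in range(1, 4)]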

pat_list

    ['(?P<match_1>chest|test|west|nest)',
     '(?P<match_2>mike|bike|like|pike)',
     '(?P<match_3>hay|day|may|say)']

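I used a for loop (inside a list comprehension) to pull the values from list_1, list_2, and list_3. If your setup is different, you can also create the patterns manually.

Step 2

Next, extract the patterns from the "text" column of df1 and store the resulting DataFrame as df2.

    # Alternation of the three named groups: each hit fills exactly one column.
    df2 = df1['text'].str.extractall('|'.join(pat_list)).droplevel(1)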

df2

      match_1 match_2 match_3
    0   chest     NaN     NaN
    0     NaN    bike     NaN
    0     NaN     NaN     day
    0    test     NaN     NaN
    0     NaN    mike     NaN
    0    west     NaN     NaN
    1    nest     NaN     NaN
    1    nest     NaN     NaN
    1     NaN    bike     NaN
    1     NaN     NaN     may
    2    test     NaN     NaN
    2     NaN    like     NaN
    2     NaN     NaN     hay
    2    west     NaN     NaN
    2    west     NaN     NaN
    2     NaN    like     NaN

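Step 3

Collapse the values of df2 in order and join the result with df1.

    # Start a new group at every match_1 hit within each original row.
    grp = df2['match_1'].notna().groupby(df2.index).cumsum()
    # Keep the first non-null value per column in each group.
    df3 = df2.groupby([df2.index, grp]).first().droplevel(1)
    # Attach the collapsed matches to the original text.
    out = df1[['text']].join(df3)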

out

                                                    text match_1 match_2 match_3
    0  zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
    0  zzz zzz zz chest bike day zzzz z test mike zz...    test    mike    None
    0  zzz zzz zz chest bike day zzzz z test mike zz...    west    None    None
    1  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    None    None
    1  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    west    None    None
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like    None
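
One thing to be aware of with this approach: the three groups are combined with a plain alternation, so each hit is found independently and Step 3 attaches it to the most recent match_1 hit. Adjacency is therefore not enforced. The plain-Python sketch below imitates the extract-then-group logic for a single string. It is an approximation for illustration, not the pandas code itself, and `group_hits` is a hypothetical name.

```python
import re

# Same alternation as '|'.join(pat_list) for the sample lists.
pat = (
    r"(?P<match_1>chest|test|west|nest)"
    r"|(?P<match_2>mike|bike|like|pike)"
    r"|(?P<match_3>hay|day|may|say)"
)

def group_hits(text):
    """Collapse hits into rows: a match_1 hit opens a new row, and the
    first later hit of each other group fills that row."""
    rows = []
    for m in re.finditer(pat, text):
        name = m.lastgroup  # the one named group that matched this hit
        if name == "match_1" or not rows:
            rows.append({"match_1": None, "match_2": None, "match_3": None})
        if rows[-1][name] is None:
            rows[-1][name] = m.group()
    return rows

# 'bike' is attached to 'chest' even though 'zzz' sits between them.
print(group_hits("chest zzz bike"))
# [{'match_1': 'chest', 'match_2': 'bike', 'match_3': None}]
```

If the matches really must be consecutive, a single pattern with nested optional groups avoids this, because the follow-up words are only captured when they come directly after the previous one.
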
l2osamch  #3

You can programmatically build a regex to use with str.extractall:

    lists = [list_1, list_2, list_3]
    pats = [f"(?P<match_{i}>{'|'.join(words)})" for i, words in enumerate(lists, start=1)]
    # ['(?P<match_1>chest|test|west|nest)',
    #  '(?P<match_2>mike|bike|like|pike)',
    #  '(?P<match_3>hay|day|may|say)']
    pat = pats[-1]
    # Wrap from the last group outward so that each later group is optional.
    for p in pats[-2::-1]:
        pat = f'{p}(?: +{pat})?'
    # '(?P<match_1>chest|test|west|nest)(?: +(?P<match_2>mike|bike|like|pike)(?: +(?P<match_3>hay|day|may|say))?)?'
    out = df['text'].str.extractall(pat).droplevel(1)

Output:

      match_1 match_2 match_3
    0   chest    bike     day
    0    test    mike     NaN
    0    west     NaN     NaN
    1    nest     NaN     NaN
    1    nest    bike     may
    2    test    like     hay
    2    west     NaN     NaN
    2    west    like     NaN

regex demo
To join the result to the original DataFrame:

    out = df.join(df['text'].str.extractall(pat).droplevel(1))

Output:

                                                    text match_1 match_2 match_3
    0  zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
    0  zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
    0  zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
    1  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
    1  aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
    2  ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN
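
Note that this pattern has no word-boundary anchors, so a list word embedded in a longer word also matches. The sketch below uses the standard `re` module to show the difference; wrapping the pattern in `\b` is a suggested variation, not part of the answer above, and the test sentence is made up for illustration.

```python
import re

lists = [
    ["chest", "test", "west", "nest"],
    ["mike", "bike", "like", "pike"],
    ["hay", "day", "may", "say"],
]
pats = [f"(?P<match_{i}>{'|'.join(words)})" for i, words in enumerate(lists, start=1)]
pat = pats[-1]
for p in pats[-2::-1]:
    pat = f"{p}(?: +{pat})?"

text = "a contest near the nest bike"

# Without boundaries, 'test' inside 'contest' counts as a hit.
loose = [m.group() for m in re.finditer(pat, text)]
# With \b on both sides, only whole words are accepted.
strict = [m.group() for m in re.finditer(rf"\b(?:{pat})\b", text)]

print(loose)   # ['test', 'nest bike']
print(strict)  # ['nest bike']
```
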