regex Pandas findall by pattern but not duplicated ones [duplicate]

beq87vna, posted 9 months ago in Other

This question already has an answer here:

Is there a way in pandas to remove duplicates from within a series? (5 answers)
Closed 6 days ago.

I need a list of all non-duplicated regex matches.
Consider the frame below:

Letter      Actions
r1          a30,a30
r2          a30,a12-rf,a15,a15
r3          0
r4          a10,a93
r5          a13

I expect:

Letter      Actions
r1          ['a30']
r2          ['a30','a12','a15']
r3          0
r4          ['a10','a93']
r5          ['a13']

I have the code below, but it returns every pattern match, and I don't want the duplicates:

import pandas as pd

df = pd.DataFrame(
    [['r1', 'a30,a30'],
     ['r2', 'a30,a12-rf,a15,a15'],
     ['r3', '0'],
     ['r4', 'a10,a93'],
     ['r5', 'a13']],
    columns=['Letter', 'Actions'])

df['Action_list'] = df['Actions'].str.findall(r'([a]\d{2})')

There is an answer here, but I need something faster than using a lambda.

4nkexdtk #1

You can use set to remove the duplicates:

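import numpy as np

# Rows that contain at least one "a<digits>" token
mask = df["Actions"].str.contains(r"a\d+", regex=True)

# De-duplicate the matches via set; rows without a match keep their original value
df["new_Actions"] = np.where(
    mask, df["Actions"].str.findall(r"a\d+").apply(set).apply(list), df["Actions"]
)
print(df)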

Prints (set is unordered, so the order inside each list may differ from the order in Actions):

  Letter             Actions      new_Actions
0     r1             a30,a30            [a30]
1     r2  a30,a12-rf,a15,a15  [a30, a15, a12]
2     r3                   0                0
3     r4             a10,a93       [a93, a10]
4     r5                 a13            [a13]
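
The question's expected output lists the matches in their original order (a30, a12, a15), which set does not guarantee. As a minimal sketch, assuming order matters, the set step can be swapped for dict.fromkeys, whose keys keep insertion order. The helper name unique_in_order is made up for this illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Letter": ["r1", "r2", "r3", "r4", "r5"],
    "Actions": ["a30,a30", "a30,a12-rf,a15,a15", "0", "a10,a93", "a13"],
})

def unique_in_order(matches):
    """Hypothetical helper: drop repeats while keeping first-seen order."""
    # dict keys preserve insertion order (Python 3.7+), unlike set
    return list(dict.fromkeys(matches))

found = df["Actions"].str.findall(r"a\d+").map(unique_in_order)

# Rows with no match produce an empty list; keep the original value there.
has_match = found.str.len() > 0
df["new_Actions"] = found.where(has_match, df["Actions"])
print(df)
```

This still calls a Python function once per row, so it addresses the ordering, not the speed concern raised in the question.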

8xiog9wr #2

Code

Apply set and list after str.findall:

df['Actions'] = df['Actions'].str.findall(r'(a\d*)')\
                  .apply(lambda x: list(set(x)))\
                  .where(lambda x: x.astype('bool'), df['Actions'])

Output:

  Letter          Actions
0     r1            [a30]
1     r2  [a30, a15, a12]
2     r3                0
3     r4       [a93, a10]
4     r5            [a13]

Example code

import pandas as pd
data1 = {'Letter': ['r1', 'r2', 'r3', 'r4', 'r5'], 
         'Actions': ['a30,a30', 'a30,a12-rf,a15,a15', '0', 'a10,a93', 'a13']}
df = pd.DataFrame(data1)
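
Since the question asks for an alternative to a per-row lambda, here is one possible sketch that keeps the de-duplication inside pandas: str.extractall produces one row per match, and a groupby on the original row index collects the unique values in order of appearance. The column name Action_list is taken from the question; whether this is actually faster depends on the data, so it is worth timing on a realistic frame before adopting it.

```python
import pandas as pd

df = pd.DataFrame({
    "Letter": ["r1", "r2", "r3", "r4", "r5"],
    "Actions": ["a30,a30", "a30,a12-rf,a15,a15", "0", "a10,a93", "a13"],
})

# One row per match, indexed by (original row label, match number).
# Column 0 holds the single unnamed capture group.
matches = df["Actions"].str.extractall(r"(a\d+)")[0]

# Unique matches per original row, in order of first appearance.
unique = matches.groupby(level=0).unique().map(list)

# Rows without any match are missing from `unique`; restore them from Actions.
aligned = unique.reindex(df.index)
df["Action_list"] = aligned.where(aligned.notna(), df["Actions"])
print(df)
```

The final where call mirrors the answers above: rows such as r3, which contain no match, keep their original string instead of an empty list.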
