Remove punctuation from a pandas column while preserving the original list structure

Posted by bnlyeluc on 2023-04-10 in Other

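I know how to do this when a cell holds a single list, but I need to preserve the structure when a cell holds several lists, e.g. [["I","need","to","remove","punctuations","."],[...],[...]] -> [["I","need","to","remove","punctuations"],[...],[...]]
Every method I know flattens the cell into this -> ["I","need","to","remove","punctuations",...]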

data["clean_text"] = data["clean_text"].apply(lambda x: [', '.join([c for c in s if c not in string.punctuation]) for s in x])
data["clean_text"] = data["clean_text"].str.replace(r'[^\w\s]+', '')
...

What is the best way to do this?

klr1opcd #1

Following your approach, I would just add a list comprehension wrapped in a helper function:

import string

def clean_up(lst):
    return [[w for w in sublist if w not in string.punctuation] for sublist in lst]

data["clean_text"] = [clean_up(x) for x in data["text"]]

Output:

print(data) # -- with two different columns so we can see the difference

                                                                                                    text  \
0  [[I, need, to, remove, punctuations, .], [This, is, another, list, with, commas, ,, and, periods, .]]   

                                                                                     clean_text  
0  [[I, need, to, remove, punctuations], [This, is, another, list, with, commas, and, periods]]
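
One caveat: `w not in string.punctuation` is a substring test against the punctuation string, so a multi-character token such as "..." or "?!" is kept. If that matters for your data, a possible variant checks every character of the token instead. This is only a sketch; `PUNCT`, `is_punct` and `clean_up_strict` are names made up here for illustration, and the sample list is invented.

```python
import string

# Set of single punctuation characters, for fast membership tests.
PUNCT = set(string.punctuation)

def is_punct(token):
    """Return True for a non-empty token made up only of punctuation characters."""
    return len(token) > 0 and all(ch in PUNCT for ch in token)

def clean_up_strict(lst):
    # Same nested comprehension as clean_up, so the list-of-lists shape is kept.
    # Unlike the substring test, empty-string tokens are left in place.
    return [[w for w in sublist if not is_punct(w)] for sublist in lst]

sample = [["Wait", "...", "what", "?!"], ["OK", "."]]
cleaned = clean_up_strict(sample)
print(cleaned)

# Applied to the column, under the same assumptions as the answer above:
# data["clean_text"] = data["text"].apply(clean_up_strict)
```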

carvr3hs #2

If your DataFrame is not very large, you can explode the lists into rows, filter out the rows that hold punctuation, and then group the rows back into lists.

df_ = df[['clean_text']].copy()

out = (df_.assign(g1=range(len(df)))
       .explode('clean_text', ignore_index=True)
       .explode('clean_text')
       .loc[lambda d: ~d['clean_text'].isin([',', '.'])]  # remove possible punctuation
       .groupby(level=0).agg({'clean_text': list, 'g1': 'first'})
       .groupby('g1').agg({'clean_text': list}))
print(df_)

                                                   clean_text
0  [[I, need, to, remove, punctuations, .], [Play, games, .]]

print(out)

                                             clean_text
g1
0   [[I, need, to, remove, punctuations], [Play, games]]
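
For reference, here is a self-contained sketch of the same explode/filter/group chain on more than one row, filtering against every character in `string.punctuation` rather than only ',' and '.'. The sample data, the `punct` list and the `result` variable are assumptions added for illustration.

```python
import string

import pandas as pd

# Invented sample: each cell holds a list of token lists.
df = pd.DataFrame({
    "clean_text": [
        [["I", "need", "to", "remove", "punctuations", "."], ["Play", "games", "."]],
        [["Hello", ",", "world", "!"]],
    ]
})

punct = list(string.punctuation)  # single-character punctuation tokens

out = (
    df[["clean_text"]]
    .assign(g1=range(len(df)))                  # remember the original row
    .explode("clean_text", ignore_index=True)   # one row per inner list
    .explode("clean_text")                      # one row per token; index = inner-list id
    .loc[lambda d: ~d["clean_text"].isin(punct)]
    .groupby(level=0).agg({"clean_text": list, "g1": "first"})  # rebuild inner lists
    .groupby("g1").agg({"clean_text": list})    # rebuild outer lists
)

result = out["clean_text"].tolist()
print(result)
```

Keep in mind that an inner list consisting only of punctuation disappears from the result entirely, because all of its rows are filtered out before the first groupby, and that an empty inner list explodes to NaN rather than staying empty.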
