pandas: tokenizing phrases inside tuples

h9a6wy2h · posted 2022-12-28

I have a dataset made up of tokenized tuples. My preprocessing first tokenizes the words and then normalizes slang terms. However, a slang term may expand into a phrase containing whitespace, so I tried to run a second round of tokenization but could not find a way to do it. Here is an example of my data.

firstTokenization                       normalized               secondTokenization 
0     [yes, no, cs]      [yes, no, customer service]     [yes, no, customer, service] 
1             [nlp]    [natural language processing]  [natural, language, processing] 
2         [no, yes]                        [no, yes]                        [no, yes]

I am trying to work out a way to generate the secondTokenization column. Here is the code I have so far...

from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()

def tokenization(text):
    return tokenizer.tokenize(text.split())

df['firstTokenization'] = df['content'].apply(lambda x: tokenization(x.lower()))

normalizad_word = pd.read_excel('normalisasi.xlsx')
normalizad_word_dict = {}

# Build a slang -> expansion lookup from the spreadsheet's first two columns
for index, row in normalizad_word.iterrows():
    if row[0] not in normalizad_word_dict:
        normalizad_word_dict[row[0]] = row[1]

def normalized_term(document):
    return [normalizad_word_dict[term] if term in normalizad_word_dict else term for term in document]

df['normalized'] = df['firstTokenization'].apply(normalized_term)
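For illustration, the dictionary lookup in normalized_term can be exercised on a toy mapping; the slang dictionary below is a made-up stand-in for the one loaded from normalisasi.xlsx:

```python
# Toy slang dictionary standing in for the one read from normalisasi.xlsx
slang = {'cs': 'customer service', 'nlp': 'natural language processing'}

def normalized_term(document):
    # Replace each token with its expansion when known, else keep it unchanged
    return [slang.get(term, term) for term in document]

print(normalized_term(['yes', 'no', 'cs']))
# ['yes', 'no', 'customer service']
```

Note that the expansion comes back as a single multi-word string, which is exactly why a second tokenization pass is needed.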

Answer (from 5f0d552i):

This works as long as the normalized column does not contain nested lists.
Setup:

import pandas as pd
    
df = pd.DataFrame({'firstTokenization': [['yes', 'no', 'cs'],
                                         ['nlp'],
                                         ['no', 'yes']],
                   'normalized': [['yes', 'no', 'customer service'],
                                  ['natural language processing'],
                                  ['no', 'yes']],
                   })

print(df)

Output:

firstTokenization                     normalized
0     [yes, no, cs]    [yes, no, customer service]
1             [nlp]  [natural language processing]
2         [no, yes]                      [no, yes]

The first apply splits each token on spaces, and the second apply flattens the resulting nested list, as shown in this answer.

df['secondTokenization'] = (df['normalized']
                            .apply(lambda x: [token.split(' ') for token in x])
                            .apply(lambda y: [token for sublist in y for token in sublist]))
print(df)

Output:

firstTokenization                     normalized  \
0     [yes, no, cs]    [yes, no, customer service]   
1             [nlp]  [natural language processing]   
2         [no, yes]                      [no, yes]   

                secondTokenization  
0     [yes, no, customer, service]  
1  [natural, language, processing]  
2                        [no, yes]
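As a variation on the answer above (not part of the original answer), the split-and-flatten can also be done in one pass with itertools.chain.from_iterable, which avoids building the intermediate nested list:

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame({'normalized': [['yes', 'no', 'customer service'],
                                  ['natural language processing'],
                                  ['no', 'yes']]})

# Split each token on whitespace and flatten the pieces in a single pass
df['secondTokenization'] = df['normalized'].apply(
    lambda tokens: list(chain.from_iterable(t.split() for t in tokens)))

print(df['secondTokenization'].tolist())
# [['yes', 'no', 'customer', 'service'],
#  ['natural', 'language', 'processing'],
#  ['no', 'yes']]
```

Using str.split() with no argument also handles runs of multiple spaces, whereas split(' ') would produce empty strings for those.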
