pandas: how to join two words when the first one is one of certain words, in sentences stored in a DataFrame

eh57zj3b  posted on 2023-02-17  in Other

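I have a list of tweets that contain negation words such as "not", "never", "rarely".
I want to convert "not nice" to "not_nice" (joined with an underscore).
How can I join every "not" in the tweets to the word that follows it?
I tried the following, but it changes nothing; the sentences stay exactly as they were: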

def combine(negation_words, word_scan):
    if type(negation_words) != list:
        negation_words = [negation_words]  
    n_index = []
    
    for i in negation_words:
        index_replace = [(m.end(0)) for m in re.finditer(i,word_scan)]
        n_index += index_replace
    for rep in n_index:
        letters = [x for x in word_scan]
        letters[rep] = "_"
        word_scan = "".join(letters)
    return word_scan
negation_words = ["no", "not"]
word_scan = df
combine(negation_words, word_scan)
df['clean'] = df['tweets'].apply(lambda x: combine(str(x), word_scan))
df

oalqel3c  #1

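You can use Series.str.replace (or re.sub on a single string) with a regex that finds any word from the negation_words list followed by whitespace and replaces that whitespace with an underscore.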

import re

negation_words = ["no", "not"]

escaped_words = "|".join(re.escape(word) for word in negation_words)
print(repr(escaped_words))
# 'no|not'

# \b anchors the match at the start of a word, so "piano keys" is left alone
regex = fr"\b({escaped_words})\s+"
print(repr(regex))
# '\\b(no|not)\\s+'
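Before applying the pattern to a whole column, it can help to try it on a few single strings with re.sub. The sketch below is only an illustration: the sample strings and the names `loose` and `strict` are made up for this example. It compares a pattern without a leading word boundary against one that starts with `\b`, to show how each treats a word that merely ends in a negation word.

```python
import re

negation_words = ["no", "not"]
alternation = "|".join(re.escape(w) for w in negation_words)

# Without \b the pattern can match the tail of a longer word ("piano").
loose = re.compile(rf"({alternation})\s+", re.IGNORECASE)
# With \b the negation word has to start at a word boundary.
strict = re.compile(rf"\b({alternation})\s+", re.IGNORECASE)

samples = ["not nice", "piano keys", "Not  again"]

loose_out = [loose.sub(r"\1_", s) for s in samples]
strict_out = [strict.sub(r"\1_", s) for s in samples]

print(loose_out)   # ['not_nice', 'piano_keys', 'Not_again']
print(strict_out)  # ['not_nice', 'piano keys', 'Not_again']
```

Note that `\s+` consumes the whole run of whitespace, so a double space also collapses to a single underscore.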

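Now call Series.str.replace with case=False for case-insensitive matching. Passing regex=True explicitly is needed on pandas 2.0 and later, where the pattern is otherwise treated as a literal string:

import pandas as pd

df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})

df['clean'] = df['tweets'].str.replace(regex, r'\1_', case=False, regex=True)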

This gives:

                tweets                clean
0      this is a tweet      this is a tweet
1  this is not a tweet  this is not_a tweet
2                   no                   no
3        Another tweet        Another tweet
4    Not another tweet    Not_another tweet
5            Tweet not            Tweet not
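If you would rather keep the `apply`-based shape from the question, the `combine` helper could be reworked roughly as below. This is a sketch, not the asker's original logic: it swaps the index arithmetic for a single re.sub call, and the sample tweets are invented. Note that in the question the call passes `str(x)` in the `negation_words` position and the whole DataFrame as the text to scan, so the arguments appear to be swapped; here the word list comes first and one tweet at a time comes second.

```python
import re

import pandas as pd


def combine(negation_words, text):
    """Join each negation word to the word that follows it with "_"."""
    if not isinstance(negation_words, list):
        negation_words = [negation_words]
    alternation = "|".join(re.escape(w) for w in negation_words)
    # \b: the negation word must start a word; \s+: it must be followed by whitespace.
    pattern = re.compile(rf"\b({alternation})\s+", flags=re.IGNORECASE)
    return pattern.sub(r"\1_", text)


negation_words = ["no", "not"]
df = pd.DataFrame({"tweets": ["this is not nice", "No way", "piano keys"]})

# Word list first, one tweet (as a string) second.
df["clean"] = df["tweets"].apply(lambda x: combine(negation_words, str(x)))

print(df["clean"].tolist())
# ['this is not_nice', 'No_way', 'piano keys']
```

For large columns the vectorized Series.str.replace call is usually the simpler choice; the function form is mainly useful when the same cleaning step is also needed on plain strings.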
