Creating a new column based on a list column in pandas

v8wbuo2f  posted on 2024-01-04 in Other

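I am trying to create a new column based on the values of a list column. If a row's list contains specific strings such as high, HIGH, Important, etc., the new column should contain the value High/Important.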

id         specification
123 ['high', 'Important', 'pilot']
234 ['HIGH', 'Important', 'Baby']
543 ['important']
542 ['week']
857 ['new', 'IMPORTANT']
123 ['super_high', 'test']

The new_col I expect is:

id      specification                 new_col    
123 ['high', 'Important', 'pilot']  High/Important
234 ['HIGH', 'Important', 'Baby']   High/Important
543 ['important']                   High/Important
542 ['week'] 
857 ['new', 'IMPORTANT']            High/Important
123 ['super_high', 'test']          High/Important


Because the 'specification' column contains list values, str.contains() will not work on it directly. Is there a way to do this in pandas?

9jyewag0 #1

To match whole words, use a list comprehension and look each element up in a set:

labels = {'high', 'important'}

df['new_col'] = ['High/Important' if
                   any(x.lower() in labels for x in l)
                 else '' for l in df['specification']]

Output:

    id             specification         new_col
0  123  [high, Important, pilot]  High/Important
1  234   [HIGH, Important, Baby]  High/Important
2  543               [important]  High/Important
3  542                    [week]                
4  857          [new, IMPORTANT]  High/Important
5  123          [super_hightest]
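
For reference, the whole-word approach above can be written as a self-contained sketch. The helper name `flag_whole_words` and the constant `LABELS` are hypothetical names introduced here for illustration; the data is the input frame used by this answer.

```python
import pandas as pd

# Sample data copied from this answer's input frame.
df = pd.DataFrame({
    "id": [123, 234, 543, 542, 857, 123],
    "specification": [
        ["high", "Important", "pilot"],
        ["HIGH", "Important", "Baby"],
        ["important"],
        ["week"],
        ["new", "IMPORTANT"],
        ["super_hightest"],
    ],
})

LABELS = {"high", "important"}  # lowercase keywords (hypothetical constant name)


def flag_whole_words(specs, labels=LABELS):
    """Return 'High/Important' if any element equals a label, ignoring case."""
    return "High/Important" if any(s.lower() in labels for s in specs) else ""


df["new_col"] = [flag_whole_words(specs) for specs in df["specification"]]
print(df)
```

Because the lookup compares whole elements, 'super_hightest' is not flagged here; that is the difference from the substring variants.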


To match substrings instead:

labels = ['high', 'important']
df['new_col'] = ['High/Important' if
                   any(w in x for x in map(str.lower, l) for w in labels)
                 else '' for l in df['specification']]


Or use a regular expression match:

import numpy as np

df['new_col'] = np.where(df['specification'].apply(' '.join)
                          .str.lower()
                          .str.contains('high|important'),
                         'High/Important', '')


Output:

    id             specification         new_col
0  123  [high, Important, pilot]  High/Important
1  234   [HIGH, Important, Baby]  High/Important
2  543               [important]  High/Important
3  542                    [week]                
4  857          [new, IMPORTANT]  High/Important
5  123          [super_hightest]  High/Important
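
Another way to get the same substring result, shown here only as a sketch and not as part of the original answer, is to explode the lists into one row per element, run `str.contains` on the elements, and collapse back per original row. It assumes the frame has a unique index, since the grouping relies on the index labels repeated by `explode`.

```python
import pandas as pd

# Sample data copied from this answer's input frame.
df = pd.DataFrame({
    "id": [123, 234, 543, 542, 857, 123],
    "specification": [
        ["high", "Important", "pilot"],
        ["HIGH", "Important", "Baby"],
        ["important"],
        ["week"],
        ["new", "IMPORTANT"],
        ["super_hightest"],
    ],
})

# One row per list element; each element keeps its original row label.
exploded = df["specification"].explode()

# Case-insensitive regex test per element, then "any match" per original row.
# na=False treats missing elements (e.g. from empty lists) as non-matches.
has_match = (
    exploded.str.contains("high|important", case=False, na=False)
    .groupby(level=0)
    .any()
)

df["new_col"] = has_match.map({True: "High/Important", False: ""})
print(df)
```

Compared with joining the list into one string, this tests each element separately, so a match can never span two neighbouring elements.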


Input used:

import pandas as pd

df = pd.DataFrame({'id': [123, 234, 543, 542, 857, 123],
                   'specification': [['high', 'Important', 'pilot'],
                                     ['HIGH', 'Important', 'Baby'],
                                     ['important'],
                                     ['week'],
                                     ['new', 'IMPORTANT'],
                                     ['super_hightest']]})

kgqe7b3p #2

Alternatively, consider using a helper function together with apply to create the new column.

import re

# Helper: case-insensitive substring match across all list elements
def high_important(specs):
    # Join with a space so a match cannot span two adjacent elements
    spec_string = " ".join(specs)

    if re.search(r"high|important", spec_string, re.IGNORECASE):
        return "High/Important"
    else:
        return ""

# New column based on the existing df["specification"]
df["new_col"] = df["specification"].apply(high_important)

Output:

    id             specification         new_col
0  123  [high, Important, pilot]  High/Important
1  234   [HIGH, Important, Baby]  High/Important
2  543               [important]  High/Important
3  542                    [week]                
4  857          [new, IMPORTANT]  High/Important
5  123        [super_high, test]  High/Important
