Fuzzy process.extractOne gives different results

wko9yo5t  posted on 2022-10-23  in  Other

I have a DataFrame and I am trying to map the values of one column to the values in a set.
The DataFrame is:

Name   CallType    Location
ABC     IN          SFO
DEF     OUT         LHR
PQR     INCOMING    AMS
XYZ     OUTGOING    BOM
TYR     A_IN        DEL
OMN     A_OUT       DXB

I have a constant set of call types, and each CallType value should be replaced by the matching value from that set:

call_type = {"IN", "OUT"}

Desired DataFrame:

Name   CallType    Location
ABC     IN         SFO
DEF     OUT        LHR
PQR     IN         AMS
XYZ     OUT        BOM
TYR     IN         DEL
OMN     OUT        DXB

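I wrote the code below to do the mapping, but process.extractOne sometimes returns IN for OUTGOING (which is wrong) and sometimes returns OUT for OUTGOING (which is correct).
Here is my code: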

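import pandas as pd
# Assumption: the question does not name the library; thefuzz exposes the same call
from fuzzywuzzy import process

data=[('ABC','IN','SFO'),
('DEF','OUT','LHR'),
('PQR','INCOMING','AMS'),
('XYZ','OUTGOING','BOM'),
('TYR','A_IN','DEL'),
('OMN','A_OUT','DXB')]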

df = pd.DataFrame(data,
                columns =['Name', 'CallType',
                'Location'])

call_types=set(['IN','OUT'])

df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, list(call_types))[0])

total_rows=len(df)

for row_no in range(total_rows):
        row=df.iloc[row_no]
        print(row)  # Sometimes OUTGOING is mapped to OUT and sometimes to IN. Shouldn't the result be consistent?

I am not sure whether there is a better way to do this. If I am missing something, can anyone suggest an approach?


8zzbczxx1#

Series.str.extract looks like a good fit here:

df['CallType'] = df.CallType.str.extract(r'(OUT|IN)')

print(df)

  Name CallType Location
0  ABC       IN      SFO
1  DEF      OUT      LHR
2  PQR       IN      AMS
3  XYZ      OUT      BOM
4  TYR       IN      DEL
5  OMN      OUT      DXB

Alternatively, if you want to use call_types explicitly, you can do the following:

df['CallType'] = df.CallType.str.extract(fr"({'|'.join(call_types)})")

# same result
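
For readers who want to see what the pattern does outside pandas, here is a minimal sketch using only the standard library `re` module. The helper name `extract_call_type` and the sample value `MISSED` are hypothetical and only serve the illustration. The sketch also escapes each label and sorts the alternatives, which is a precaution I am assuming is useful if the labels ever contain regex metacharacters or if one label is a prefix of another.

```python
import re

call_types = {"IN", "OUT"}

# Longest labels first, then alphabetical, so the pattern is built the same
# way on every run regardless of the set's iteration order.
ordered = sorted(call_types, key=lambda c: (-len(c), c))
pattern = re.compile("|".join(re.escape(c) for c in ordered))


def extract_call_type(value):
    """Return the first known call type found inside value, or None."""
    match = pattern.search(value)
    return match.group(0) if match else None


raw_values = ["IN", "OUT", "INCOMING", "OUTGOING", "A_IN", "A_OUT"]
extracted = [extract_call_type(v) for v in raw_values]
print(extracted)
```

The search returns the leftmost match, so OUTGOING yields OUT even though it also contains IN further to the right. A value with no known label gives None here, which is roughly comparable to the NaN that Series.str.extract produces for non-matching rows.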

whitzsjs2#

One possible solution is to use difflib.get_close_matches:

import difflib

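# cutoff=0 keeps weak matches such as 'INCOMING' -> 'IN', which the default cutoff of 0.6 rejects
df['CallType'] = df['CallType'].apply(
    lambda x: difflib.get_close_matches(x, call_type, n=1, cutoff=0)[0])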

Output:

  Name CallType Location
0  ABC       IN      SFO
1  DEF      OUT      LHR
2  PQR       IN      AMS
3  XYZ      OUT      BOM
4  TYR       IN      DEL
5  OMN      OUT      DXB
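
As for why the original result changed from run to run: a plausible explanation, which I have not verified against the fuzzy-matching library itself, is that both IN and OUT are substrings of OUTGOING and can receive the same score, so whichever candidate happens to come first wins. The iteration order of a set of strings can differ between interpreter runs because string hashing is randomized by default, and `list(call_types)` inherits that order. The sketch below uses only `difflib.SequenceMatcher` and sorts the candidates first so that ties are always broken the same way. The helper name `closest_call_type` is hypothetical.

```python
from difflib import SequenceMatcher

CALL_TYPES = {"IN", "OUT"}  # same values as call_type in the question


def closest_call_type(value, candidates=CALL_TYPES):
    """Return the candidate most similar to value (hypothetical helper)."""
    # Sort first: max() keeps the first of several equal scores, so a
    # fixed order makes the tie-break independent of set iteration order.
    ordered = sorted(candidates)
    return max(ordered, key=lambda c: SequenceMatcher(None, value, c).ratio())


raw_values = ["IN", "OUT", "INCOMING", "OUTGOING", "A_IN", "A_OUT"]
mapped = [closest_call_type(v) for v in raw_values]
print(mapped)
```

With this similarity measure the two candidates never tie on the sample data, but keeping the candidates in a sorted list (or a plain list literal) rather than a set is a cheap way to get reproducible behaviour with any scorer.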

Another possible solution:

import numpy as np
df['CallType'] = np.where(df['CallType'].str.contains('OUT'), 'OUT', 'IN')

Output:


# same
