python/pandas: How do I use fuzzywuzzy to replace misspellings in a column with the correct country names?

Posted by kadbb459 on 2021-08-20 in Java

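I have a DataFrame of about 500k rows that contains, among others, a column named country. My goal is to replace the many spelling variants in the country column with the correct country name.
For example: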

import pandas as pd

# Starting dataset:

d = {'country': ['Unites Sates', 'United state','Cnda','canada','United State', 'United sates of America','Mexio','mexico','Mejico','America','U.S.A.','UsA of A','cAnada','u. s. a. ','United States of America']}
df = pd.DataFrame(data=d)
df

                     country
0               Unites Sates #wants to replace
1               United state #wants to replace
2                       Cnda #wants to replace
3                     canada #wants to replace
4               United State #wants to replace
5    United sates of America #wants to replace
6                      Mexio #wants to replace
7                     mexico #wants to replace
8                     Mejico #wants to replace
9                    America #wants to replace
10                    U.S.A. #wants to replace
11                  UsA of A #wants to replace
12                    cAnada #wants to replace
13                 u. s. a.  #wants to replace
14  United States of America

# Expected Outcome:

d = {'country': ['United States of America','United States of America','Canada','Canada','United States of America','United States of America','Mexico','Mexico','Mexico', 'United States of America','United States of America','United States of America','Canada','United States of America','United States of America']}
df = pd.DataFrame(data=d)
df

                     country
0   United States of America #replaced
1   United States of America #replaced
2                     Canada #replaced
3                     Canada #replaced
4   United States of America #replaced
5   United States of America #replaced
6                     Mexico #replaced
7                     Mexico #replaced
8                     Mexico #replaced
9   United States of America #replaced
10  United States of America #replaced
11  United States of America #replaced
12                    Canada #replaced
13  United States of America #replaced
14  United States of America

One thing I tried was creating a DataFrame called correct_countries_df that contains the correct country names, and using it like this:

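from fuzzywuzzy import process

# For each row, keep only the matched name (element 0 of the tuple returned by extractOne)
df['country_BestMatch'] = df['country'].map(lambda x: process.extractOne(x, correct_countries_df['country'])[0])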

But I can't seem to get it to work.
Any ideas?
Thanks in advance!

9lowa7mx  #1

If your correct_countries_df looks like this:

>>> correct_countries_df

                    country
0  United States of America
1                    Canada
2                    Mexico

then your code is correct:

>>> df['country'].map(lambda x: process.extractOne(x, correct_countries_df['country'])[0])

0     United States of America
1     United States of America
2                       Canada
3                       Canada
4     United States of America
5     United States of America
6                       Mexico
7                       Mexico
8                       Mexico
9     United States of America
10    United States of America
11    United States of America
12                      Canada
13    United States of America
14    United States of America
Name: country, dtype: object
