How to fuzzy match two lists in Python

Posted by fquxozlt on 2022-12-02 in Python


I have two lists: ref_list and inp_list. How can I use FuzzyWuzzy to match the input list against the reference list?

import pandas as pd

inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA  MATT NICKY'],
                        columns=['Names'])

ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE',
                         'HAIMBILI SEUN', 'MUTESI JOHN KUTALIKA',
                         'PETERS WILSON MARIO', 'SHEETEKELA  MATT NICKY MBILI'],
                        columns=['Names'])

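After some research, I adapted some code I found on the internet. The problem is that it only works well on small samples. In my case, inp_list and ref_list have about 29k and 18k entries respectively, and the code takes more than a day to run.
Here is the code. First, a helper function is defined.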

from fuzzywuzzy import fuzz

def match_term(term, inp_list, min_score=0):
    # -1 score in case I don't get any matches
    max_score = -1

    # return empty for no match
    max_name = ''

    # iterate over all names in the other list
    for term2 in inp_list:
        # find the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)

        # check that the score is above the threshold and better than the best so far
        if score > min_score and score > max_score:
            max_name = term2
            max_score = score
    return (max_name, max_score)

# list of dicts for easy DataFrame creation
dict_list = []

# iterate over the input names (the 'Names' column, not the DataFrame itself,
# which would only yield its column labels)
for name in inp_list['Names']:
    # use the function defined above to find the best match, with a chosen threshold
    match = match_term(name, ref_list['Names'], 94)

    # new dict for storing data
    dict_ = {}
    dict_.update({'passenger_name': name})
    dict_.update({'match_name': match[0]})
    dict_.update({'score': match[1]})

    dict_list.append(dict_)

Where can this code be improved so that it runs smoothly, and perhaps avoids re-evaluating items that have already been evaluated?

python

Source: https://stackoverflow.com/questions/62790165/how-to-fuzzy-match-two-lists-in-python

2 Answers


kcwpcxri #1

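You could try vectorizing the operation instead of evaluating the scores in a loop.
Create a df where the first column, ref, is ref_list and the second column, inp, is each name in inp_list. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). In the end, you get the best-matching name in ref_list, along with its score, for each name in inp_list.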
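
The idea of computing one best match per input name, without re-scoring names that were already evaluated, can be sketched roughly as follows. This is a minimal sketch rather than the answer's own code: it uses only the standard library, with difflib.SequenceMatcher applied to sorted tokens as a stand-in for fuzz.token_sort_ratio (so scores may differ from FuzzyWuzzy's), and the helper names token_sort_key and make_matcher are hypothetical.

```python
from difflib import SequenceMatcher
from functools import lru_cache

def token_sort_key(name):
    """Lowercase, split on whitespace, sort the tokens, and rejoin them."""
    return " ".join(sorted(name.lower().split()))

def make_matcher(ref_names, min_score=0):
    """Return a cached function mapping a name to (best_ref_name, score)."""
    # Normalise each distinct reference name once, up front.
    ref_keys = [(ref, token_sort_key(ref)) for ref in dict.fromkeys(ref_names)]

    @lru_cache(maxsize=None)  # repeated input names are scored only once
    def best_match(name):
        key = token_sort_key(name)
        best_name, best_score = "", -1
        for ref, ref_key in ref_keys:
            # ratio() is in [0, 1]; scale to 0-100 (assumed stand-in for FuzzyWuzzy scores).
            score = round(100 * SequenceMatcher(None, key, ref_key).ratio())
            if score > min_score and score > best_score:
                best_name, best_score = ref, score
        return best_name, best_score

    return best_match

inp_names = ["ADAMS SEBASTIAN", "HAIMBILI SEUN", "HAIMBILI SEUN", "PETERS MARIO"]
ref_names = ["ADAMS SEBASTIAN HAUSIKU", "HAIMBILI MIKE", "HAIMBILI SEUN"]

best_match = make_matcher(ref_names, min_score=60)
results = [(name, *best_match(name)) for name in inp_names]
for row in results:
    print(row)
```

Caching only helps when the input contains repeated names; the work per distinct name is still proportional to the size of the reference list.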

2022-12-02

n7taea2i #2

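The approach you are using is computationally demanding for a large number of string pairs. As an alternative to fuzzywuzzy, you could try a library called string-grouper, which uses a faster TF-IDF approach together with a cosine similarity measure to find similar words. For example: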

import random, string, time
import pandas as pd
from string_grouper import match_strings

alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet)-1

# build two lists of 5,000 random six-letter strings each
random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]

# wrap the lists in pandas Series before passing them to match_strings
series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

# time the matching of series_1 against series_2
t_1 = time.time()
matches = match_strings(series_1, series_2,
                        min_similarity=0.6)
t_2 = time.time()
print(t_2 - t_1)  # elapsed time in seconds
print(matches)

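It takes less than a second to make 25,000,000 comparisons! For a more useful test of the library, see https://bergvca.github.io/2017/10/14/super-fast-string-matching.html, which claims:
"Using this approach, it is possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop."
To tune the matching algorithm further, look at the **kwargs arguments that can be passed to the match_strings function above.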
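
To illustrate the idea behind combining TF-IDF with cosine similarity, here is a minimal pure-Python sketch. It is not string-grouper's implementation, and it will be far slower than a library built on sparse matrix operations. The function names, the trigram size, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def tfidf_vectors(strings, n=3):
    """Build an L2-normalised TF-IDF vector (as a dict) for each string."""
    grams = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(strings)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumed variant): rarer n-grams weigh more.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(g, 0.0) for g, w in u.items())

inp_names = ["MUTESI JOHN", "PETERS WILSON"]
ref_names = ["MUTESI JOHN KUTALIKA", "PETERS WILSON MARIO", "HAIMBILI MIKE"]

# Fit the n-gram vocabulary and IDF weights on both lists together.
vectors = tfidf_vectors(inp_names + ref_names)
inp_vecs, ref_vecs = vectors[:len(inp_names)], vectors[len(inp_names):]

matches = {}
for name, vec in zip(inp_names, inp_vecs):
    scores = [cosine(vec, ref_vec) for ref_vec in ref_vecs]
    best = max(range(len(ref_names)), key=scores.__getitem__)
    matches[name] = (ref_names[best], scores[best])
    print(name, "->", ref_names[best], round(scores[best], 3))
```

Because each string becomes a sparse vector, names that share no n-grams contribute nothing to the similarity, which is what lets matrix-based implementations skip most of the pairwise comparisons.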

2022-12-02
