regex: Get a number from an array using a mask and a regular expression

Posted by xpszyzbs on 2023-11-20 in Other

This array has code and collection fields, where X is a mask that can stand for "any digit":

input_array = [{"code": "XXXX10", "collection": "one"}, {"code": "XXX610", "collection": "two"}, {"code": "XXXX20", "collection": "three"}]

I want a function that, given any 6-digit code such as 000710, returns the value whose code mask matches best (for example, one). Here is my attempt:

import re

def get_collection_from_code(analysis_code):
    for collection in input_array:
        actual_code = collection["code"]
        mask_to_re = actual_code.replace("X", r"[\d\D]")
        pattern = re.compile("^" + mask_to_re + "$")
        if pattern.match(analysis_code):
            print("Found collection '" + str(collection["collection"]) + "' for code: " + str(analysis_code))
            return collection["collection"]
        
res = get_collection_from_code("010610")
print(res)


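The problem is that if I enter the code 010610 (for which I want two), it returns one, because the code also matches the pattern XXXX10, which comes first in the array.
To make this clearer, these are the outputs I expect for these inputs: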

010610 > two
010010 > one
123420 > three

sr4lhrrt #1

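You can iterate over the whole collection, save the length of the non-X part of every mask that matches, and then return the longest one: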

import re

input_array = [{"code": "XXXX10", "collection": "one"}, {"code": "XXX610", "collection": "two"}, {"code": "XXXX20", "collection": "three"}]

def get_collection_from_code(analysis_code):
    results = {}
    for collection in input_array:
        actual_code = collection["code"]
        mask_to_re = actual_code.replace("X", r"[\d\D]")
        pattern = re.compile("^" + mask_to_re + "$")
        if pattern.match(analysis_code):
            # score each matching mask by its number of fixed (non-X) characters
            results[collection["collection"]] = len(actual_code.replace('X', ''))
    if len(results):
        # the mask with the most fixed characters is the most specific match
        best = sorted(results.items(), key=lambda i:i[1], reverse=True)[0]
        print("Found collection '" + str(best[0]) + "' for code: " + str(analysis_code))
        return best[0]

res = get_collection_from_code("010610")
# Found collection 'two' for code: 010610

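Note that I have saved all the matches in case you want to process them in some way. Otherwise, you can check for the "best" match on each iteration and update it.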

bvhaajcl #2

You can use a custom function to count the number of matching characters:

inpt = '000710'

# rework the list of dictionaries into a {code: collection} dict
tmp = {d['code']: d['collection'] for d in input_array}

def n_matches(A, B):
    return sum(a==b for a, b in zip(A, B))

out = tmp[max(tmp, key=lambda x: n_matches(x, inpt))]

Output:

#inpt = '000710'
'one'

#inpt = '010610'
'two'

#inpt = '123420'
'three'
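
One caveat with counting equal positions is that every mask gets a score, including masks that do not actually match the input. For example, 123499 matches none of the masks, yet max would still return one because all scores tie at zero and the first entry wins. The following is a minimal sketch, assuming the same input_array as in the question, that first keeps only the masks that really match and then prefers the one with the fewest Xs. The names mask_matches and best_collection are hypothetical helpers introduced for illustration.

```python
input_array = [
    {"code": "XXXX10", "collection": "one"},
    {"code": "XXX610", "collection": "two"},
    {"code": "XXXX20", "collection": "three"},
]

def mask_matches(mask, code):
    # hypothetical helper: X accepts any digit, every other character must be equal
    return len(mask) == len(code) and all(
        (m == "X" and c.isdigit()) or m == c for m, c in zip(mask, code)
    )

def best_collection(code):
    # keep only the masks that really match, then prefer the one with the fewest Xs
    candidates = [d for d in input_array if mask_matches(d["code"], code)]
    if not candidates:
        return None
    return min(candidates, key=lambda d: d["code"].count("X"))["collection"]

print(best_collection("010610"))  # two
print(best_collection("010010"))  # one
print(best_collection("123499"))  # None
```

Returning None for an unmatched code is a design choice of this sketch; raising an exception would work as well.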

omtl5h9j #3

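You can create variables that keep track of the best match and return the corresponding collection. In addition, you can compare the number of fixed digits in each code to prioritize the matches.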

import re

def get_collection_from_code(analysis_code):

    best_match = None
    best_match_collection = None

    for collection in input_array:
        actual_code = collection["code"]
        mask_to_re = actual_code.replace("X", r"[\d\D]")
        pattern = re.compile("^" + mask_to_re + "$")

        if pattern.match(analysis_code):
            if best_match is None or len(actual_code.replace("X", "")) > len(best_match.replace("X", "")):
                best_match = actual_code
                best_match_collection = collection["collection"]

    if best_match_collection is not None:
        print("Found collection '" + str(best_match_collection) + "' for code: " + str(analysis_code))
        return best_match_collection


vjhs03f7 #4

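Sort the array by the number of Xs in each code, so that the most specific mask comes first and the first match is the best match: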

input_array.sort(key=lambda x: x["code"].count("X"))
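
To show how this sort fits together with a first-match loop, here is a minimal sketch, assuming the input_array from the question. It replaces X with \d instead of [\d\D], which is an assumption that the masked positions only ever hold digits.

```python
import re

input_array = [
    {"code": "XXXX10", "collection": "one"},
    {"code": "XXX610", "collection": "two"},
    {"code": "XXXX20", "collection": "three"},
]

# fewest Xs first, so the most specific mask is tried first
input_array.sort(key=lambda x: x["code"].count("X"))

def get_collection_from_code(analysis_code):
    for collection in input_array:
        # assumption: X stands for exactly one digit
        pattern = collection["code"].replace("X", r"\d")
        if re.fullmatch(pattern, analysis_code):
            return collection["collection"]
    return None  # no mask matched

for c in ["010610", "010010", "123420"]:
    print(c, "=>", get_collection_from_code(c))
```

Because list.sort is stable, masks with the same number of Xs keep their original relative order.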


ldioqlga #5

Another possible option is to pick the entry with the maximum SequenceMatcher ratio:

from difflib import SequenceMatcher

def get_collection_from_code(analysis_code):
    return max(input_array,
        key=lambda d: SequenceMatcher(
            None, d["code"].replace("X", "0"), analysis_code).ratio()
        )["collection"]

Output:

for c in ["010610", "010010", "123420"]:
    print(c, "=>", get_collection_from_code(c))
    
010610 => two
010010 => one
123420 => three


Intermediates:

'XXXX10' one   '010610' 0.5
'XXX610' two   '010610' 0.83 # << highest
'XXXX20' three '010610' 0.5

'XXXX10' one   '010010' 0.83 # << highest
'XXX610' two   '010010' 0.67
'XXXX20' three '010010' 0.5

'XXXX10' one   '123420' 0.17
'XXX610' two   '123420' 0.17
'XXXX20' three '123420' 0.33 # << highest

9jyewag0 #6

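If the lookup function get_collection_from_code is going to be called many times, a more efficient approach, which performs each lookup in O(1) time complexity, is to first sort the input codes by their number of Xs, join them into an alternation pattern in which each sub-pattern is enclosed in a capture group, and create a list of collections sorted in the same order. You can then simply use the index of the capture group that matched to get the collection for a given input code: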

import re
from operator import itemgetter

input = [
    {"code": "XXXX10", "collection": "one"},
    {"code": "XXX610", "collection": "two"},
    {"code": "XXXX20", "collection": "three"}
]
sorted_input = sorted(
    map(itemgetter('code', 'collection'), input),
    key=lambda t: t[0].count('X')
)
code_pattern = re.compile(
    '|'.join(
        f"({code.replace('X', '.')})"
        for code, _ in sorted_input
    )
)
collections = [collection for _, collection in sorted_input]

def get_collection_from_code(analysis_code):
    if match := code_pattern.fullmatch(analysis_code):
        return collections[match.lastindex - 1] # capture group is 1-based

So that:

print(get_collection_from_code('010610'))
print(get_collection_from_code('010010'))
print(get_collection_from_code('123420'))


Output:

two
one
three


Demo: https://ideone.com/MLFmSY
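
To make the role of the capture-group index more concrete, the following sketch rebuilds the combined pattern and prints it together with the lastindex of a match. It assumes the same three entries as above; the names rows and sorted_rows are hypothetical and are only used here to avoid shadowing the built-in input.

```python
import re

rows = [
    {"code": "XXXX10", "collection": "one"},
    {"code": "XXX610", "collection": "two"},
    {"code": "XXXX20", "collection": "three"},
]

# sorted() is stable, so masks with equal X counts keep their original order
sorted_rows = sorted(rows, key=lambda d: d["code"].count("X"))
code_pattern = re.compile(
    "|".join(f"({d['code'].replace('X', '.')})" for d in sorted_rows)
)
collections = [d["collection"] for d in sorted_rows]

print(code_pattern.pattern)  # (...610)|(....10)|(....20)

match = code_pattern.fullmatch("010610")
# the first alternative matched, so lastindex is 1 and maps to collections[0]
print(match.lastindex, collections[match.lastindex - 1])  # 1 two
```

Since the alternatives are tried from left to right, placing the masks with the fewest Xs first is what makes the most specific mask win.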
