在NumPy数组中搜索序列

ej83mcc0  于 2023-10-19  发布在  其他
关注(0)|答案(3)|浏览(150)

假设我有以下数组:

array([2, 0, 0, 1, 0, 1, 0, 0])

我如何得到我有值序列出现的索引:[0,0]?因此,这种情况下的预期输出为:[1,2,6,7]

编辑:

1)请注意,[0,0]只是一个序列。它可以是[0,0,0][4,6,8,9][5,2,0],任何东西。
2)如果我的数组被修改为:array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0]),则具有相同序列的[0,0]的预期结果将是[1,2,3,4,8,9]
我在找一个NumPy的捷径。

ndh0cuux

ndh0cuux1#

嗯,这基本上是一个template-matching problem,在图像处理中经常出现。这篇文章中列出了两种方法:基于NumPy和OpenCV(cv 2)。

**方法#1:**使用NumPy,可以在输入数组的整个长度上创建一个2D滑动索引数组。因此,每一行将是元素的滑动窗口。接下来,将每行与输入序列匹配,这将为向量化解决方案带来broadcasting。我们查找所有True行,这些行表明它们是完美匹配的行,因此将是匹配的起始索引。最后,使用这些索引,创建一个索引范围,扩展到序列的长度,以给予我们所需的输出。实施将是-

def search_sequence_numpy(arr,seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------    
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------    
    Output : 1D Array of indices in the input array that satisfy the 
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any() >0:
        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
    else:
        return []         # No match found

**方法#2:**在OpenCV(cv 2)中,我们有一个针对template-matching的内置函数:cv2.matchTemplate。使用这个,我们将有开始匹配的索引。其余步骤与前一种方法相同。下面是cv2的实现:

from cv2 import matchTemplate as cv2m

def search_sequence_cv2(arr,seq):
    """ Find sequence in an array using cv2.
    """

    # Run a template match with input sequence as the template across
    # the entire length of the input array and get scores.
    S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)

    # Now, with floating point array cases, the matching scores might not be 
    # exactly zeros, but would be very small numbers as compared to others.
    # So, for that use a very small to be used to threshold the scorees 
    # against and decide for matches.
    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.

    # Find the matching indices
    idx = np.where(S.ravel() < thresh)[0]

    # Get the range of those indices as final output
    if len(idx)>0:
        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
    else:
        return []         # No match found

样品运行

In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])

In [513]: seq = np.array([0,0])

In [514]: search_sequence_numpy(arr,seq)
Out[514]: array([1, 2, 3, 4, 8, 9])

In [515]: search_sequence_cv2(arr,seq)
Out[515]: array([1, 2, 3, 4, 8, 9])

测试

In [477]: arr = np.random.randint(0,9,(100000))
     ...: seq = np.array([3,6,8,4])
     ...: 

In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True

In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop

In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop

看起来基于纯NumPy的是最安全和最快的!

pod7payv

pod7payv2#

我已经使用Divakar的解决方案很长一段时间了,它工作得很好。非常感谢您!然而,几天前,我需要一些更快的某个项目。使用strides https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html可以节省大量内存,因为它会创建一个“假副本”,numexpr https://github.com/pydata/numexpr的速度大约是numpy的两倍,但即使没有numexpr,它也相当快

import numexpr
import numpy as np

def rolling_window(a, window):
    """
    Generate a rolling window view of a 1-dimensional NumPy array.

    Parameters:
    a (numpy.ndarray): The input array.
    window (int): The size of the rolling window.

    Returns:
    numpy.ndarray: A view of the input array with shape (N - window + 1, window), where N is the size of the input array.

    Example:
    >>> a = np.array([1, 2, 3, 4, 5])
    >>> windowed = rolling_window(a, 3)
    >>> print(windowed)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5]])
    """

    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def circular_rolling_window(a, window):
    """
    Generate a circular rolling window view of a 1-dimensional NumPy array.

    Parameters:
    a (numpy.ndarray): The input array.
    window (int): The size of the circular rolling window.

    Returns:
    numpy.ndarray: A view of the input array with shape (N, window), where N is the size of the input array, and the window wraps around at the boundaries.

    Example:
    >>> a = np.array([1, 2, 3, 4, 5])
    >>> circular_windowed = circular_rolling_window(a, 3)
    >>> print(circular_windowed)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 1],
           [5, 1, 2]])
    """

    pseudocircular = np.pad(a, pad_width=(0, window - 1), mode="wrap")
    return rolling_window(pseudocircular, window)

def find_sequence_in_array(sequence, array, numexpr_enabled=True):
    """
    Find occurrences of a sequence in a 1-dimensional NumPy array using a rolling window approach.

    Parameters:
    sequence (numpy.ndarray): The sequence to search for.
    array (numpy.ndarray): The input array to search within.
    numexpr_enabled (bool, optional): Whether to use NumExpr for efficient computation (default is True).

    Returns:
    numpy.ndarray: An array of indices where the sequence is found in the input array.

    Example:
    >>> arr = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
    >>> seq = np.array([3, 4, 5])
    >>> indices = find_sequence_in_array(seq, arr)
    >>> print(indices)
    [2 7]
    """

    a3 = circular_rolling_window(array, len(sequence))
    if numexpr_enabled:
        isseq = numexpr.evaluate(
            "a3==sequence", global_dict={}, local_dict={"a3": a3, "sequence": sequence}
        )
        su1 = numexpr.evaluate(
            "sum(isseq,1)", global_dict={}, local_dict={"isseq": isseq.astype(np.int8)}
        )
        wherelen = numexpr.evaluate(
            "(su1==l)", global_dict={}, local_dict={"su1": su1, "l": len(sequence)}
        )
    else:
        isseq = a3 == sequence
        su1 = np.sum(isseq, axis=1)
        wherelen = su1 == len(sequence)

    resu = np.nonzero(wherelen)
    return resu[0]
seq = np.array([3, 6, 8, 4])
arr = np.random.randint(0, 9, (100000,))
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=True)
1.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=False)
2.2 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a4 = search_sequence_numpy(arr=arr, seq=seq)
4.96 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
rekjcdws

rekjcdws3#

我发现最简洁、直观和通用的方法是使用正则表达式。

import re
import numpy as np

# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)

# Remove spaces and linebreaks that would come through when printing your vector
yourarray_string = re.sub('\n|\s','',np.array_str( yourarray ))[1:-1]

# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)

r = re.compile(r"[0]{1,}") 
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]

相关问题