numpy: Python contingency table

Posted by bq3bfh9z on 2023-10-19 in Python


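I'm generating many, many contingency tables as part of a project I'm writing.
The workflow is:

Take a large data array with continuous (float) rows and convert them to discrete integer values by binning (e.g. so that the resulting row has values 0-9).
Slice two rows into vectors X and Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution.
For example, I'd have a 10 x 10 array counting the number of (xi, yi) pairs that occur.
Use the contingency table to do some information theory math.

Initially, I wrote this as: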

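import numpy as np

def make_table(x, y, num_bins):
    # One counter per (x bin, y bin) cell
    ctable = np.zeros((num_bins, num_bins), dtype=int)
    # Pure-Python loop over every pair; this is the slow part
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable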

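This works fine, but it is so slow that it eats up about 90% of the runtime of the entire project.
The fastest Python-only optimization I've been able to come up with is this: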

def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
    reindex = np.dot(np.stack((x, y)).transpose(), 
                     np.array([num_bins, 1]))
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))

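That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in Cython?
Also, here's a benchmarking function.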

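import time

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        # Note: generating the random inputs is included in the timing
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    # time.time() returns seconds; convert the total to mean milliseconds per call
    ms_per_call = (end - start) / repeat * 1000
    print("Func {na}: {ti} ms per call".format(na=func.__name__, ti=ms_per_call))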


Source: https://stackoverflow.com/questions/51294382/python-contingency-table

2 Answers


wb1gzix0 #1

The neat trick of representing the elements of np.stack((x, y)) as single integers can be made faster:

In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [94]: %timeit bins*x + y
12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Moreover, the last part of your second solution can be simplified somewhat, by simply considering

np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

More importantly, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to

np.bincount(bins * x + y).reshape((bins, bins))
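One caveat, offered as a hedged note rather than as part of the original answer: the one-liner only reshapes cleanly when the largest flat index, bins * bins - 1, actually occurs in the data, because np.bincount otherwise returns a shorter array. The np.unique variant is stricter still, since it needs every cell to occur at least once. A minimal sketch that passes the minlength parameter of np.bincount and checks the result against the loop version; the names make_table_loop and make_table_bincount are made up for illustration:

```python
import numpy as np

def make_table_loop(x, y, num_bins):
    # Reference implementation: the slow loop from the question
    ctable = np.zeros((num_bins, num_bins), dtype=int)
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable

def make_table_bincount(x, y, num_bins):
    # Map each (x, y) pair to one flat index in 0 .. num_bins**2 - 1
    flat = num_bins * np.asarray(x) + np.asarray(y)
    # minlength pads with zeros so the reshape works even if high cells are empty
    counts = np.bincount(flat, minlength=num_bins ** 2)
    return counts.reshape((num_bins, num_bins))

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=5000)
y = rng.integers(0, 10, size=5000)
assert np.array_equal(make_table_loop(x, y, 10), make_table_bincount(x, y, 10))

# Sparse input: without minlength this reshape would raise a ValueError
small = make_table_bincount(np.array([0, 1]), np.array([2, 2]), 10)
```

With dense random data such as the benchmark inputs, an empty last cell is unlikely, which is probably why the shorter form works in the timings here.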

All in all, this gives quite a performance improvement over what you are currently doing:

In [78]: %timeit make_table(x, y, bins)  # Your first solution
3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [79]: %timeit make_table2(x, y, bins)  # Your second solution
443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))
307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))
30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

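You might also want to be aware of np.histogramdd, which takes care of both rounding and binning at once, although it is likely to be slower than rounding and then using np.bincount.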
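To make the comparison concrete, here is a small sketch of both routes starting from continuous data. It assumes values in [0, 1) and ten equal-width bins; the variable names and the choice of np.digitize for the manual binning step are assumptions for illustration, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
x_cont = rng.random(5000)  # floats in [0, 1)
y_cont = rng.random(5000)
bins = 10
edges = np.linspace(0.0, 1.0, bins + 1)  # 11 edges delimit 10 bins

# Route 1: np.histogramdd bins and counts in one call (returns float counts)
sample = np.column_stack((x_cont, y_cont))  # shape (N, 2), one row per point
hist, _ = np.histogramdd(sample, bins=(edges, edges))
table_hist = hist.astype(int)

# Route 2: bin manually with the interior edges, then count with np.bincount
x_idx = np.digitize(x_cont, edges[1:-1])  # integer bin labels 0 .. bins - 1
y_idx = np.digitize(y_cont, edges[1:-1])
table_bc = np.bincount(bins * x_idx + y_idx, minlength=bins * bins).reshape((bins, bins))

assert np.array_equal(table_hist, table_bc)
```

Which route is faster depends on the data size and NumPy version, so it is worth timing both on the real workload.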


qncylg1j #2

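This answer merely extends Fuglede's answer to the case where x and y may be discretized into different numbers of bins. It does assume that the categories of x and y are numbered 0, 1, ..., n_bins_x - 1 and 0, 1, ..., n_bins_y - 1, respectively.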

def make_table(x, y):
    n_bins_x = np.max(x) + 1
    n_bins_y = np.max(y) + 1
    tab = np.bincount(
        n_bins_y * x + y, minlength=n_bins_x * n_bins_y
    ).reshape((n_bins_x, n_bins_y))
    return tab
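Because the function above infers the bin counts from np.max, the table can come out smaller than intended when the highest category happens to be absent from the data. A hedged variant that takes the bin counts as explicit arguments; the name make_table_explicit and its parameters are hypothetical, not from the answer:

```python
import numpy as np

def make_table_explicit(x, y, n_bins_x, n_bins_y):
    x = np.asarray(x)
    y = np.asarray(y)
    # Guard against labels outside the declared ranges (assumes non-empty input)
    if x.min() < 0 or x.max() >= n_bins_x or y.min() < 0 or y.max() >= n_bins_y:
        raise ValueError("bin labels must lie in 0..n_bins_x-1 and 0..n_bins_y-1")
    # Row-major flat index: each x bin owns a block of n_bins_y cells
    flat = n_bins_y * x + y
    counts = np.bincount(flat, minlength=n_bins_x * n_bins_y)
    return counts.reshape((n_bins_x, n_bins_y))

x = np.array([0, 1, 1, 2])
y = np.array([0, 3, 3, 1])
# Bins 3 (for x) and 4 (for y) are empty here but still get a row and a column
table = make_table_explicit(x, y, n_bins_x=4, n_bins_y=5)
```

Fixing the shape up front also keeps tables from different row pairs comparable, which matters if they are later combined or compared cell by cell.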
