scipy: Constructing a sparse matrix on disk on the fly in Python

Posted by cunj1qz1 on 2022-11-10 in Python


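I am currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I construct this matrix column by column while reading a corpus of 5M documents. For this purpose I use SciPy's sparse dok_matrix data structure. However, on reaching the 500,000th document, my memory is full (about 30 GB is in use) and the program crashes. What I eventually want to do is run a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I have looked into numpy.memmap, since sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get this to work.
It is impossible for me to save the entire matrix in a dense format, since that would require 40 TB of disk space. So I think HDF5 and PyTables are not an option for me (?).
My question now is: how can I construct a sparse matrix on the fly, writing directly to disk instead of memory, in such a way that I can use it in sklearn afterwards?
Thanks!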

scipy

Source: https://stackoverflow.com/questions/31031597/construct-sparse-matrix-on-disk-on-the-fly-in-python

4 Answers


j0pj023g1#

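We have run into similar issues in single-cell genomics when dealing with large sparse datasets on disk. I'll show a small, simple example of how I would handle this. My assumption is that you are very memory limited and probably cannot fit multiple copies of the sparse matrix into memory at once. This approach works even if you cannot fit one complete copy into memory.
I will construct a sparse CSC matrix column by column on disk. A sparse CSC matrix uses 3 underlying arrays: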

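data: the values stored in the matrix
indices: the row index of each value in the matrix
indptr: an array of length n_cols + 1, which partitions indices and data by the column they belong to.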

As an illustration, the values of column i are stored in data[indptr[i]:indptr[i+1]]. Similarly, the row indices of those values are found in indices[indptr[i]:indptr[i+1]].
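To make the layout concrete, here is a minimal sketch with a hand-built 3x3 matrix. The values and the helper column_entries are made up for illustration and are not part of the code that follows.

```python
import numpy as np
from scipy import sparse

# A 3x3 matrix with four non-zero values:
# [[1, 0, 0],
#  [0, 0, 3],
#  [2, 0, 4]]
data = np.array([1, 2, 3, 4], dtype=np.float32)  # non-zero values, column by column
indices = np.array([0, 2, 1, 2])                 # row index of each value
indptr = np.array([0, 2, 2, 4])                  # column i spans indptr[i]:indptr[i+1]

mtx = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))

def column_entries(i):
    """Return (row_indices, values) of column i straight from the raw arrays."""
    start, stop = indptr[i], indptr[i + 1]
    return indices[start:stop], data[start:stop]

rows, values = column_entries(2)
print(rows, values)   # row indices and values of the last column
print(mtx.toarray())
```

Note that the empty middle column costs nothing in data or indices; it only repeats a value in indptr.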
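To simulate your data-generating process (parsing a document, I assume), I'll define a function process_document which returns the indices and data values for one document.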

import numpy as np
import h5py
from scipy import sparse

from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()

Now I'll create a group in an hdf5 file which will store the constituent arrays of the sparse matrix.

def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int) # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g

And finally, a function for reading this group back as a sparse matrix (this one is pretty simple).

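def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    # Slice with [:] so each dataset is loaded into a NumPy array before it is handed to SciPy.
    return sparse.csc_matrix((g["data"][:], g["indices"][:], g["indptr"][:]), shape=tuple(g.attrs["shape"]))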

Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns here, since this can be a bit slow).

N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)

    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total

# Writing

with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(10), (2_000_000, 10))

# Reading

with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
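
If even the finished matrix does not fit in memory, one option is to load only a range of columns at a time. The following is a minimal sketch, not part of the original answer: read_columns is a hypothetical helper, and it assumes the three arrays support slicing the way h5py datasets and NumPy arrays do. Small in-memory arrays stand in for the HDF5 datasets so that the example runs on its own.

```python
import numpy as np
from scipy import sparse

def read_columns(data, indices, indptr, start, stop, n_rows):
    """Build a CSC matrix holding only columns start:stop.

    data, indices and indptr may be any objects that support slicing
    (for example h5py datasets); only the requested part is read.
    """
    ptr = np.asarray(indptr[start:stop + 1])
    lo, hi = int(ptr[0]), int(ptr[-1])
    return sparse.csc_matrix(
        # Shift the pointers so the first selected column starts at offset 0.
        (np.asarray(data[lo:hi]), np.asarray(indices[lo:hi]), ptr - lo),
        shape=(n_rows, stop - start),
    )

# In-memory arrays stand in for the on-disk datasets here.
data = np.array([1, 2, 3, 4], dtype=np.float32)
indices = np.array([0, 2, 1, 2])
indptr = np.array([0, 2, 2, 4])

block = read_columns(data, indices, indptr, 1, 3, n_rows=3)
print(block.toarray())
```

With the file written above, the same helper could presumably be called on the group's datasets, e.g. read_columns(g["data"], g["indices"], g["indptr"], 0, 5, n_rows=g.attrs["shape"][0]).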

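Again, this assumes a very memory-constrained situation, where you might not be able to fit the entire sparse matrix in memory while creating it. If you can handle the entire sparse matrix plus at least one copy, a much faster way is to skip the disk store altogether (similar to the other suggestions). However, the following slight modification of the code above should give better performance: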

def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total

    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)

    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))

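This should be fairly fast, since it copies the data only once, when the arrays are concatenated. The other solutions currently posted reallocate the arrays as they go, making many copies of large arrays.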


mbskvtky2#

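It would be great if you could provide minimal working code. I can't tell whether your matrix becomes too big because of the way it is constructed (1) or because you simply have too much data (2). If you don't really care about building this matrix yourself, you can skip straight to my remark 2.
For problem (1), in the example code below I wrote a wrapper class that builds a csr_matrix chunk by chunk. The idea is to append (row, column, data) tuples to lists until a buffer limit is reached, and only then update the matrix. Reaching the limit also reduces the data held in memory, because the csr_matrix constructor sums entries that share the same (row, column) tuple. This part only lets you construct the sparse matrix quickly (much faster than creating a sparse matrix for each row) and avoids memory errors caused by redundant (row, column) pairs when a word appears several times in a document.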

import numpy as np
import scipy.sparse

class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            self._flush()

    def _flush(self):
        # Nothing buffered: np.concatenate would raise ValueError on an empty list.
        if not self.data_temp:
            return
        # SciPy expects (data, (row_indices, col_indices)); entries that share
        # the same (row, column) pair are summed by the constructor.
        self.sparse_matrix += scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def get_matrix(self):
        # Merge whatever is still buffered before returning the matrix.
        self._flush()
        return self.sparse_matrix

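For problem (2), you can easily extend this class with a save method that stores the matrix on disk once the limit (or a second limit) is reached. That way you end up with multiple chunks of the sparse matrix on disk. You will then need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).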
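One way such a save step might look is sketched below, assuming each chunk holds a block of documents and is written with scipy.sparse.save_npz. The helpers save_chunk and iter_chunks and the file naming are hypothetical, not part of the class above.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

def save_chunk(matrix, directory, chunk_id):
    """Write one sparse chunk to <directory>/chunk_<id>.npz and return its path."""
    path = os.path.join(directory, f"chunk_{chunk_id:05d}.npz")
    scipy.sparse.save_npz(path, matrix.tocsr())
    return path

def iter_chunks(paths):
    """Load the chunks back one at a time, so only one is in memory at once."""
    for path in paths:
        yield scipy.sparse.load_npz(path)

with tempfile.TemporaryDirectory() as tmp:
    # Three small stand-in chunks; real chunks would come from the builder.
    chunks = [scipy.sparse.csr_matrix(np.eye(4, 6, k=i, dtype=np.float32))
              for i in range(3)]
    paths = [save_chunk(chunk, tmp, i) for i, chunk in enumerate(chunks)]
    total_nnz = sum(chunk.nnz for chunk in iter_chunks(paths))
    roundtrip_ok = all((a != b).nnz == 0
                       for a, b in zip(chunks, iter_chunks(paths)))
    print(total_nnz, roundtrip_ok)
```

After a chunk has been saved, the builder could reset its in-memory matrix to an empty csr_matrix of the same shape, so that memory use stays bounded by the chunk size.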
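Remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays in data_temp, col_indices_temp and row_indices_temp and compare it with the RAM available on the machine (which is easy to automate in Python).
Remark 2: gensim is a Python library whose big advantage is that it builds NLP models from chunked files. With it you can build a dictionary, construct a sparse matrix and reduce its dimensionality without needing much RAM.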
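The size check suggested in remark 1 could be sketched as follows. The helpers buffered_nbytes and should_flush and the byte budget are illustrative assumptions. Determining the RAM actually available is left out, since it would require a platform-specific call or a third-party package.

```python
import numpy as np

def buffered_nbytes(*array_lists):
    """Total number of bytes held by the NumPy arrays in the given lists."""
    return sum(arr.nbytes for arrays in array_lists for arr in arrays)

def should_flush(byte_budget, *array_lists):
    """True once the buffered arrays use at least byte_budget bytes."""
    return buffered_nbytes(*array_lists) >= byte_budget

# Stand-ins for the three buffers of the builder class.
data_temp = [np.zeros(1000, dtype=np.float32)]
row_indices_temp = [np.zeros(1000, dtype=np.int32)]
col_indices_temp = [np.zeros(1000, dtype=np.int32)]

used = buffered_nbytes(data_temp, row_indices_temp, col_indices_temp)
print(used, should_flush(10_000, data_temp, row_indices_temp, col_indices_temp))
```

In the builder, a check of this kind would replace the test on len(self.data_temp), so that the limit follows the amount of buffered data rather than the number of add calls.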


r1zk6ea13#

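I am assuming that all the data can fit in memory when a more memory-friendly sparse matrix format such as COO is used. If it cannot, there is pretty much no hope of proceeding with sklearn, even with mmap. Indeed, sklearn is likely to create subsequent objects whose memory requirements are of the same order of magnitude as the input.
SciPy's dok_matrix is actually a subclass of the vanilla dict. It stores the data using individual Python objects and lots of pointers, so it is not very memory-efficient. The most compact representation is coo_matrix. You can incrementally build the data needed to create a COO matrix by preallocating arrays for the coordinates (rows and cols) and the data, and growing these buffers later if the initial guess turns out to be too small.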

import numpy
from scipy.sparse import coo_matrix

def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))

def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data

You can test it with randomly generated data as follows:

def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good

coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny

coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)

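Once you have a COO matrix, you may want to convert it to CSR with coo.tocsr(). CSR matrices are better optimized for common operations such as dot products. Note that CSR can need more memory if some rows are empty, because it stores a pointer for every row, including the empty ones.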
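A small sketch of that memory effect, using made-up values: a matrix with a million rows but only three non-zero entries keeps three short arrays in COO, while CSR adds a row-pointer array with one entry per row plus one.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three non-zero values in a matrix with one million (mostly empty) rows.
rows = np.array([0, 0, 999_999])
cols = np.array([1, 3, 2])
vals = np.array([1.0, 2.0, 3.0], dtype=np.float32)

coo = coo_matrix((vals, (rows, cols)), shape=(1_000_000, 4))
csr = coo.tocsr()

# Bytes used by the arrays that back each format.
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(len(csr.indptr), coo_bytes, csr_bytes)
```

For a matrix with few empty rows the comparison tends to go the other way, since CSR stores one index per value instead of two.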


jqjz2hbq4#

Take a look here; at the end, the author explains how to store a sparse matrix directly in an HDF5 file and read it back.
