scipy: the copy keyword breaks NumPy's copy/view philosophy

Posted by blmhpbnm on 2022-11-23 in Other


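I noticed that none of the methods for converting between sparse matrix types appear to honor the copy kwarg they accept. Wherever the check applies, the resulting arrays always have a base set, which means they show up as views in my code.
Is this intended behavior?
For example, here are csr and csc arrays; as you can see, all of them have a base, no matter how they were created.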

In [1]: import numpy as np
   ...: from scipy import sparse
   ...: 
   ...: a = np.arange(20).reshape(4, 5)
   ...: csr = sparse.csr_array(a, copy=True)
   ...: print('csr.data.base', id(csr.data.base) if csr.data.base is not None else None)
   ...: 
   ...: csr_copy = csr.copy()
   ...: print('csr_copy.data.base', id(csr_copy.data.base) if csr_copy.data.base is not None else None)
   ...: 
   ...: csc_copy = csr.tocsc(copy=True)
   ...: print('csc_copy.data.base', id(csc_copy.data.base) if csc_copy.data.base is not None else None)
   ...: 
   ...: csc_copy_2 = csr.tocsc()
   ...: print('csc_copy_2.data.base', id(csc_copy_2.data.base) if csc_copy_2.data.base is not None else None)
csr.data.base 4392865488
csr_copy.data.base 4392866448
csc_copy.data.base 4392866640
csc_copy_2.data.base 4392867120

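While it makes sense for csr_copy to have a base, just as csr.data does, I don't understand why the data of all the other objects also has its base attribute set.
In particular, this behavior prevents the user from directly manipulating an array's data and indices attributes. For example, you cannot extend a csr matrix by appending a row of data using the resize method: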

In [2]: old_nnz = csr.nnz 
   ...: row = [1, 2, 3, 4, 5]  # Let's append row of 5 elements to csr
   ...: 
   ...: csr.resize(5, 5)
   ...: 
   ...: print(id(csr.data))
   ...: print(csr.data)
   ...: 
   ...: print(id(csr.data.base))
   ...: print(csr.data.base)
   ...: 
   ...: csr.data.resize((old_nnz + len(row),), refcheck=True)
4757413808
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
4757413520
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-34-c52e3457494e>", line 12, in <module>
    csr.data.resize((old_nnz + len(row),), refcheck=True)
ValueError: cannot resize this array: it does not own its data

Using np.resize might work, but I'm not sure it is appropriate:

In [3]: old_nnz = csr.nnz 
   ...: row = [1, 2, 3, 4, 5]  # Let's append row of 5 elements to csr
   ...: 
   ...: csr.resize(5, 5)
   ...: 
   ...: print('Data')
   ...: print(id(csr.data))
   ...: print(csr.data)
   ...: 
   ...: print("Data's Base")
   ...: print(id(csr.data.base))
   ...: print(csr.data.base)
   ...: 
   ...: print('New Data')
   ...: new_data = np.resize(csr.data, (old_nnz + len(row),))
   ...: print(id(new_data))
   ...: print(new_data)
   ...: 
   ...: print("New Data's Base")
   ...: print(id(new_data.base))
   ...: print(new_data.base)
   ...:
   ...: new_indices = np.resize(csr.indices, (old_nnz + len(row),))

Data
5256251600
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Data's Base
5256250736
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
New Data
5256250928
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19  1  2  3  4  5]
New Data's Base
5256253040
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19  1  2  3  4  5
  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
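
For reference, ndarray.resize and np.resize behave differently, which may explain the output above. The sketch below uses a small hypothetical array (owner and view are illustrative names, not taken from scipy). The in-place method refuses to resize an array that does not own its data, while np.resize returns a new array padded with repeated copies of its input. That suggests the trailing 1 2 3 4 5 in New Data are recycled values from the start of csr.data, not the contents of row.

```python
import numpy as np

owner = np.arange(1, 7)   # owns its buffer
view = owner[:]           # a view: view.base is owner

# In-place resize is refused for arrays that do not own their data.
try:
    view.resize((8,), refcheck=False)
    inplace_error = None
except ValueError as exc:
    inplace_error = str(exc)

# np.resize returns a NEW array, repeating the input to fill the extra slots.
grown = np.resize(view, (8,))

print(inplace_error)
print(grown)
print(np.shares_memory(grown, owner))
```

Because np.resize returns a separate array, assigning its result to a local variable leaves the csr object itself unchanged.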

I have been reading the source code of these functions, and some of them do not seem to use copy at all.

def tocsc(self, copy=False):
    idx_dtype = get_index_dtype((self.indptr, self.indices),
                                maxval=max(self.nnz, self.shape[0]))
    indptr = np.empty(self.shape[1] + 1, dtype=idx_dtype)
    indices = np.empty(self.nnz, dtype=idx_dtype)
    data = np.empty(self.nnz, dtype=upcast(self.dtype))

    csr_tocsc(self.shape[0], self.shape[1],
        self.indptr.astype(idx_dtype),
        self.indices.astype(idx_dtype),
        self.data,
        indptr,
        indices,
        data)

    A = self._csc_container((data, indices, indptr), shape=self.shape)
    A.has_sorted_indices = True
    return A

Even though I can see that a new array (data) is created, somewhere, perhaps in the C/Python interface, it ends up with a base set.

scipy

Source: https://stackoverflow.com/questions/74542785/copy-keyword-breaks-numpys-copy-view-philosophy

1 Answer


v1uwarro1#

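I only have scipy v1.7.3, so I don't have access to the major rewrite of the sparse module in 1.8 (e.g. no csr_array or _data.py file).
Whether an array has a base is not a reliable measure of whether a copy was made.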

In [74]: a = np.arange(20).reshape(4, 5)
    ...: csr = sparse.csr_matrix(a, copy=True)

In [75]: a
Out[75]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [76]: a.base
Out[76]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

a is a view of the 1d array produced by arange. That array is not accessible in any other way, only as the base.
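
As a minimal sketch of what base records (the names flat and independent are illustrative only): reshaping a contiguous array gives a view whose base is the original, while an explicit copy owns its data and has no base.

```python
import numpy as np

flat = np.arange(20)
a = flat.reshape(4, 5)      # reshape of a contiguous array returns a view
independent = a.copy()      # an explicit copy owns its own buffer

print(a.base is flat)             # True
print(a.flags.owndata)            # False
print(independent.base is None)   # True
print(np.shares_memory(a, flat))  # True
```

So a non-None base only says that the array is a view of some other array. It does not say that the other array is one you still hold a reference to.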

In [77]: csr
Out[77]: 
<4x5 sparse matrix of type '<class 'numpy.intc'>'
    with 19 stored elements in Compressed Sparse Row format>

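The data attribute has a base that looks the same as the array itself. The ids are different. But we would have to study the code to see how data is derived from its base.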

In [78]: csr.data
Out[78]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19], dtype=int32)

In [79]: csr.data.base
Out[79]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19], dtype=int32)

It is not a view of a or a.base, which we can prove by modifying an element.

In [82]: csr.data[0] = 100

In [83]: csr.A
Out[83]: 
array([[  0, 100,   2,   3,   4],
       [  5,   6,   7,   8,   9],
       [ 10,  11,  12,  13,  14],
       [ 15,  16,  17,  18,  19]], dtype=int32)

In [84]: a
Out[84]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

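The copy parameter makes the most sense when the format stays the same. Changing formats may involve reordering the data (csr to csc), summing duplicates (coo to csr), and so on.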
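
A small sketch of the reordering point, using the same 4x5 example: the stored values of the csc form are in column-major order, so they cannot simply reuse the csr buffer. np.shares_memory is used here as a more direct check than base.

```python
import numpy as np
from scipy import sparse

a = np.arange(20).reshape(4, 5)
csr = sparse.csr_matrix(a)
csc = csr.tocsc()

print(csr.data[:4])   # nonzeros in row-major order: 1 2 3 4
print(csc.data[:4])   # nonzeros in column-major order: 5 10 15 1
print(np.shares_memory(csr.data, csc.data))  # False
```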
Let's try making a new csr:

In [87]: csr1 = sparse.csr_matrix(csr, copy=False)
In [88]: csr2 = sparse.csr_matrix(csr, copy=True)

Like csr, both of these have a data.base and different ids, but if I modify an element of csr, the change shows up only in csr1. csr2 really is a copy.

In [93]: csr.data[1] = 200
In [97]: csr1.data[1]
Out[97]: 200
In [98]: csr2.data[1]
Out[98]: 2
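
The same check can be written without modifying anything, as a sketch using np.shares_memory (this is an alternative to the element-assignment test, not something the scipy docs prescribe):

```python
import numpy as np
from scipy import sparse

a = np.arange(20).reshape(4, 5)
csr = sparse.csr_matrix(a, copy=True)

csr1 = sparse.csr_matrix(csr, copy=False)  # same format, no copy requested
csr2 = sparse.csr_matrix(csr, copy=True)   # same format, copy requested

print(np.shares_memory(csr.data, csr1.data))  # True: csr1 reuses csr's buffer
print(np.shares_memory(csr.data, csr2.data))  # False: csr2 has its own buffer
```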

Resize

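I haven't used resize for sparse matrices before, and have rarely used it for numpy arrays.
csr.resize(5,5) appears to change only indptr (and shape), without changing data or indices.
csr.resize(5,6) appears to change only shape; I don't see any change in the main attributes. Neither call adds nonzero values, so padding with 0s doesn't change much.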
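
A quick way to see this, as a sketch on the same 4x5 example (data_before and indptr_before are just snapshot names): growing by one row extends indptr by one entry and leaves the stored values alone.

```python
import numpy as np
from scipy import sparse

csr = sparse.csr_matrix(np.arange(20).reshape(4, 5))
data_before = csr.data.copy()
indptr_before = csr.indptr.copy()

csr.resize(5, 5)   # add one empty row

print(csr.shape)
print(indptr_before, csr.indptr)   # the new row gets an empty indptr range
print(np.array_equal(csr.data, data_before))
```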
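You don't want to do csr.data.resize(...). A change like that would also require changing indices and indptr (to keep the csr consistent). data can contain 0s, but those should be cleaned up by calling eliminate_zeros.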
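
If the goal is to append a row, one possible approach (my suggestion, not something from the question) is to build a new matrix instead of resizing data in place. The sketch below shows two equivalent routes: sparse.vstack, and a manual build that extends data, indices and indptr together. The names stacked, manual and cols are illustrative.

```python
import numpy as np
from scipy import sparse

csr = sparse.csr_matrix(np.arange(20).reshape(4, 5))
row = np.array([1, 2, 3, 4, 5])

# Route 1: let scipy stack a 1x5 sparse row under the existing matrix.
stacked = sparse.vstack([csr, sparse.csr_matrix(row.reshape(1, -1))], format="csr")

# Route 2: extend all three csr arrays consistently and build a new matrix.
cols = np.flatnonzero(row)                       # column indices of the nonzeros
data = np.concatenate([csr.data, row[cols]])
indices = np.concatenate([csr.indices, cols])
indptr = np.append(csr.indptr, csr.indptr[-1] + len(cols))
manual = sparse.csr_matrix((data, indices, indptr),
                           shape=(csr.shape[0] + 1, csr.shape[1]))

print(stacked.toarray()[-1])
print(np.array_equal(stacked.toarray(), manual.toarray()))
```

Both routes allocate new arrays, so neither depends on whether csr.data owns its buffer.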

Answered 2022-11-23
