gensim: scipy2scipy_clipped may return a matrix whose shape differs from that of the input matrix

Posted by vdzxcuhz 4 months ago in Other

Description

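The scipy2scipy_clipped function may return a clipped matrix with a different shape than its input when the last column of the input matrix is not among the top-n most similar items of any row. This is especially likely during chunking in SparseMatrixSimilarity: each chunk holds only part of the similarity matrix, so most chunks never see the last column of the last row, which usually contains 1.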
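The behaviour described above is consistent with the output width being derived from the largest column index that survives clipping. The toy model below illustrates that mechanism in pure Python under that assumption. It is only a sketch: `clip_rows_topn` is a hypothetical helper, rows are modelled as plain dicts, and none of this is gensim's actual code.

```python
def clip_rows_topn(rows, topn):
    """Keep the topn largest-magnitude entries of each row.

    `rows` is a list of {column_index: value} dicts, a toy stand-in for a
    sparse matrix. Returns the clipped rows and the width inferred from the
    largest surviving column index (an assumption about how the shape is
    derived, not a copy of gensim's code).
    """
    clipped = []
    for row in rows:
        # Sort entries by absolute value, largest first, and keep topn.
        kept = sorted(row.items(), key=lambda kv: abs(kv[1]), reverse=True)[:topn]
        clipped.append(dict(kept))
    # Width is inferred from the data that is left, not from the input.
    max_col = max((col for row in clipped for col in row), default=-1)
    return clipped, max_col + 1


# Two rows of a 4-column matrix; column 3 holds only small values.
rows = [
    {0: 0.9, 1: 0.8, 3: 0.1},
    {1: 0.7, 2: 0.6, 3: 0.2},
]
clipped, inferred_width = clip_rows_topn(rows, topn=2)
print(inferred_width)  # 3, although the input had 4 columns
```

Because column 3 never makes it into any row's top 2, nothing in the clipped data records that it existed, and the inferred width drops from 4 to 3.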

Steps/code/corpus to reproduce

Example:

from scipy.sparse import random, vstack
from gensim.matutils import scipy2scipy_clipped
from sklearn.metrics.pairwise import cosine_similarity

# A random sparse matrix
X = random(1000, 2000, density=.2, format="csc")

# Its sparse (1000 x 1000) cosine similarity matrix
X_sim = cosine_similarity(X, dense_output=False)

# Split the similarity matrix row-wise to simulate chunking
X_sim_chunk1 = X_sim[:500, :]
X_sim_chunk2 = X_sim[500:, :]

# Ensure that no row in the first chunk is similar to the last item
X_sim_chunk1[:, -1] = 0

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)

X_clipped2 = scipy2scipy_clipped(X_sim_chunk2, 100)
print(X_clipped2.shape) # (500, 1000)

# Stacking the chunks back together fails because the column counts differ
vstack([X_clipped1, X_clipped2])
# ValueError: incompatible dimensions for axis 1

Expected results

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 1000)

Actual results

X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)
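Until the clipped matrix keeps the input's column count, one possible workaround is to widen each clipped chunk back to the expected width before stacking. The sketch below assumes the clipped result is a SciPy sparse matrix that is at most as wide as the original. `restore_width` is a hypothetical helper, not part of gensim.

```python
import numpy as np
from scipy.sparse import csr_matrix


def restore_width(clipped, n_cols):
    """Return `clipped` as a CSR matrix with exactly `n_cols` columns.

    `restore_width` is a hypothetical helper, not part of gensim. It only
    widens; it refuses to drop columns that could hold data.
    """
    clipped = csr_matrix(clipped)
    if clipped.shape[1] > n_cols:
        raise ValueError("clipped matrix is wider than the requested width")
    # Reuse the stored entries and state the shape explicitly.
    return csr_matrix(
        (clipped.data, clipped.indices, clipped.indptr),
        shape=(clipped.shape[0], n_cols),
    )


# A 2 x 3 matrix standing in for a clipped chunk that lost its last column.
narrow = csr_matrix(np.array([[0.9, 0.8, 0.0], [0.0, 0.7, 0.6]]))
fixed = restore_width(narrow, 4)
print(fixed.shape)  # (2, 4)
```

In the reproduction above, this would amount to calling `restore_width(X_clipped1, X_sim_chunk1.shape[1])` on each chunk before passing the results to `vstack`.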
