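Description
The scipy2scipy_clipped function can return a clipped matrix whose shape differs from the input when the item in the last column is not among the top-n most similar items of any row. This is especially likely during chunking in SparseMatrixSimilarity: each chunk holds only part of the similarity matrix, so it may not include the last column of the last row (which usually contains 1).
Steps/Code/Corpus to Reproduce
Example: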
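The shape change described above is consistent with a CSR matrix being rebuilt from its index arrays, in which case the column count follows the largest stored column index. Whether scipy2scipy_clipped does exactly this is an assumption here, not something the report states. A minimal sketch using only SciPy (the toy arrays are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two rows of an intended 2 x 4 matrix; nothing is stored in the last column (index 3).
data = np.array([0.9, 0.5, 0.7])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])

# Without an explicit shape, SciPy infers the dimensions from the index arrays.
inferred = csr_matrix((data, indices, indptr))
# Passing the intended shape keeps the empty trailing column.
explicit = csr_matrix((data, indices, indptr), shape=(2, 4))

print(inferred.shape)  # (2, 3)
print(explicit.shape)  # (2, 4)
```

If the clipped matrix is built this way, a chunk with no surviving entry in the last column would come out one or more columns narrower than its input.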
from scipy.sparse import random, vstack
from gensim.matutils import scipy2scipy_clipped
from sklearn.metrics.pairwise import cosine_similarity
# Some random sparse matrix
X = random(1000, 2000, density=.2, format="csc")
# Get its similarity matrix (1000 x 1000)
X_sim = cosine_similarity(X, dense_output=False)
# Split the similarity matrix to simulate chunking
X_sim_chunk1 = X_sim[:500, :]
X_sim_chunk2 = X_sim[500:, :]
# Ensure that in the first chunk no row is similar to the last item
X_sim_chunk1[:, -1] = 0
X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)
X_clipped2 = scipy2scipy_clipped(X_sim_chunk2, 100)
print(X_clipped2.shape) # (500, 1000)
# Recreating the full matrix then fails because the column counts differ
vstack([X_clipped1, X_clipped2])
# ValueError: incompatible dimensions for axis 1
Expected Results
X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 1000)
Actual Results
X_clipped1 = scipy2scipy_clipped(X_sim_chunk1, 100)
print(X_clipped1.shape) # (500, 999)
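Until the clipped matrix keeps the input's column count, one possible workaround is to restore the width of each clipped chunk before stacking. The sketch below uses only SciPy; `restore_width` is a hypothetical helper, not part of gensim, and the toy chunks stand in for the outputs of scipy2scipy_clipped:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def restore_width(clipped, n_cols):
    """Hypothetical helper: rebuild a CSR matrix with an explicit column count."""
    clipped = csr_matrix(clipped)
    if clipped.shape[1] > n_cols:
        raise ValueError("clipped matrix is wider than the requested width")
    # Reuse the stored arrays and only state the intended shape explicitly.
    return csr_matrix(
        (clipped.data, clipped.indices, clipped.indptr),
        shape=(clipped.shape[0], n_cols),
    )


# Toy stand-ins for two clipped chunks of a matrix that should have 4 columns.
chunk1 = csr_matrix(np.array([[0.9, 0.0, 0.5], [0.0, 0.7, 0.0]]))  # lost its last column
chunk2 = csr_matrix(np.array([[0.1, 0.0, 0.0, 1.0]]))  # full width

stacked = vstack([restore_width(chunk1, 4), restore_width(chunk2, 4)])
print(stacked.shape)  # (3, 4)
```

In the reproduction above, the equivalent call would pass the column count of the unclipped chunk, for example X_sim_chunk1.shape[1], as `n_cols`.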
1 Answer
Thanks for the report, @psorianom 👍