Why does the pandas DataFrame version of a sparse matrix not work with imblearn's RandomOverSampler, when the documentation says it accepts both?

Posted by toe95027 on 2022-11-10 in Other

I spent a painful evening debugging this:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(pd.DataFrame.sparse.from_spmatrix(x_trainvec), y_train)   

print(x_trainvec_rand)

Here x_trainvec is a CSR sparse matrix and y_train is a pandas DataFrame, with shapes (75060 x 52651) and (75060 x 1) respectively. The error is "ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)".
On a whim I decided to try

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(x_trainvec, y_train)   

print(x_trainvec_rand)

and somehow it worked.
Do you know why?
The documentation says:

fit_resample(X, y)
Resample the dataset.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)
Corresponding label for each sample in X.

Source: https://stackoverflow.com/questions/73431942/why-does-the-pandas-dataframe-version-of-a-sparse-matrix-not-work-with-randomove

1 Answer

uujelgoq1#

The documentation says it accepts

X : {array-like, dataframe, sparse matrix}

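That is a sparse matrix, not a sparse dataframe. In the imbalanced-learn source code I found tests showing that the sparse type has to be csr or csc, but I couldn't follow the processing any further.
But let's look at pandas sparse.
A sparse matrix: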

In [105]: M = sparse.csr_matrix(np.eye(3))
In [106]: M
Out[106]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [107]: print(M)
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0

The derived dataframe:

In [108]: df = pd.DataFrame.sparse.from_spmatrix(M)
In [109]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype             
---  ------  --------------  -----             
 0   0       3 non-null      Sparse[float64, 0]
 1   1       3 non-null      Sparse[float64, 0]
 2   2       3 non-null      Sparse[float64, 0]
dtypes: Sparse[float64, 0](3)
memory usage: 164.0 bytes
In [110]: df[1]
Out[110]: 
0    0.0
1    1.0
2    0.0
Name: 1, dtype: Sparse[float64, 0]
In [111]: df[1].values
Out[111]: 
[0, 1.0, 0]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)

Sparse dataframe storage is completely different from sparse matrix storage. It is not a simple merger of the two classes.
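To make the difference concrete, here is a minimal sketch that reuses the 3x3 identity example from above. It only uses calls that exist in numpy, scipy and pandas; the variable name `columns` is just for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)

# A CSR matrix is a single object: three flat arrays cover every row.
print(M.data, M.indices, M.indptr)

# A sparse dataframe is an ordinary DataFrame whose columns each hold
# their own SparseArray, with separate stored values and fill value.
columns = [df[name].array for name in df.columns]
for arr in columns:
    print(type(arr).__name__, arr.sp_values, arr.fill_value)

# scipy does not recognise the dataframe as a sparse matrix.
print(sparse.issparse(M), sparse.issparse(df))
```

So code that branches on scipy's sparse check will treat the dataframe as a regular (dense-style) dataframe rather than as a sparse matrix.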
I probably should have insisted on seeing the FULL traceback of the error,

ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)

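since it might at least give us/you an idea of what it was trying to do. On the other hand, it is enough to focus on what the documentation actually says, rather than what you want it to say.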
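If the data already lives in a sparse dataframe, one possible workaround is to convert it back to a scipy matrix before resampling. The sketch below is only an illustration under that assumption: `as_csr` is a hypothetical helper name, not part of imbalanced-learn or pandas, and it relies on the pandas `DataFrame.sparse.to_coo()` accessor.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def as_csr(X):
    """Hypothetical helper: return X as a scipy CSR matrix.

    Accepts a scipy sparse matrix or a DataFrame whose columns are
    all pandas sparse dtypes; anything else raises TypeError.
    """
    if sparse.issparse(X):
        return X.tocsr()
    if isinstance(X, pd.DataFrame) and all(
        isinstance(dtype, pd.SparseDtype) for dtype in X.dtypes
    ):
        # The .sparse accessor yields a COO matrix; convert it to CSR.
        return X.sparse.to_coo().tocsr()
    raise TypeError("expected a scipy sparse matrix or an all-sparse DataFrame")


M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)
X_csr = as_csr(df)
print(type(X_csr).__name__, X_csr.shape)
```

The resulting matrix can then be passed to `RandomOverSampler(random_state=0).fit_resample(...)` in place of the dataframe, which matches the second snippet in the question that worked.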
