Python: why does calling the KFold generator with shuffle give the same indices?

Posted by hs1rzwqc on 2022-12-02 in Python


With sklearn, when you create a new KFold object with shuffle=True, it produces different, newly randomized fold indices. However, every generator obtained from a given KFold object yields the same indices for each fold, even when shuffle is true. Why does it work like this?
Example:

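# Note: sklearn.cross_validation is the old API (deprecated in 0.18, removed in 0.20)
import numpy as np
from sklearn.cross_validation import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle=True)

# First pass over the folds
for fold in kf:
    print(fold)

print('---second round----')

# Second pass over the same KFold object
for fold in kf:
    print(fold)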

Output:

(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
---second round----#same indices for the folds
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))

This question was motivated by a comment on this answer. I split it into a new question to keep that answer from becoming too long.

python

Source: https://stackoverflow.com/questions/34940465/why-does-calling-the-kfold-generator-with-shuffle-give-the-same-indices

2 Answers


y53ybaqx1#

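A new iteration over the same KFold object does not reshuffle the indices; shuffling happens only once, when the object is instantiated. KFold() never sees the data, but it knows the number of samples, so it uses that count to build and shuffle the indices. The relevant code in the constructor: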

if shuffle:
    rng = check_random_state(self.random_state)
    rng.shuffle(self.idxs)

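Each time a generator is created to iterate over the fold indices, it reuses the same shuffled indices and divides them in the same way.
Take a look at the code of KFold's base class, _PartitionIterator(with_metaclass(ABCMeta)), where __iter__ is defined. The __iter__ method in the base class calls _iter_test_indices in KFold to partition the indices and yield the train and test indices for each fold.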
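The behaviour described above can be illustrated with a small stand-in class. This is only a sketch under stated assumptions, not sklearn's actual code: `OldStyleKFold` is a hypothetical name, and it uses the standard-library `random` module in place of sklearn's `check_random_state`. The point it tries to show is that when the shuffle lives in `__init__`, every later pass over the object partitions the same stored indices.

```python
import random


class OldStyleKFold:
    """Hypothetical stand-in for the old KFold: shuffles once, in __init__."""

    def __init__(self, n, n_folds=2, shuffle=False, random_state=None):
        self.n = n
        self.n_folds = n_folds
        self.idxs = list(range(n))
        if shuffle:
            # The only place where the indices are shuffled.
            random.Random(random_state).shuffle(self.idxs)

    def __iter__(self):
        # Partition the stored (already shuffled) indices into n_folds chunks.
        fold_sizes = [self.n // self.n_folds] * self.n_folds
        for i in range(self.n % self.n_folds):
            fold_sizes[i] += 1
        start = 0
        for size in fold_sizes:
            test = sorted(self.idxs[start:start + size])
            train = sorted(self.idxs[:start] + self.idxs[start + size:])
            yield train, test
            start += size


kf = OldStyleKFold(4, n_folds=2, shuffle=True)
first_pass = list(kf)
second_pass = list(kf)
print(first_pass == second_pass)  # True: both passes reuse self.idxs
```

Creating a second `OldStyleKFold` object would run the shuffle again, which matches the observation in the question that only a new object gives newly randomized folds.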

Posted 2022-12-02

ctehm74n2#

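With newer versions of sklearn, where KFold is imported via from sklearn.model_selection import KFold, the generator returned by each call to split gives different indices when shuffle is enabled: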

import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=3, shuffle=True)

print('---first round----')
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    
print('---second round----')
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:

---first round----
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
---second round----
TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2 3] TEST: [1]

Alternatively, with a fixed random_state, the code below produces the same splits on every iteration:

import numpy as np
from sklearn.model_selection import KFold

np.random.seed(42)
data = np.random.choice([0, 1], 10, p=[0.5, 0.5])
# A fixed integer random_state makes the shuffle reproducible
kf = KFold(2, shuffle=True, random_state=2022)
list(kf.split(data))

Output:

[(array([0, 1, 6, 8, 9]), array([2, 3, 4, 5, 7])),
 (array([2, 3, 4, 5, 7]), array([0, 1, 6, 8, 9]))]
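To compare the two behaviours directly, one can materialise two passes of split and check whether they match. The sketch below assumes a recent sklearn with sklearn.model_selection; `splits_as_lists` is a helper name introduced here for illustration only. With an integer random_state the two passes should be identical; with random_state left unset, they will usually differ, so no output is claimed for that case.

```python
import numpy as np
from sklearn.model_selection import KFold


def splits_as_lists(kf, X):
    """Materialise one pass of kf.split(X) as plain lists for easy comparison."""
    return [(train.tolist(), test.tolist()) for train, test in kf.split(X)]


X = np.arange(20).reshape(10, 2)

# Integer seed: each call to split() restarts from the same RNG state.
seeded = KFold(n_splits=2, shuffle=True, random_state=2022)
first = splits_as_lists(seeded, X)
second = splits_as_lists(seeded, X)
print(first == second)  # True

# No seed: each call to split() draws a fresh shuffle, so passes usually differ.
unseeded = KFold(n_splits=2, shuffle=True)
print(splits_as_lists(unseeded, X) == splits_as_lists(unseeded, X))
```

If identical folds are needed across several passes without fixing a seed, storing `list(kf.split(X))` once and reusing that list is a simple option.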

Posted 2022-12-02
